Journal Article
DOI: 10.48550/arXiv.2305.18403
Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
TL;DR: This paper proposes LoRAPrune, a unified framework for efficient fine-tuning and deployment of large pre-trained models. Its pruning criterion estimates parameter importance from the values and gradients of Low-Rank Adaptation (LoRA) rather than from the gradients of the pre-trained parameters.
Abstract: Large pre-trained models (LPMs), such as LLaMA and ViT-G, have shown exceptional performance across various tasks. Although parameter-efficient fine-tuning (PEFT) has emerged to cheaply fine-tune these large models on downstream tasks, their deployment is still hindered by the vast model scale and computational costs. Neural network pruning offers a solution for model compression by removing redundant parameters, but most existing methods rely on computing parameter gradients. However, obtaining the gradients is computationally prohibitive for LPMs, which necessitates the exploration of alternative approaches. To this end, we propose a unified framework for efficient fine-tuning and deployment of LPMs, termed LoRAPrune. We first design a PEFT-aware pruning criterion, which utilizes the values and gradients of Low-Rank Adaptation (LoRA), rather than the gradients of pre-trained parameters for importance estimation. We then propose an iterative pruning procedure to remove redundant parameters while maximizing the advantages of PEFT. Thus, our LoRAPrune delivers an accurate, compact model for efficient inference in a highly cost-effective manner. Experimental results on various tasks demonstrate that our method achieves state-of-the-art results. For instance, in the VTAB-1k benchmark, LoRAPrune utilizes only 0.76% of the trainable parameters and outperforms magnitude and movement pruning methods by a significant margin, achieving a mean Top-1 accuracy that is 5.7% and 4.3% higher, respectively. Moreover, our approach achieves comparable performance to PEFT methods, highlighting its efficacy in delivering high-quality results while benefiting from the advantages of pruning.
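The pruning criterion described in the abstract can be illustrated with a small sketch. The abstract does not state the formula, so the code below makes two assumptions: importance is a first-order, Taylor-style score |W ⊙ G| on the merged weight W + BA, and the gradient G is approximated from the LoRA factors B and A and their gradients by looking at how the product BA changes after one unit-size gradient step, giving G ≈ ∇B·A + B·∇A − ∇B·∇A. All function names and the toy matrices are hypothetical and are not the authors' implementation.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y, sign=1.0):
    """Element-wise x + sign * y."""
    return [[a + sign * b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def estimate_weight_grad(B, A, grad_B, grad_A):
    # Assumed approximation: G ~ grad_B @ A + B @ grad_A - grad_B @ grad_A.
    # Only LoRA gradients are needed; no gradient of W is ever stored.
    first_order = add(matmul(grad_B, A), matmul(B, grad_A))
    return add(first_order, matmul(grad_B, grad_A), sign=-1.0)


def lora_guided_importance(W, B, A, grad_B, grad_A):
    # Assumed Taylor-style score: |(W + B @ A) * G|, element-wise.
    merged = add(W, matmul(B, A))
    G = estimate_weight_grad(B, A, grad_B, grad_A)
    return [[abs(w * g) for w, g in zip(rw, rg)] for rw, rg in zip(merged, G)]


def row_scores(importance):
    """Aggregate element scores per output row (a structured group)."""
    return [sum(row) for row in importance]


# Toy example: a 3x2 weight with a rank-1 LoRA update (values are made up).
W = [[1.0, -2.0], [0.5, 0.5], [3.0, 1.0]]
B = [[0.0], [1.0], [2.0]]
A = [[1.0, -1.0]]
grad_B = [[1.0], [0.0], [-1.0]]
grad_A = [[0.5, 0.0]]

scores = row_scores(lora_guided_importance(W, B, A, grad_B, grad_A))
least_important_row = scores.index(min(scores))
print(scores)               # [2.5, 0.75, 3.5]
print(least_important_row)  # 1
```

In this toy case row 1 has the lowest aggregate score and would be the first candidate for removal. The memory argument from the abstract shows up in the signature: the score uses `grad_B` and `grad_A`, which are small, instead of a full gradient for `W`.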
Figures

Table 1: Memory costs for pruning LLaMA-65B. "#GPU" indicates the number of NVIDIA A100 (80G) GPUs required.
Table 8: Generated examples from the pruned models.
Figure 1: Comparing LoRAPrune with other pruning methods: (a) An unstructured sparse model cannot directly merge LoRA weights, which is computationally inefficient. (b) Gradient-guided pruning requires the gradients of the pre-trained weights, which is memory-intensive. (c) LoRAPrune only needs the gradients of LoRA weights and can seamlessly merge LoRA weights into pre-trained weights, which is efficient in both memory and computation.
Table 3: Runtime results of the structured pruned LPMs. 
Table 2: Zero-shot performance of the compressed LLaMA models. The average is calculated among seven classification datasets. Bold/Underline denotes the best performance at the same compression rate with/without fine-tuning, respectively. ⋆ denotes the results obtained by our reproduction. 
Figure 5: More ablation studies for pruning hyper-parameters: (a) λ value in moving average, (b) fine-tuning iterations.
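The merging point made in the Figure 1 caption can be shown on a toy example. This is a minimal sketch, assuming the merge is the usual W + BA and that pruning is represented by a binary mask; the helper names and all values are illustrative and not taken from the paper. With an element-wise (unstructured) mask, the dense product BA fills the pruned positions back in when merged. With a row-wise (structured) mask, the same rows of B are removed as well, so the merged matrix keeps its zero rows.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge(W, B, A):
    """Fold the low-rank update into the weight: W + B @ A."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, BA)]


def count_zeros(M):
    return sum(v == 0.0 for row in M for v in row)


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0], [1.0], [1.0]]
A = [[0.5, 0.5]]

# Unstructured: individual entries of W are zeroed, B and A stay dense.
element_mask = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_unstructured = [[w * m for w, m in zip(rw, rm)] for rw, rm in zip(W, element_mask)]
merged_unstructured = merge(W_unstructured, B, A)

# Structured: whole output row 1 is removed from both W and B.
keep_row = [1.0, 0.0, 1.0]
W_structured = [[w * k for w in row] for row, k in zip(W, keep_row)]
B_structured = [[b * k for b in row] for row, k in zip(B, keep_row)]
merged_structured = merge(W_structured, B_structured, A)

print(count_zeros(W_unstructured), count_zeros(merged_unstructured))  # 2 0
print(count_zeros(W_structured), count_zeros(merged_structured))      # 2 2
```

After merging, the unstructured case has lost both of its zeros, so the LoRA branch would have to be kept separate or the mask re-applied. The structured case still has a fully zero row that can be physically dropped.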
Citations
A Survey on Efficient Inference for Large Language Models
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang
TL;DR: A survey on efficient inference for large language models covering model, data, and system-level optimization techniques.
A Speed Odyssey for Deployable Quantization of LLMs
Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Liang Li, Yifan Lu, Xiangxiang Chu, Yerui Sun, Yuchen Xie
TL;DR: The paper argues that a hardware-centric approach is crucial when designing quantization algorithms. Its OdysseyLLM method builds compression on hardware awareness, eliminating impractical algorithm choices while maximizing the benefit of hardware acceleration.
Multilingual Brain Surgeon: Large Language Models Can Be Compressed Leaving No Language behind
Hongchuan Zeng, Hongshen Xu, Lu Chen, Kai Yu
- 06 Apr 2024
TL;DR: Multilingual Brain Surgeon (MBS) introduces a novel calibration data sampling method for multilingual Large Language Model (LLM) compression that overcomes the English-centric limitations of existing methods and improves performance for low-resource languages.
Your Transformer is Secretly Linear
Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Natalia A. Gerasimenko, Ivan Oseledets, Denis Dimitrov, А. В. Кузнецов
- 19 May 2024
TL;DR: The transformer decoder exhibits a novel linear characteristic, characterized by a near-perfect linear relationship between embedding transformations between sequential layers. However, linearity decreases when the residual component is removed. The study suggests that transformers may be more linear than previously thought, and introduces a cosine-similarity-based regularization to reduce layer linearity.
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Iman Mirzadeh, Keivan Alizadeh-Vahid, Sachin Mehta, C. C. D. Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Reza Rastegari, Mehrdad Farajtabar
TL;DR: It is demonstrated that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer during inference, which is particularly valuable during the memory-bound inference step, where efficiency is paramount.
References
Posted Content
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Proceedings Article
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Thomas Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Samuel McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
- 28 May 2020
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Building a Large Annotated Corpus of English: The Penn Treebank
TL;DR: As a result of this grant, the researchers have published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Roziere, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
TL;DR: This paper introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters and trained on trillions of tokens. It shows that state-of-the-art models can be trained using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.
The Caltech-UCSD Birds-200-2011 Dataset
Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, Serge Belongie
- 01 Jul 2011
TL;DR: CUB-200-2011 is an extended version of CUB-200 that roughly doubles the number of images per category and adds new part localization annotations. Images are annotated with bounding boxes, part locations, and attribute labels.