Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning

doi:10.48550/arXiv.2305.18403

Journal Article•10.48550/arXiv.2305.18403

Mingyang Zhang, +6 more

- 28 May 2023

- arXiv.org

- Vol. abs/2305.18403

14

TL;DR: This paper proposes LoRAPrune, a unified framework for efficient fine-tuning and deployment of large pre-trained models. Its pruning criterion estimates importance from the values and gradients of Low-Rank Adaption (LoRA) rather than from the gradients of the pre-trained parameters.

Abstract: Large pre-trained models (LPMs), such as LLaMA and ViT-G, have shown exceptional performance across various tasks. Although parameter-efficient fine-tuning (PEFT) has emerged to cheaply fine-tune these large models on downstream tasks, their deployment is still hindered by the vast model scale and computational costs. Neural network pruning offers a solution for model compression by removing redundant parameters, but most existing methods rely on computing parameter gradients. However, obtaining the gradients is computationally prohibitive for LPMs, which necessitates the exploration of alternative approaches. To this end, we propose a unified framework for efficient fine-tuning and deployment of LPMs, termed LoRAPrune. We first design a PEFT-aware pruning criterion, which utilizes the values and gradients of Low-Rank Adaption (LoRA), rather than the gradients of pre-trained parameters for importance estimation. We then propose an iterative pruning procedure to remove redundant parameters while maximizing the advantages of PEFT. Thus, our LoRAPrune delivers an accurate, compact model for efficient inference in a highly cost-effective manner. Experimental results on various tasks demonstrate that our method achieves state-of-the-art results. For instance, in the VTAB-1k benchmark, LoRAPrune utilizes only 0.76% of the trainable parameters and outperforms magnitude and movement pruning methods by a significant margin, achieving a mean Top-1 accuracy that is 5.7% and 4.3% higher, respectively. Moreover, our approach achieves comparable performance to PEFT methods, highlighting its efficacy in delivering high-quality results while benefiting from the advantages of pruning.
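
The abstract describes a criterion that scores weights from LoRA values and gradients instead of gradients of the pre-trained parameters. The pure-Python sketch below illustrates that idea under stated assumptions. The function names (`lora_importance`, `top_rows`) and the product-rule surrogate `grad_B @ A + B @ grad_A` are illustrative choices, not formulas taken from the text above, and the paper's exact criterion may differ.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def lora_importance(W, B, A, grad_B, grad_A):
    """Score every entry of the merged weight W + B @ A.

    Assumption (hypothetical surrogate): the gradient of the frozen
    weight W is never computed. It is replaced by the product-rule
    term grad_B @ A + B @ grad_A, which needs only LoRA gradients.
    The score is |merged weight * surrogate gradient|, entry by entry.
    """
    merged = add(W, matmul(B, A))
    surrogate = add(matmul(grad_B, A), matmul(B, grad_A))
    return [[abs(w * g) for w, g in zip(rw, rg)]
            for rw, rg in zip(merged, surrogate)]


def top_rows(scores, keep):
    """Return the sorted indices of the `keep` rows with the largest total score."""
    totals = [sum(row) for row in scores]
    order = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    return sorted(order[:keep])
```

Only `B`, `A`, and their gradients enter the surrogate, so memory for gradients scales with the LoRA rank rather than with the size of `W`. Rows that are not returned by `top_rows` would be candidates for removal in one step of an iterative pruning loop.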

Figures

Table 1: Memory costs for pruning LLaMA-65B. "#GPU" indicates the number of NVIDIA A100 (80G) GPUs required.

Table 8: Generated examples from the pruned models.

Figure 1: Comparing LoRAPrune with other pruning methods: (a) Unstructured sparse model cannot directly merge LoRA weights, which is computationally inefficient. (b) Gradient-guided pruning requires the gradients of the pre-trained weights, which is memory-intensive. (c) LoRAPrune only needs the gradients of LoRA weights and can seamlessly merge LoRA weights into pre-trained weights, which is efficient in both memory and computation.

Table 3: Runtime results of the structured pruned LPMs.

Table 2: Zero-shot performance of the compressed LLaMA models. The average is calculated among seven classification datasets. Bold/Underline denotes the best performance at the same compression rate with/without fine-tuning, respectively. ⋆ denotes the results obtained by our reproduction.

Figure 5: More ablation studies for pruning hyper-parameters: (a) λ value in moving average, (b) fine-tuning iterations.
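
The Figure 1 caption contrasts unstructured sparsity, where LoRA weights cannot be merged directly, with pruning that allows a seamless merge. The sketch below illustrates why, assuming row-wise structured pruning applied to both the frozen weight `W` and the LoRA factor `B`. The helper names (`merge`, `mask_entries`, `zero_rows`, `count_zeros`) are hypothetical and exist only for this illustration.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def merge(W, B, A):
    """Fold the LoRA update into the base weight: W + B @ A."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, BA)]


def mask_entries(W, mask):
    """Unstructured pruning: zero individual entries where mask is 0."""
    return [[w * m for w, m in zip(rw, rm)] for rw, rm in zip(W, mask)]


def zero_rows(M, rows):
    """Structured pruning: zero whole rows (output channels) of M."""
    return [[0 for _ in r] if i in rows else list(r) for i, r in enumerate(M)]


def count_zeros(M):
    """Number of exactly-zero entries in M."""
    return sum(1 for row in M for v in row if v == 0)


W = [[1, 2], [3, 4]]
B = [[1], [1]]   # LoRA factor with rank 1
A = [[1, 1]]

# Unstructured: W is sparse, but adding the dense B @ A fills the zeros back in.
W_unstructured = mask_entries(W, [[1, 0], [0, 1]])
merged_unstructured = merge(W_unstructured, B, A)

# Structured: pruning row 1 of both W and B keeps that row zero after merging.
merged_structured = merge(zero_rows(W, {1}), zero_rows(B, {1}), A)
```

In this toy case `merged_unstructured` has no zero entries left, while row 1 of `merged_structured` stays entirely zero, so the pruned row could be dropped from the merged matrix.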

Citations

Journal Article•10.48550/arxiv.2404.14294

A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, +14 more

- 22 Apr 2024

- arXiv.org

TL;DR: A survey on efficient inference for large language models covering model, data, and system-level optimization techniques.

32

Journal Article•10.48550/arxiv.2311.09550

A Speed Odyssey for Deployable Quantization of LLMs

Qingyuan Li, +8 more

- 16 Nov 2023

- arXiv.org

TL;DR: The paper argues that a hardware-centric approach to constructing quantization algorithms is crucial. Its OdysseyLLM method builds compression on top of hardware awareness, eliminating impractical algorithm choices while maximizing the benefit of hardware acceleration.

4

Journal Article•10.48550/arxiv.2404.04748

Multilingual Brain Surgeon: Large Language Models Can Be Compressed Leaving No Language behind

Hongchuan Zeng, +3 more

- 06 Apr 2024

TL;DR: Multilingual Brain Surgeon (MBS) introduces a novel calibration data sampling method for multilingual Large Language Model (LLM) compression that overcomes the English-centric limitations of existing methods and improves performance for low-resource languages.

1

Preprint•10.48550/arxiv.2405.12250

Your Transformer is Secretly Linear

Anton Razzhigaev, +6 more

- 19 May 2024

TL;DR: The transformer decoder exhibits a novel linear characteristic, characterized by a near-perfect linear relationship between embedding transformations between sequential layers. However, linearity decreases when the residual component is removed. The study suggests that transformers may be more linear than previously thought, and introduces a cosine-similarity-based regularization to reduce layer linearity.

Journal Article•10.48550/arxiv.2310.04564

ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

Iman Mirzadeh, +7 more

- 06 Oct 2023

- arXiv.org

TL;DR: It is demonstrated that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer during inference, which is particularly valuable during the memory-bound inference step, where efficiency is paramount.

References

•Posted Content

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, +11 more

- 22 Oct 2020

- arXiv: Computer Vision and Pattern Recog...

TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

36.9K

•Proceedings Article

Language Models are Few-Shot Learners

Tom B. Brown, +30 more

- 28 May 2020

TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

25.2K

•Report•10.21236/ADA273556

Building a Large Annotated Corpus of English: The Penn Treebank

Mitchell Marcus, +2 more

- 01 Jun 1993

- Computational Linguistics

TL;DR: As a result of this grant, the researchers have now published on CDROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.

9.2K

Journal Article•10.48550/arXiv.2302.13971

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, +13 more

- 27 Feb 2023

- arXiv.org

TL;DR: This article introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters and trained on trillions of tokens, and shows that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.

6.6K

The Caltech-UCSD Birds-200-2011 Dataset

Catherine Wah, +4 more

- 01 Jul 2011

TL;DR: CUB-200-2011 is an extended version of CUB-200 that roughly doubles the number of images per category and adds new part localization annotations. Images are annotated with bounding boxes, part locations, and attribute labels.

5.6K
