Parameter Efficient Fine-Tuning (PEFT)
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
Overview
Related fields
Introduction
- Various PEFT methods have been developed, such as:
- Task-Guided Prompt Tuning: This technique utilizes task-specific prompts to guide the LLM’s output, obviating the need to retrain the entire model for a specific task.
- Low-Rank Adaptation (LoRA): By approximating the weight updates with low-rank matrices, LoRA greatly reduces the number of trainable parameters while retaining performance.
- Adapters: These small, specialized layers can be added to the LLM for task adaptation, providing flexibility and performance improvement.
- Task-Relevant Prefix Tuning: Prepends trainable, task-specific prefix vectors to the model's activations, so only the prefixes are learned for the task while the original weights stay frozen.
Used in
- Gemini Nano (Android AICore) - LoRA
LoRA: Low-Rank Adaptation of Large Language Models
- LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
- Intention: partial finetuning of a pre-trained model
- leads to a reduced number of trainable parameters and lower GPU memory requirements
- Unlike adapters, LoRA adds no additional inference latency (adapters as introduced in Parameter-Efficient Transfer Learning for NLP, the original adapter paper by Houlsby et al., 2019)
- adapter layers incur extra compute at inference despite being small
- the main reason is hardware parallelism: adapter layers must be processed sequentially, which adds latency during inference
- the paper hypothesizes that the weight updates during adaptation have a low “intrinsic rank”, so optimising in this much smaller subspace is sufficient to learn the task
- train some dense layers indirectly by optimising the rank decomposition matrices $A$ and $B$ of the layers' change during adaptation, $\Delta W = BA$, while keeping the pre-trained weights $W_0$ frozen (see the sketch after this list)
- in Transformers, there are 4 weight matrices in the self-attention module ($W_q$, $W_k$, $W_v$, $W_o$) and 2 in the MLP module, but the LoRA paper applies LoRA to the attention weights only
- paper experiments show that applying LoRA on all 4 matrices is better than on just 1, for the same number of trainable parameters
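A minimal sketch of a LoRA-augmented linear layer, assuming the paper's update rule $h = W_0 x + \frac{\alpha}{r} BAx$; the class name, rank $r = 8$ and scaling $\alpha$ below are illustrative, not taken from the paper:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)                   # freeze pre-trained W0 (and bias)
        d_out, d_in = linear.weight.shape
        self.A = nn.Parameter(torch.empty(r, d_in))   # A: r x d_in, random init
        self.B = nn.Parameter(torch.zeros(d_out, r))  # B: d_out x r, zero init => delta_W = 0
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x
        return self.linear(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

Wrapping e.g. the query and value projections of each attention block this way and training only $A$ and $B$ is the basic recipe; the original weights stay untouched.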

- what is the optimal rank for LoRA?
- when both $W_q$ and $W_v$ are adapted, even rank $r = 1$ suffices
- but when only $W_q$ is adapted, the rank needs to be higher
- however, a small rank will not necessarily work on every dataset / task ⇒ e.g. for a completely different language, full finetuning (i.e. $r = d$) will definitely outperform LoRA with a small $r$
- how does the adaptation matrix $\Delta W$ compare with $W$?
- $\Delta W$ has a stronger correlation with $W$ than a random matrix does
- ⇒ $\Delta W$ amplifies certain features that are already present in $W$
- $\Delta W$ only amplifies singular directions that are not emphasised in $W$
- ⇒ $\Delta W$ amplifies features important for the specific downstream task that were learned during pre-training but not emphasised in the general pre-trained model (a rough sketch of this projection analysis follows below)
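A rough sketch (with toy random tensors, so the numbers themselves mean nothing) of the kind of projection analysis the paper uses: project $W$ onto the top-$r$ singular directions of $\Delta W$ and compare Frobenius norms to get an "amplification factor":

```python
import torch

d, r = 512, 8
W = torch.randn(d, d)        # stand-in for a pre-trained weight matrix
delta_W = torch.randn(d, d)  # stand-in for a learned update B @ A (rank r in practice)

# Take the top-r singular directions of delta_W ...
U, S, Vh = torch.linalg.svd(delta_W)
U_r, Vh_r = U[:, :r], Vh[:r, :]

# ... and compare how much of W lies in that subspace vs. the size of delta_W
proj_norm = torch.linalg.norm(U_r.T @ W @ Vh_r.T)
amplification = torch.linalg.norm(delta_W) / proj_norm
print(f"||U_r^T W V_r^T||_F = {proj_norm:.2f}, amplification factor = {amplification:.2f}")
```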
Practical Benefits
- Reduction of training time and space: Using the technique shown above, only $r \times (d_{in} + d_{out})$ parameters have to be tuned per adapted weight matrix. Since $r \ll \min(d_{in}, d_{out})$, this is much less than the $d_{in} \times d_{out}$ parameters that would have to be tuned otherwise. This reduces the time and space required to finetune the model by a large margin; concrete numbers are reported in the paper and the referenced articles (a back-of-the-envelope example follows after this list).
- No additional inference time: If used in production, we can explicitly compute and store $W' = W + BA$ and perform inference with $W'$ as usual. This guarantees that we do not introduce any additional latency during inference.
- Easier task switching: Swapping only the LoRA weights as opposed to all the parameters allows cheaper and faster switching between tasks. Multiple customized models can be created and swapped in and out easily.
- the open-source community has welcomed LoRA with open arms because it lets low-resource practitioners adapt large models
- e.g. for instruct-tuning LLMs and finetuning diffusion models
- TLDR
- Finetune large models with low compute
- Adapt large models in a low-data regime
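A back-of-the-envelope illustration of the two points above; the layer size and rank are made-up numbers, not taken from the paper:

```python
import torch

d_in = d_out = 4096      # e.g. one attention projection in a ~7B-parameter model (illustrative)
r = 8

full_params = d_in * d_out        # parameters updated by full finetuning of this one matrix
lora_params = r * (d_in + d_out)  # parameters updated by LoRA for the same matrix
print(full_params, lora_params, full_params // lora_params)  # 16777216 65536 256 -> ~256x fewer

# No extra inference latency: merge the update once and serve a single matrix as before
W = torch.randn(d_out, d_in)                      # frozen pre-trained weight (stand-in)
B, A = torch.randn(d_out, r), torch.randn(r, d_in)
W_merged = W + B @ A                              # W' = W + BA, used exactly like the original W
```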
Limitations
- Not straightforward to batch inputs for different tasks with different $A$ and $B$ matrices if $A$ and $B$ are absorbed into $W$ for efficiency (lower inference latency)
- It is possible to dynamically swap $A$ and $B$ if they are not merged, but this incurs extra latency (a small merge/unmerge sketch follows after this list)
- Requires a good pretrained model that is trained general enough (i.e. using general enough dataset / large enough capacity) such that the downstream task exists within the same loss valley
- since $\Delta W = BA$ adds linearly to $W$, the pre-trained $W$ must already be “good enough” for the downstream task to be reachable by further finetuning
- this means if the task is too far from the original general pre-trained model then LoRA might not work
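A tiny sketch of the merge / unmerge arithmetic behind task switching (hypothetical names, scaling factor omitted for brevity):

```python
import torch

d, r = 4096, 8
W = torch.randn(d, d)                          # frozen pre-trained weight (stand-in)
B1, A1 = torch.randn(d, r), torch.randn(r, d)  # LoRA weights trained for task 1
B2, A2 = torch.randn(d, r), torch.randn(r, d)  # LoRA weights trained for task 2

W_task1 = W + B1 @ A1        # merged weights currently serving task 1
W_base = W_task1 - B1 @ A1   # unmerge: recover the pre-trained weights
W_task2 = W_base + B2 @ A2   # merge task 2's update; far cheaper than loading a whole new model
```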
Questions
- how to obtain A and B after initialisation? ✅ 2024-01-03
- updated / learned during backprop
- if A and B are rank decomposition matrices, what does it mean to set $B$ to 0 and $A$ to a normal distribution at initialisation? ✅ 2024-01-03
- setting the $B$ matrix to zeros and sampling $A$ randomly (Gaussian in the paper; some implementations use Kaiming-uniform) means $\Delta W = BA = 0$ at the start, so training begins exactly from the pre-trained model. e.g. the pseudocode below
    import math
    import torch.nn as nn
    # Initialization of LoRA weights: A is random, B is zero, so delta_W = B @ A starts at 0
    nn.init.kaiming_uniform_(W_A, a=math.sqrt(5))  # W_A sampled via Kaiming-uniform
    nn.init.zeros_(W_B)                            # W_B set to zero
References
- [2106.09685] LoRA: Low-Rank Adaptation of Large Language Models
- Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA) - very good article!
- Low Rank Adaptation: A Technical Deep Dive
- Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments - Lightning AI - practical insights to using LoRA
- nanoLoRA.ipynb
QLoRA: Efficient Finetuning of Quantized LLMs
- [2305.14314] QLoRA: Efficient Finetuning of Quantized LLMs
- uses a combination of LoRA (Low-Rank Adaptation of Large Language Models) and quantization: the frozen base model is quantized to 4-bit (NF4) while LoRA adapters are trained on top in higher precision (a rough usage sketch with the Hugging Face peft + bitsandbytes stack follows below)
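A rough sketch of how this is commonly wired up with Hugging Face transformers + peft + bitsandbytes; the model name, rank and target modules are illustrative, and the exact API may differ across library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit NF4 (double quantization, bf16 compute)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # any causal LM; model choice is illustrative
    quantization_config=bnb_config,
)

# Attach LoRA adapters (kept in higher precision) on top of the quantized base
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # module names as used in LLaMA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the adapter weights are trainable
```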
Questions
- how does LoRA differ from SVD type factorisations per Low-Rank Factorization? when people talk about low rank factorisation are they referring to LoRA or SVD style? ✅ 2024-01-04
- LoRA is a method for finetuning / parameter-efficient fine-tuning
- factorisation is typically used to compress existing weights for model compression, not so much for finetuning
- "low-rank factorisation" usually refers to SVD-style decomposition of existing weights, whereas LoRA learns new low-rank factors for the weight update (see the sketch after this list)
- what are Adapters? ✅ 2024-01-04
- another method for PEFT ⇒ adapters can be seen as a generalisation of LoRA (i.e. LoRA is a type of adapter)
- see [2304.01933] LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models for overview of adapters (incl LoRA)
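A toy snippet contrasting the two (shapes and rank are arbitrary): SVD factorises the existing weights for compression, while LoRA leaves them frozen and learns a separate low-rank update:

```python
import torch

d, r = 512, 8
W = torch.randn(d, d)                        # hypothetical pre-trained weight matrix

# SVD-style factorisation: compress the EXISTING W into its best rank-r approximation
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
W_compressed = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

# LoRA: W stays intact and frozen; a NEW low-rank update delta_W = B @ A is learned by SGD
B, A = torch.zeros(d, r), torch.randn(r, d)  # trainable; B is zero at init
W_adapted = W + B @ A                        # effective weight after finetuning
```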
Theoretical References
Papers
Articles
- Overview of PEFT: State-of-the-art Parameter-Efficient Fine-Tuning - KDnuggets
- A guide to Parameter-efficient Fine-tuning(PEFT)
- Understanding Parameter-Efficient Finetuning of Large Language Models: From Prefix Tuning to LLaMA-Adapters
Courses
Code References
Methods
Tools, Frameworks
- GitHub - huggingface/peft: 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
- OpenAccess-AI-Collective/axolotl
- User-friendly and powerful fine-tuning tool that is used in a lot of state-of-the-art open-source models.