Parameter Efficient Fine-Tuning (PEFT)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Overview

Introduction

  • Various PEFT methods have been developed, such as:
    1. Task-Guided Prompt Tuning: This technique uses task-specific (often learnable "soft") prompts to guide the LLM’s output, obviating the need to retrain the entire model for a specific task (a minimal sketch follows this list).
    2. Low-Rank Adaptation (LoRA): By approximating the LLM’s weight updates with low-rank matrices, LoRA decreases the number of fine-tuned parameters while retaining performance.
    3. Adapters: These small, specialized layers can be added to the LLM for task adaptation, providing flexibility and performance improvement.
    4. Task-Relevant Prefix Tuning: Learning continuous, task-specific prefix vectors that are prepended to the model’s activations enhances performance and task adaptability without updating the base weights.
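  • A minimal sketch of the soft-prompt variant of prompt tuning (names and dimensions below are illustrative assumptions, not from a specific library): a small set of learnable prompt embeddings is prepended to the input embeddings, and only those embeddings are trained while the LLM stays frozen.
    import torch
    import torch.nn as nn

    n_prompt_tokens, d_model = 20, 768        # illustrative sizes

    # learnable soft prompt: the only parameters updated during fine-tuning
    soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)

    def prepend_prompt(input_embeds):
        # input_embeds: (batch, seq_len, d_model) token embeddings from the frozen LLM
        batch = input_embeds.shape[0]
        prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)   # (batch, n_prompt + seq, d_model)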

Used in

LoRA: Low-Rank Adaptation of Large Language Models

  • LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
    • Intention: partial finetuning of a pre-trained model
    • reduces the number of trainable parameters and the GPU memory required for finetuning
    • Unlike adapters, LoRA adds no additional inference latency (adapters as in: Parameter-Efficient Transfer Learning for NLP - original adapter paper by Houlsby et al., 2019)
      • adapter layers add extra compute despite being small
      • the main reason is hardware parallelism: adapter layers cannot be folded into the existing matrix multiplications and must be processed sequentially, which adds latency during inference
  • hypothesize that the updates to the weights have a low “intrinsic rank” during adaptation, and that this low-rank update is sufficient to learn efficiently despite the smaller subspace
    • train some dense layers indirectly by optimising the rank decomposition matrices (B and A, with ΔW = BA so that W = W_0 + BA) of the dense layers’ change during adaptation, while keeping the pre-trained weights W_0 frozen (see the sketch after this list)
    • in Transformers, there are 4 weight matrices in the self-attention module (W_q, W_k, W_v, W_o) and 2 in the MLP module, but LoRA focuses on the attention weights only
      • paper experiments show that applying LoRA on all 4 attention matrices is better than on just 1, for the same number of trainable parameters
      • what is the optimal rank r for LoRA?
        • seems like when adapting both W_q and W_v, even rank r = 1 suffices
        • but when only W_q is adapted, the rank r needs to be higher
        • however, a small rank will not necessarily work on every dataset / task; e.g. for a completely different language, full finetuning (i.e. a full-rank ΔW) will definitely outperform LoRA with a small r
      • how does the adaptation matrix ΔW compare with W?
        • ΔW has a stronger correlation with W than a random matrix does
          • i.e. ΔW amplifies certain features that are already in W
        • ΔW only amplifies singular directions that are not emphasised in W
          • i.e. ΔW amplifies features important for the specific downstream task that were learned during pre-training, but not emphasised in the general pre-trained model
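  • A minimal sketch of the idea in PyTorch (the class name LoRALinear and the defaults for rank r and scaling alpha are illustrative assumptions, not the paper’s reference implementation):
    import math
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Frozen pre-trained linear layer W_0 plus a trainable low-rank update B @ A
        def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 8):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)                    # freeze W_0 (and bias)
            d_out, d_in = base.weight.shape
            self.A = nn.Parameter(torch.empty(r, d_in))    # r x k
            self.B = nn.Parameter(torch.zeros(d_out, r))   # d x r, zeros so BA = 0 at start
            nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
            self.scaling = alpha / r

        def forward(self, x):
            # h = W_0 x + (alpha / r) * B A x
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling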

Practical Benefits

  1. Reduction of training time and space: Using the technique shown above, only r × (d + k) parameters have to be tuned per adapted weight matrix during model adaptation. Since r ≪ min(d, k), this is much less than the d × k parameters that would have to be tuned otherwise. This reduces the time and space required to finetune the model by a large margin. Some numbers from the paper and our experiments are discussed in the sections below.
  2. No additional inference time: If used in production, we can explicitly compute W = W_0 + BA and store the result, performing inference as usual. This guarantees that we do not introduce any additional latency during inference (see the merge sketch after this list).
  3. Easier task switching: Swapping only the LoRA weights as opposed to all the parameters allows cheaper and faster switching between tasks. Multiple customized models can be created and swapped in and out easily.
  • open-source community has welcomed LoRA with open arms because it allows low-resource practitioners to adapt large models
    • e.g. for Instruct-tuning LLMs and Finetuning Diffusion models.
  • TLDR
    • Finetune large models with low compute
    • Adapt large models in a low-data regime
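  • A minimal sketch of merging and unmerging the LoRA update (function names are illustrative); unmerging lets a different task’s A/B pair be swapped in cheaply, which is what makes task switching easy:
    import torch

    @torch.no_grad()
    def merge(base_weight, A, B, scaling):
        # W = W_0 + scaling * B @ A  ->  no extra matmul at inference time
        base_weight += scaling * (B @ A)

    @torch.no_grad()
    def unmerge(base_weight, A, B, scaling):
        # restore W_0 so a different task's (A, B) pair can be merged in
        base_weight -= scaling * (B @ A)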

Limitations

  • Not straightforward to batch inputs from different tasks (each with its own A and B matrices) in a single forward pass if A and B are absorbed into W for efficiency (lower inference latency)
    • Possible to dynamically swap A and B if they are not merged, but will incur latency
  • Requires a good pretrained model that is general enough (i.e. trained on a broad enough dataset / with large enough capacity) such that the downstream task lies within the same loss valley
    • since ΔW = BA adds linearly to the frozen W_0, W_0 must already be “good enough” for the downstream task to be reachable by finetuning only the low-rank update
    • this means if the task is too far from the original general pre-trained model, then LoRA might not work

Questions

  • how to obtain A and B after initialisation? ✅ 2024-01-03
    • updated / learned during backprop, while the pre-trained weights stay frozen (see the training sketch after the init code below)
  • if A and B are rank decomposition matrices, what does it mean to set B as 0 and A as a normal dist at initialisation? ✅ 2024-01-03
    • setting the B matrix to zeros, and setting A to random samples (the paper uses a zero-mean Gaussian; the snippet below uses Kaiming-uniform), so that ΔW = BA = 0 at the start of training. e.g. the pseudocode below
    import math
    import torch.nn as nn

    # Initialization of LoRA weights (W_A, W_B are the nn.Parameter rank-decomposition matrices)
    nn.init.kaiming_uniform_(W_A, a=math.sqrt(5))   # A: random init
    nn.init.zeros_(W_B)                             # B: zeros, so W_B @ W_A = 0 at the start
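  • A minimal training sketch (names are illustrative, and `model` / `dataloader` are assumed to exist): only A and B are passed to the optimiser, so backprop updates them while the frozen base weights never change.
    import torch

    # assuming the model wraps its linear layers with LoRA modules exposing .A and .B
    lora_params = [p for n, p in model.named_parameters() if n.endswith((".A", ".B"))]
    optimizer = torch.optim.AdamW(lora_params, lr=1e-4)

    for inputs, labels in dataloader:                  # hypothetical dataloader
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()                                # gradients flow only into A and B (W_0 is frozen)
        optimizer.step()
        optimizer.zero_grad()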
    

References

QLoRA: Efficient Finetuning of Quantized LLMs

Questions

  • how does LoRA differ from SVD type factorisations per Low-Rank Factorization? when people talk about low rank factorisation are they referring to LoRA or SVD style? ✅ 2024-01-04
    • LoRA is a method of finetuning / parameter-efficient fine-tuning
    • factorisation is typically for model compression only, not so much for finetuning
    • SVD-style factorisation approximates the existing weight matrix W itself (e.g. by truncating its singular values), whereas LoRA keeps W frozen and learns a separate low-rank update ΔW = BA (see the sketch after this list)
  • what are Adapters? ✅ 2024-01-04
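  • A small sketch of the contrast (torch used purely for illustration, sizes are arbitrary): SVD-style low-rank factorisation compresses an existing W, while LoRA learns an update on top of it.
    import torch

    d, k, r = 512, 512, 8
    W = torch.randn(d, k)                      # stand-in for an existing pre-trained weight matrix

    # SVD-style compression: approximate W itself with a rank-r factorisation
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_approx = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]   # rank-r approximation of W

    # LoRA, by contrast, keeps W frozen and learns a separate low-rank update:
    # W_effective = W + B @ A, with only B (d x r) and A (r x k) trained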

Theoretical References

Papers

Articles

Courses


Code References

Methods

Tools, Frameworks