Parameter Efficient Fine-Tuning (PEFT)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Overview

Introduction

  • Various PEFT methods have been developed, such as:
    1. Task-Guided Prompt Tuning: This technique uses task-specific (often learnable "soft") prompts to guide the LLM’s output, obviating the need to retrain the entire model for a specific task (a minimal sketch follows this list).
    2. Low-Rank Adaptation (LoRA): By approximating the LLM’s weight updates with low-rank matrices, LoRA decreases the number of fine-tuned parameters while retaining performance.
    3. Adapters: These small, specialized layers can be added to the LLM for task adaptation, providing flexibility and performance improvement.
    4. Task-Relevant Prefix Tuning: Learning continuous, task-specific prefix vectors that are prepended to the model’s activations enhances performance and task adaptability without updating the base weights.
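  • A minimal sketch of the soft-prompt variant of prompt tuning (names and dimensions below are illustrative assumptions, not from a specific library): a small set of learnable prompt embeddings is prepended to the input embeddings, and only those embeddings are trained while the LLM stays frozen.
    import torch
    import torch.nn as nn

    n_prompt_tokens, d_model = 20, 768        # illustrative sizes

    # learnable soft prompt: the only parameters updated during fine-tuning
    soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)

    def prepend_prompt(input_embeds):
        # input_embeds: (batch, seq_len, d_model) token embeddings from the frozen LLM
        batch = input_embeds.shape[0]
        prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)   # (batch, n_prompt + seq, d_model)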

Used in

LoRA: Low-Rank Adaptation of Large Language Models

  • LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
    • Intention: partial finetuning of a pre-trained model
    • reduces the number of trainable parameters and the GPU memory required for finetuning
    • Unlike adapters, LoRA adds no additional inference latency (adapters as in: Parameter-Efficient Transfer Learning for NLP - original adapter paper by Houlsby et al., 2019)
      • adapter layers add extra compute despite being small
      • the main reason is hardware parallelism: adapter layers cannot be folded into the existing matrix multiplications and must be processed sequentially, which adds latency during inference
  • hypothesize that the updates to the weights have a low “intrinsic rank” during adaptation, and that this low-rank update is sufficient to learn efficiently despite the smaller subspace
    • train some dense layers indirectly by optimising the rank decomposition matrices (B and A, with ΔW = BA so that W = W_0 + BA) of the dense layers’ change during adaptation, while keeping the pre-trained weights W_0 frozen (see the sketch after this list)
    • in Transformers, there are 4 weight matrices in the self-attention module (W_q, W_k, W_v, W_o) and 2 in the MLP module, but LoRA focuses on the attention weights only
      • paper experiments show that applying LoRA on all 4 attention matrices is better than on just 1, for the same number of trainable parameters
      • what is the optimal rank r for LoRA?
        • seems like when adapting both W_q and W_v, even rank r = 1 suffices
        • but when only W_q is adapted, the rank r needs to be higher
        • however, a small rank will not necessarily work on every dataset / task; e.g. for a completely different language, full finetuning (i.e. a full-rank ΔW) will definitely outperform LoRA with a small r
      • how does the adaptation matrix ΔW compare with W?
        • ΔW has a stronger correlation with W than a random matrix does
          • i.e. ΔW amplifies certain features that are already in W
        • ΔW only amplifies singular directions that are not emphasised in W
          • i.e. ΔW amplifies features important for the specific downstream task that were learned during pre-training, but not emphasised in the general pre-trained model
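  • A minimal sketch of the idea in PyTorch (the class name LoRALinear and the defaults for rank r and scaling alpha are illustrative assumptions, not the paper’s reference implementation):
    import math
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Frozen pre-trained linear layer W_0 plus a trainable low-rank update B @ A
        def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 8):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)                    # freeze W_0 (and bias)
            d_out, d_in = base.weight.shape
            self.A = nn.Parameter(torch.empty(r, d_in))    # r x k
            self.B = nn.Parameter(torch.zeros(d_out, r))   # d x r, zeros so BA = 0 at start
            nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
            self.scaling = alpha / r

        def forward(self, x):
            # h = W_0 x + (alpha / r) * B A x
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling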

Practical Benefits

  1. Reduction of training time and space: Using the technique shown above, only r × (d + k) parameters have to be tuned per adapted weight matrix during model adaptation. Since r ≪ min(d, k), this is much less than the d × k parameters that would have to be tuned otherwise. This reduces the time and space required to finetune the model by a large margin. Some numbers from the paper and our experiments are discussed in the sections below.
  2. No additional inference time: If used in production, we can explicitly compute W = W_0 + BA and store the result, performing inference as usual. This guarantees that we do not introduce any additional latency during inference (see the merge sketch after this list).
  3. Easier task switching: Swapping only the LoRA weights as opposed to all the parameters allows cheaper and faster switching between tasks. Multiple customized models can be created and swapped in and out easily.
  • open-source community has welcomed LoRA with open arms because it allows low-resource practitioners to adapt large models
    • e.g. for Instruct-tuning LLMs and Finetuning Diffusion models.
  • TLDR
    • Finetune large models with low compute
    • Adapt large models in a low-data regime
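  • A minimal sketch of merging and unmerging the LoRA update (function names are illustrative); unmerging lets a different task’s A/B pair be swapped in cheaply, which is what makes task switching easy:
    import torch

    @torch.no_grad()
    def merge(base_weight, A, B, scaling):
        # W = W_0 + scaling * B @ A  ->  no extra matmul at inference time
        base_weight += scaling * (B @ A)

    @torch.no_grad()
    def unmerge(base_weight, A, B, scaling):
        # restore W_0 so a different task's (A, B) pair can be merged in
        base_weight -= scaling * (B @ A)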

Limitations

  • Not straightforward to batch inputs from different tasks (each with its own A and B matrices) in a single forward pass if A and B are absorbed into W for efficiency (lower inference latency)
    • Possible to dynamically swap A and B if they are not merged, but will incur latency
  • Requires a good pretrained model that is general enough (i.e. trained on a broad enough dataset / with large enough capacity) such that the downstream task lies within the same loss valley
    • since ΔW = BA adds linearly to the frozen W_0, W_0 must already be “good enough” for the downstream task to be reachable by finetuning only the low-rank update
    • this means if the task is too far from the original general pre-trained model, then LoRA might not work

Questions

  • how to obtain A and B after initialisation? ✅ 2024-01-03
    • updated / learned during backprop, while the pre-trained weights stay frozen (see the training sketch after the init code below)
  • if A and B are rank decomposition matrices, what does it mean to set B as 0 and A as a normal dist at initialisation? ✅ 2024-01-03
    • setting the B matrix to zeros, and setting A to random samples (the paper uses a zero-mean Gaussian; the snippet below uses Kaiming-uniform), so that ΔW = BA = 0 at the start of training. e.g. the pseudocode below
    import math
    import torch.nn as nn

    # Initialization of LoRA weights (W_A, W_B are the nn.Parameter rank-decomposition matrices)
    nn.init.kaiming_uniform_(W_A, a=math.sqrt(5))   # A: random init
    nn.init.zeros_(W_B)                             # B: zeros, so W_B @ W_A = 0 at the start
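  • A minimal training sketch (names are illustrative, and `model` / `dataloader` are assumed to exist): only A and B are passed to the optimiser, so backprop updates them while the frozen base weights never change.
    import torch

    # assuming the model wraps its linear layers with LoRA modules exposing .A and .B
    lora_params = [p for n, p in model.named_parameters() if n.endswith((".A", ".B"))]
    optimizer = torch.optim.AdamW(lora_params, lr=1e-4)

    for inputs, labels in dataloader:                  # hypothetical dataloader
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()                                # gradients flow only into A and B (W_0 is frozen)
        optimizer.step()
        optimizer.zero_grad()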
    

References

QLoRA: Efficient Finetuning of Quantized LLMs

Questions

  • how does LoRA differ from SVD type factorisations per Low-Rank Factorization? when people talk about low rank factorisation are they referring to LoRA or SVD style? ✅ 2024-01-04
    • LoRA is a method of finetuning / parameter-efficient fine-tuning
    • factorisation is typically for model compression only, not so much for finetuning
    • SVD-style factorisation approximates the existing weight matrix W itself (e.g. by truncating its singular values), whereas LoRA keeps W frozen and learns a separate low-rank update ΔW = BA (see the sketch after this list)
  • what are Adapters? ✅ 2024-01-04
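  • A small sketch of the contrast (torch used purely for illustration, sizes are arbitrary): SVD-style low-rank factorisation compresses an existing W, while LoRA learns an update on top of it.
    import torch

    d, k, r = 512, 512, 8
    W = torch.randn(d, k)                      # stand-in for an existing pre-trained weight matrix

    # SVD-style compression: approximate W itself with a rank-r factorisation
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_approx = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]   # rank-r approximation of W

    # LoRA, by contrast, keeps W frozen and learns a separate low-rank update:
    # W_effective = W + B @ A, with only B (d x r) and A (r x k) trained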

Theoretical References

Papers

Articles

Courses


Code References

Methods

Tools, Frameworks