Knowledge Distillation


Created: 21 Nov 2022, 10:42 AM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge, MTL


Overview

Used in:

  • Alpaca, Google Gemini Nano-1 (1.8B) and Nano-2 (3.25B)

Introduction

  • compression technique in which a small model is trained to reproduce the behavior of a larger model (or an ensemble of models)
  • Designing a knowledge distillation system requires 3 considerations:
    1. What knowledge?
      • Three major types of knowledge: response-based, feature-based, and relation-based distillation.
    2. What architecture? (the teacher-student architecture)
      • typically involves a small “student” model learning to mimic a large “teacher” model and using the teacher’s knowledge to achieve similar or superior accuracy.
      • Intuition:
        • once trained, the teacher model can predict a class for a new sample
        • the relative probabilities the teacher assigns to the other (incorrect) classes express what it has learned about generalizing from the training data
        • the student model can learn from the full probability vector predicted by the teacher, i.e. a soft distribution over the classes
        • this differs from how the teacher itself was trained, on a hard distribution where only the correct class is specified in the training data
        • use an objective function that combines “learning from the soft class distribution provided by the teacher model, which can be faulty” and “learning from the hard class distribution provided by the training labels, which are taken as ground truth” (see the loss sketch after this list)
      • How to choose the student?
    3. How to distill? (i.e. the distillation algorithm)
      • Key problem! How to transfer the knowledge: offline vs. online vs. self-distillation.
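A minimal sketch of the soft/hard objective described above (assuming a PyTorch setup; T and alpha are illustrative hyperparameters, not values from any particular paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine the soft-target (teacher) loss and the hard-target (label) loss."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    # T > 1 softens the distributions so that the teacher's relative
    # probabilities for the non-target classes carry signal; the T**2 factor
    # keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)

    # alpha weights the distillation term against the supervised term.
    return alpha * soft + (1 - alpha) * hard
```

The soft term here is the KL divergence asked about in the Questions section below; some formulations use soft cross-entropy instead, which differs from KL only by a term that is constant with respect to the student.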

Types of distillation algorithms

  • Adversarial Distillation
    • Inspired by GANs
    • Enable the teacher and student networks to have a better understanding of the true data distribution.
  • Multi-Teacher Distillation
    • Simplest way is to use the averaged response from all teachers as the supervision signal (see the sketch after this list).
  • Cross-Modal Distillation
    • Cross modality since data or labels for some modalities might not be available during training or testing.
  • Graph-Based Distillation
    • Explore the intra-data relationships using graphs.
    • Use the graph as the carrier of teacher knowledge or use the graph to control the message passing of the teacher knowledge.
  • Attention-Based Distillation
    • since attention maps reflect the neuron activations of a network (e.g., a CNN) well, they can serve as the knowledge transferred in KD
  • Data-Free Distillation
    • to overcome problems with unavailable data arising from privacy, legality, security, and confidentiality concerns.
    • “data-free” == no access to the original training data; the transfer data is newly or synthetically generated instead (e.g., from the teacher).
  • Quantized Distillation Network
  • Lifelong Distillation
    • to address catastrophic forgetting.
  • NAS-Based Distillation
    • since success of knowledge transfer depends on not only the knowledge from the teacher but also the architecture of the student, Neural Architecture Search (NAS) has been adopted to find the appropriate student architecture.
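A minimal sketch of the “averaged response” multi-teacher signal mentioned above (assuming the teachers are PyTorch modules that output logits; the averaged distribution would then replace the single-teacher soft target in a loss like the one sketched earlier):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def averaged_teacher_probs(teachers, x, T=2.0):
    """Average the temperature-softened class distributions of several teachers."""
    probs = [F.softmax(teacher(x) / T, dim=-1) for teacher in teachers]
    return torch.stack(probs).mean(dim=0)  # [batch, num_classes]
```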

Challenges

  • The influence of each individual type of knowledge, and how different kinds of knowledge complement each other, is not well understood
  • Challenging to model different types of knowledge in a unified and complementary framework
  • Most methods focus on new types of knowledge or new distillation loss functions; teacher-student architecture design remains poorly investigated

Self-Distillation

DistilBERT (2019)

TinyBERT (2020)

  • builds on the same idea as DistilBERT, but modifies the loss function to take into consideration not only WHAT both models produce but also HOW the predictions are obtained (i.e. intermediate representations)
  • introduces Transformer Distillation

Transformer Distillation

  • Loss function (sketched below) that encompasses:
    • Output of embedding layer
    • Hidden states, attention matrices within transformer layer
    • Logits output by prediction layer
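A rough sketch of such a layer-wise loss (the output dict structure, layer_map, and the omission of the learned projection that matches student and teacher hidden sizes are simplifying assumptions):

```python
import torch.nn.functional as F

def transformer_distillation_loss(student_out, teacher_out, layer_map, T=1.0):
    """TinyBERT-style loss over embeddings, per-layer states/attentions, and logits.

    student_out / teacher_out: dicts with keys "embeddings", "hidden" (list of
    per-layer hidden states), "attentions" (list of attention matrices), and
    "logits". layer_map maps each student layer index to the teacher layer it
    imitates.
    """
    # 1. Output of the embedding layer (MSE).
    loss = F.mse_loss(student_out["embeddings"], teacher_out["embeddings"])

    # 2. Hidden states and attention matrices within the transformer layers (MSE).
    for s_idx, t_idx in layer_map.items():
        loss = loss + F.mse_loss(student_out["hidden"][s_idx], teacher_out["hidden"][t_idx])
        loss = loss + F.mse_loss(student_out["attentions"][s_idx], teacher_out["attentions"][t_idx])

    # 3. Logits output by the prediction layer (soft-target term).
    loss = loss + F.kl_div(
        F.log_softmax(student_out["logits"] / T, dim=-1),
        F.softmax(teacher_out["logits"] / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return loss
```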

Weight Distillation

Questions

  • is it typically using KL divergence? ✅ 2023-12-29

Theoretical References

Papers

Articles

Courses


Code References

Methods

Tools, Frameworks