Knowledge Distillation


Created: 21 Nov 2022, 10:42 AM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge, MTL


Overview

Used in:

  • Alpaca, Google Gemini Nano-1 (1.8B) and Nano-2 (3.25B)

Introduction

  • compression technique in which a small model is trained to reproduce the behavior of a larger model (or an ensemble of models)
  • Designing a knowledge distillation system requires 3 considerations:
    1. What knowledge?
      • Three major types of knowledge: response-based, feature-based, and relation-based distillation.
    2. What architecture? (the teacher-student architecture)
      • typically involves a small “student” model learning to mimic a large “teacher” model and using the teacher’s knowledge to achieve similar or superior accuracy.
      • Intuition:
        • once trained, the teacher model can predict a class for a new sample
        • the relative probabilities the teacher assigns to the other (incorrect) classes express what it has learned about generalizing from the training data
        • the student model can learn from the full probability vector predicted by the teacher, i.e. a soft distribution over the classes
        • this differs from how the teacher itself was trained, on a hard distribution where only the correct class is specified in the training data
        • use an objective function that combines “learning from the soft class distribution provided by the teacher model, which can be faulty” and “learning from the hard class distribution provided by the training labels, which are taken as ground truth” (see the loss sketch after this list)
      • How to choose the student?
    3. How to distill? (i.e. the distillation algorithm)
      • Key problem! How to transfer the knowledge: offline vs. online vs. self-distillation.
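A minimal sketch of the soft/hard objective described above (assuming a PyTorch setup; T and alpha are illustrative hyperparameters, not values from any particular paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine the soft-target (teacher) loss and the hard-target (label) loss."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    # T > 1 softens the distributions so that the teacher's relative
    # probabilities for the non-target classes carry signal; the T**2 factor
    # keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)

    # alpha weights the distillation term against the supervised term.
    return alpha * soft + (1 - alpha) * hard
```

The soft term here is the KL divergence asked about in the Questions section below; some formulations use soft cross-entropy instead, which differs from KL only by a term that is constant with respect to the student.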

Types of distillation algorithms

  • Adversarial Distillation
    • Inspired by GANs
    • Enable the teacher and student networks to have a better understanding of the true data distribution.
  • Multi-Teacher Distillation
    • Simplest way is to use the averaged response from all teachers as the supervision signal (see the sketch after this list).
  • Cross-Modal Distillation
    • Cross modality since data or labels for some modalities might not be available during training or testing.
  • Graph-Based Distillation
    • Explore the intra-data relationships using graphs.
    • Use the graph as the carrier of teacher knowledge or use the graph to control the message passing of the teacher knowledge.
  • Attention-Based Distillation
    • since attention maps reflect the neuron activations of a network (e.g., a CNN) well, they can serve as the knowledge transferred in KD
  • Data-Free Distillation
    • to overcome problems with unavailable data arising from privacy, legality, security, and confidentiality concerns.
    • “data-free” == no access to the original training data; the transfer data is newly or synthetically generated instead (e.g., from the teacher).
  • Quantized Distillation Network
  • Lifelong Distillation
    • to address catastrophic forgetting.
  • NAS-Based Distillation
    • since success of knowledge transfer depends on not only the knowledge from the teacher but also the architecture of the student, Neural Architecture Search (NAS) has been adopted to find the appropriate student architecture.
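A minimal sketch of the “averaged response” multi-teacher signal mentioned above (assuming the teachers are PyTorch modules that output logits; the averaged distribution would then replace the single-teacher soft target in a loss like the one sketched earlier):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def averaged_teacher_probs(teachers, x, T=2.0):
    """Average the temperature-softened class distributions of several teachers."""
    probs = [F.softmax(teacher(x) / T, dim=-1) for teacher in teachers]
    return torch.stack(probs).mean(dim=0)  # [batch, num_classes]
```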

Challenges

  • The influence of each individual type of knowledge, and how different kinds of knowledge complement each other, is not well understood
  • Challenging to model different types of knowledge in a unified and complementary framework
  • Most methods focus on new types of knowledge or new distillation loss functions; teacher-student architecture design remains poorly investigated

Self-Distillation

DistilBERT (2019)

TinyBERT (2020)

  • builds on the same idea as DistilBERT, but modifies the loss function to take into consideration not only WHAT both models produce but also HOW the predictions are obtained (i.e. intermediate representations)
  • introduces Transformer Distillation

Transformer Distillation

  • Loss function (sketched below) that encompasses:
    • Output of embedding layer
    • Hidden states, attention matrices within transformer layer
    • Logits output by prediction layer
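A rough sketch of such a layer-wise loss (the output dict structure, layer_map, and the omission of the learned projection that matches student and teacher hidden sizes are simplifying assumptions):

```python
import torch.nn.functional as F

def transformer_distillation_loss(student_out, teacher_out, layer_map, T=1.0):
    """TinyBERT-style loss over embeddings, per-layer states/attentions, and logits.

    student_out / teacher_out: dicts with keys "embeddings", "hidden" (list of
    per-layer hidden states), "attentions" (list of attention matrices), and
    "logits". layer_map maps each student layer index to the teacher layer it
    imitates.
    """
    # 1. Output of the embedding layer (MSE).
    loss = F.mse_loss(student_out["embeddings"], teacher_out["embeddings"])

    # 2. Hidden states and attention matrices within the transformer layers (MSE).
    for s_idx, t_idx in layer_map.items():
        loss = loss + F.mse_loss(student_out["hidden"][s_idx], teacher_out["hidden"][t_idx])
        loss = loss + F.mse_loss(student_out["attentions"][s_idx], teacher_out["attentions"][t_idx])

    # 3. Logits output by the prediction layer (soft-target term).
    loss = loss + F.kl_div(
        F.log_softmax(student_out["logits"] / T, dim=-1),
        F.softmax(teacher_out["logits"] / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return loss
```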

Weight Distillation

Questions

  • is it typically using KL divergence? ✅ 2023-12-29

Theoretical References

Papers

Articles

Courses


Code References

Methods

Tools, Frameworks