Knowledge Distillation
Created: 21 Nov 2022, 10:42 AM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge, MTL
Overview
Related fields
Used in:
- Alpaca, Google Gemini Nano-1 (1.8B) and Nano-2 (3.25B)
Introduction
- compression technique in which a small model is trained to reproduce the behavior of a larger model (or an ensemble of models)

- Designing a knowledge distillation system requires 3 considerations:
- What knowledge?
- Three major types of knowledge for knowledge distillation: response-based, feature-based, and relation-based.
- What architecture? (the teacher-student architecture)
- typically involves a small “student” model learning to mimic a large “teacher” model and using the teacher’s knowledge to achieve similar or superior accuracy.

- Intuition:
- teacher model, once trained, will be able to predict a matching class for a new sample
- relative probabilities assigned by the teacher model to the other classes express what the model has learned about generalizing on the training data
- student model can learn from the teacher model using all the probabilities predicted, learning a soft distribution over the classes
- different from how the teacher model learns from a hard distribution over the classes, where only the correct class is specified in the training data
- ⇒ use an objective function that is a combination of “learning from the soft class distribution provided by the teacher model, which can be faulty” and “learning from the hard class distribution provided by the training data, which is always true” (see the loss sketch below)
- How to choose the student?
- Normally transfer from deeper, wider networks ⇒ shallower, thinner networks
- Mainly combinations of a deep/wide teacher with a simplified, quantized, or otherwise smaller student network (or even the same architecture, as in self-distillation)
- How to distill? (i.e. the distillation algorithm)
- Key problem! How to transfer the knowledge.
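A minimal sketch of this combined objective (response-based KD with temperature-scaled soft targets, as in Hinton et al.); `T` and `alpha` are illustrative hyperparameters, not prescribed values:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine the soft-target loss (teacher) with the hard-target loss (labels)."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures
    # Hard targets: ordinary cross-entropy against the training labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```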
Offline vs Online vs Self-Distillation
Types of distillation algorithms
- Adversarial Distillation
- Inspired by GANs
- Enable the teacher and student networks to have a better understanding of the true data distribution.
- Multi-Teacher Distillation
- Simplest way is to use the averaged response from all teachers as the supervision signal (see the averaging sketch below)
- Cross-Modal Distillation
- Transfers knowledge across modalities, since data or labels for some modalities might not be available during training or testing
- Graph-Based Distillation
- Explore the intra-data relationships using graphs.
- Use the graph as the carrier of teacher knowledge or use the graph to control the message passing of the teacher knowledge.
- Attention-Based Distillation
- since attention maps reflect the neuron activations of CNNs well, they are a natural form of knowledge to transfer in KD
- Data-Free Distillation
- to overcome problems with unavailable data arising from privacy, legality, security, and confidentiality concerns.
- “data-free” means there is no original training data; the transfer data is newly or synthetically generated
- Quantized Distillation Network
- Combines distillation with quantization (e.g. a full-precision teacher supervising a low-precision student)
- Lifelong Distillation
- to address catastrophic forgetting.
- NAS-Based Distillation
- since the success of knowledge transfer depends not only on the knowledge from the teacher but also on the architecture of the student, Neural Architecture Search (NAS) has been adopted to find an appropriate student architecture
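A hypothetical sketch of the multi-teacher averaging mentioned above, assuming each teacher is a frozen model that returns logits for the same batch (names and shapes are assumptions):

```python
import torch

def averaged_teacher_logits(teachers, inputs):
    """Average the logits of several frozen teachers to form one supervision signal."""
    with torch.no_grad():
        all_logits = torch.stack([teacher(inputs) for teacher in teachers], dim=0)
    # The mean response is then used in place of a single teacher's logits.
    return all_logits.mean(dim=0)
```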
Challenges
- The influence of each individual type of knowledge, and how different kinds of knowledge help each other in a complementary manner, is not well understood
- Challenging to model different types of knowledge in a unified and complementary framework
- Most methods focus on new types of knowledge or distillation loss functions; teacher-student architectures are poorly investigated
Self-Distillation
- performing knowledge distillation against an individual model of the same architecture
- [1905.08094] Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation
DistilBERT (2019)
- 🏎 Smaller, faster, cheaper, lighter: Introducing DistilBERT | by Victor Sanh
- DistilBERT | by Vyacheslav Efimov
- retains 97% of BERT's performance while being 60% faster
- training loss is a linear combination of the following (triple loss; sketched below):
- masked language modeling loss
- as in the standard masked language modelling objective
- distillation loss
- KL divergence loss between the temperature-scaled teacher and student output distributions
- equivalent (up to a constant) to cross-entropy against the teacher's soft targets with softmax temperature

- similarity loss
- cosine similarity loss between hidden state embeddings
- in order to construct embeddings similar to those of the teacher

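A rough sketch of how the triple loss could be combined, assuming standard shapes for the MLM logits, labels (with -100 marking unmasked tokens), and final hidden states; the weights `w_mlm`, `w_kd`, `w_cos` and temperature `T` are illustrative, not the paper's exact values:

```python
import torch
import torch.nn.functional as F

def distilbert_style_loss(student_logits, teacher_logits, mlm_labels,
                          student_hidden, teacher_hidden,
                          T=2.0, w_mlm=1.0, w_kd=1.0, w_cos=1.0):
    vocab = student_logits.size(-1)
    # 1) Masked language modelling loss against the hard labels.
    mlm = F.cross_entropy(student_logits.view(-1, vocab),
                          mlm_labels.view(-1), ignore_index=-100)
    # 2) Distillation loss on the teacher's temperature-scaled soft distribution.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    # 3) Cosine similarity loss aligning student and teacher hidden states.
    s_h = student_hidden.view(-1, student_hidden.size(-1))
    t_h = teacher_hidden.view(-1, teacher_hidden.size(-1))
    cos = F.cosine_embedding_loss(s_h, t_h,
                                  torch.ones(s_h.size(0), device=s_h.device))
    return w_mlm * mlm + w_kd * kd + w_cos * cos
```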
TinyBERT (2020)
- builds on top of DistilBERT, but modifies the loss function to take into consideration not only WHAT both models produce but also HOW predictions are obtained
- introduces Transformer Distillation
Transformer Distillation

- Loss function that encompasses (see the sketch below):
- Output of the embedding layer
- Hidden states and attention matrices within the transformer layers
- Logits output by the prediction layer
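A minimal sketch of these layer-wise losses, assuming teacher and student layers have already been mapped to each other and that learned projection matrices (`W_e`, `W_h`, assumptions here) bridge the different hidden sizes:

```python
import torch.nn.functional as F

def embedding_loss(student_emb, teacher_emb, W_e):
    # MSE between the (projected) embedding-layer outputs.
    return F.mse_loss(student_emb @ W_e, teacher_emb)

def hidden_state_loss(student_hidden, teacher_hidden, W_h):
    # MSE between the (projected) hidden states of mapped transformer layers.
    return F.mse_loss(student_hidden @ W_h, teacher_hidden)

def attention_loss(student_attn, teacher_attn):
    # MSE between attention matrices, so the student mimics HOW attention is spread.
    return F.mse_loss(student_attn, teacher_attn)

def prediction_loss(student_logits, teacher_logits, T=1.0):
    # Soft cross-entropy on the prediction-layer logits (written here as KL,
    # which differs from soft cross-entropy only by a constant).
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
```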
Weight Distillation
- Weight Initialisation via “Weight selection / distillation”
- Using a larger model to initialise the weights for a smaller model (see the sketch below)
- [2311.18823] Initializing Models with Larger Ones
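A simplified, assumption-laden sketch of the idea: each student parameter is initialised by slicing the leading entries of the teacher parameter with the same name (the paper's actual procedure also covers layer selection; this is only an illustration):

```python
import torch

@torch.no_grad()
def init_student_from_teacher(student, teacher):
    """Copy the leading slice of each matching teacher parameter into the student."""
    teacher_params = dict(teacher.named_parameters())
    for name, p in student.named_parameters():
        t = teacher_params.get(name)
        if t is None or t.dim() != p.dim():
            continue  # no matching teacher parameter
        if any(ts < ps for ts, ps in zip(t.shape, p.shape)):
            continue  # teacher must be at least as large along every dimension
        slices = tuple(slice(0, s) for s in p.shape)
        p.copy_(t[slices])
```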
Questions
- is it typically using KL divergence? ✅ 2023-12-29
- the distillation loss can be KL Divergence!
Theoretical References
Papers
- [2006.05525] Knowledge Distillation: A Survey (2021) - survey paper
- [2012.09816] Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning (ICLR 2023) - theoretical proof toward understanding
- [1503.02531] Distilling the Knowledge in a Neural Network - seminal paper by Hinton
Articles
- Rachit Singh - Deep learning model compression
- Research Guide: Model Distillation Techniques for Deep Learning | by Derrick Mwiti | Heartbeat
- A beginner’s guide to Knowledge Distillation in Deep Learning
- Three mysteries in deep learning: Ensemble, knowledge distillation, and self-distillation - Microsoft Research
Courses
Code References
Methods
Tools, Frameworks
- SforAiDl/KD_Lib: A Pytorch Knowledge Distillation library
- yoshitomo-matsubara/torchdistill: A coding-free framework built on PyTorch for reproducible deep learning studies in KD.