DL0050 Knowledge Distillation

Describe the process and benefits of knowledge distillation.

Answer

Knowledge Distillation (KD) transfers “dark knowledge” about inter-class relationships from a large, accurate teacher model to a smaller student model. The student learns via a temperature-scaled softmax and a combined distillation plus supervised loss, enabling substantial model compression and faster inference while retaining high accuracy, provided that the teacher quality, student capacity, and hyperparameters are well chosen.

Definition: Knowledge distillation is a process where a smaller model (student) learns to mimic the behavior of a larger, well-trained model (teacher).

Soft Targets: The student is trained not only on hard labels (one-hot) but also on the teacher's soft output probabilities. For example, for an image of a dog, the hard label marks only "dog", while the teacher might assign 0.7 to "dog", 0.25 to "wolf", and 0.05 to "cat", conveying which classes the input resembles.

Temperature Scaling: Teacher logits are softened using a temperature T to reveal more information about class similarities:
\mbox{Softmax}(z_i / T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}
Where:
z_i: raw score (logit) for the i-th class.
K: total number of classes in the classification problem.
T: temperature parameter (T > 0) used to soften the probabilities. A higher T produces a smoother distribution, revealing relationships between classes ("dark knowledge").

The plot below shows the Softmax probabilities for a fixed set of Teacher logits at three different temperatures: increasing the temperature smooths the distribution.
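To make the effect of T concrete, here is a minimal sketch (assuming NumPy; the logit values are illustrative, not taken from the plot) that computes the temperature-scaled softmax at T = 1, 2, and 5:

```python
# Minimal sketch (assumes NumPy): temperature-scaled softmax of fixed teacher logits.
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Compute softmax(z_i / T) in a numerically stable way."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

teacher_logits = [8.0, 3.0, 1.0, 0.5]  # illustrative values

for T in (1, 2, 5):
    probs = softmax_with_temperature(teacher_logits, T)
    print(f"T={T}: {np.round(probs, 3)}")
# Higher T -> smoother distribution, exposing relative class similarities ("dark knowledge").
```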

Loss Function: Typically combines a distillation loss (the mismatch between the teacher's and student's temperature-scaled soft outputs, commonly measured with KL divergence) and the standard cross-entropy loss on the true labels, weighted against each other by a coefficient α.
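A common instantiation (the formulation from Hinton et al.) multiplies the KL-divergence term by T² so its gradient magnitude stays comparable to the cross-entropy term. A minimal sketch, assuming PyTorch and illustrative default values for T and α:

```python
# Minimal sketch (assumes PyTorch): combined distillation + supervised loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, labels)."""
    # Soft targets from the teacher and log-probabilities from the student, both at temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)

    # KL divergence between teacher and student soft distributions; T^2 keeps the gradient scale stable.
    kd_term = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * (T ** 2)

    # Standard supervised cross-entropy with the hard labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1 - alpha) * ce_term
```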

Key Benefits of KD:
(1) Model Compression: the student has far fewer parameters than the teacher while retaining much of its performance, enabling deployment on resource-constrained devices.
(2) Inference Speed: the smaller student significantly decreases latency, making it suitable for edge devices and real-time applications.
(3) Improved Generalization: the teacher's smooth soft targets act as a form of regularization, often leading the student to generalize better than if it were trained on hard labels alone.

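Putting the pieces together, one pass of distillation training might look like the following sketch (assuming PyTorch; `teacher`, `student`, `train_loader`, and `optimizer` are hypothetical placeholders, and `distillation_loss` is the function sketched above):

```python
# Minimal sketch (assumes PyTorch): one epoch of distillation training.
# `teacher`, `student`, `train_loader`, and `optimizer` are hypothetical placeholders.
teacher.eval()                             # the teacher is frozen during distillation
student.train()

for inputs, labels in train_loader:
    with torch.no_grad():
        teacher_logits = teacher(inputs)   # soft targets come from the frozen teacher

    student_logits = student(inputs)
    loss = distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```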