Knowledge Distillation: How Small Models Learn from Big Ones

Introduction

Every year, the largest AI models get bigger. GPT-4, Gemini Ultra, Claude Opus: these models run on clusters of thousands of GPUs and cost hundreds of millions of dollars to train. Deploying them at scale costs dollars per thousand requests. For a startup, a hospital system, or a developer building a mobile app, that is simply not viable.

Knowledge distillation is one of the most practical answers to this problem. Instead of training a small model from scratch and accepting that it will be less capable, distillation trains the small model to mimic a large one. The large model, called the teacher, has already learned a rich internal representation of the world. The small model, called the student, learns not just from raw data labels but from the teacher's own output distributions, which contain far more information than a simple correct-or-incorrect signal.

The result is a student model that often punches well above its weight class. DistilBERT, distilled from BERT, retains about 97 percent of BERT's performance on standard benchmarks while being 40 percent smaller and 60 percent faster. Microsoft's Phi-3 Mini, a 3.8 billion parameter model, is competitive with models many times its size on reasoning benchmarks, partly because its training data included synthetic data generated by much larger models.

This guide explains how distillation works mechanically, why it works at all, and how to decide when it is the right tool for your deployment problem.

Problem Statement: The Cost Gap Between Training and Deployment

The AI industry faces a structural tension. State-of-the-art performance requires large models with tens or hundreds of billions of parameters. But most real-world deployment constraints (latency budgets, memory limits, cost per query, edge hardware, offline inference) pull in the opposite direction. You cannot run a 70 billion parameter model on a smartphone. You cannot serve it at 10 milliseconds per response on a single CPU server. You cannot afford it if your application processes millions of queries per day on thin margins.

The naive solution is to train a smaller model. But smaller models trained from scratch on raw data are simply less capable. A dataset of labelled examples carries a limited training signal, and a small model extracts less from that signal than a large one does.

The insight behind knowledge distillation is that the large model, after training, has already extracted and compressed a great deal of knowledge about the problem into its parameters. Its output probabilities across all classes, not just the top prediction, encode subtle relationships: how similar concepts relate, which errors are plausible, which distinctions matter. A small model trained directly on this richer signal can learn more than one trained on raw labels alone, approaching the performance of the teacher at a fraction of its size and cost.

This idea is not unique to deep learning. The underlying principle, that a knowledgeable expert can teach a novice faster than the novice could learn from first principles, is as old as apprenticeship. Distillation is its formal implementation in machine learning.

Core Concepts and Terminology

Term	Plain English Definition
Teacher model	The large, pre-trained model whose knowledge is being transferred. It is frozen during distillation and used only to generate training signals for the student.
Student model	The smaller model being trained. Its goal is to approximate the teacher's behaviour while using fewer parameters and less compute at inference time.
Soft labels	The teacher's output probability distribution over all classes, as opposed to a hard label which is simply the correct class. Soft labels contain information about which wrong answers are plausible and how similar different outputs are.
Hard labels	The ground-truth correct answers from the training dataset. A hard label for an image of a cat is simply "cat," with no information about the model's uncertainty or the cat's similarity to a dog.
Temperature (T)	A parameter that divides the logits before the softmax, applied to both teacher and student during distillation, and that controls how "soft" the resulting distribution becomes. Higher temperature spreads probability more evenly across all classes, exposing more of the information in the distribution. Set to 1 at inference time.
Distillation loss	A loss function measuring the difference between the student's output distribution and the teacher's output distribution. Most often computed as KL divergence.
KL divergence	A measure of how different one probability distribution is from another. Used to push the student's output distribution toward the teacher's distribution.
Feature distillation	A variant where the student is also trained to match the teacher's internal intermediate representations, not just its final outputs.
Data-free distillation	A family of methods that perform distillation without the original training data, generating synthetic inputs using the teacher itself.
Logits	The raw unnormalized scores produced by the final layer of a neural network, before the softmax function converts them into probabilities.
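
The definitions of logits, temperature, and KL divergence above can be written compactly. The notation here is an assumption of this sketch, not something fixed by the text: z_i are the logits, T is the temperature, and p^(T) and q^(T) are the teacher's and student's temperature-scaled distributions.

```latex
p_i^{(T)} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
\qquad
\mathrm{KL}\left(p^{(T)} \,\|\, q^{(T)}\right)
  = \sum_i p_i^{(T)} \log \frac{p_i^{(T)}}{q_i^{(T)}}
```

At T = 1 the first expression is the ordinary softmax. As T grows, the distribution approaches uniform. The KL term is zero exactly when the student's distribution matches the teacher's.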

How It Works: The Distillation Process

The mechanics of knowledge distillation are straightforward once you understand what soft labels contain and why temperature matters.

1. Train or obtain a large teacher model. This model is trained to high performance on your task using standard methods. It could be a model you trained yourself, or a pre-trained model such as GPT-4 or Llama 3 that you license or access via API. The teacher does not change during distillation.
2. Prepare the training data. Pass your training dataset through the teacher and record its output for every example, ideally the raw logits so that temperature can be applied afterwards. The resulting probability distributions become the soft labels. For a classification task with 1,000 classes, each soft label is a vector of 1,000 numbers that sum to one.
3. Apply temperature scaling to the teacher's outputs. Before the softmax, divide the teacher's logits by a temperature value T (typically 2 to 20). At T=1, the distribution is the standard softmax. At T=5, the distribution spreads out, and the small probabilities on non-top classes become larger and more informative. This is where the "dark knowledge" lives: the teacher's belief that a cat image is 3% dog and 1% fox tells you something meaningful about visual similarity.
4. Define the student's combined loss function. The student is trained on a weighted combination of two losses. The first is the standard cross-entropy loss against the hard labels from your dataset. The second is the distillation loss, measuring the KL divergence between the student's temperature-scaled outputs and the teacher's temperature-scaled outputs, both computed at the same T. A typical weight is 90% distillation loss, 10% hard label loss, but this is tuned per task.
5. Train the student normally. With this combined loss, you train the student model using standard gradient descent. The student learns simultaneously from the ground truth data and from the teacher's probabilistic judgements.
6. Restore temperature to 1 for inference. When the student model is deployed, temperature is reset to 1 and the model produces standard probability distributions. The temperature was only needed during training to amplify the soft label signal.
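
The combined loss described in the steps above can be sketched for a single example in plain Python. This is a minimal illustration, not a production implementation: the function names, the example logits, and the default `alpha=0.9` and `temperature=4.0` are assumptions chosen to mirror the text. The multiplication by T squared is not mentioned above. It follows the recommendation in Hinton et al. (2015) for keeping the soft-loss gradients on a comparable scale as T changes.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.9):
    """Weighted sum of soft-label KL loss and hard-label cross-entropy."""
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # T^2 scaling (Hinton et al., 2015); an addition not stated in the text.
    soft_loss = kl_divergence(teacher_soft, student_soft) * temperature ** 2
    # The hard-label term uses the student's ordinary T=1 distribution.
    hard_loss = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Hypothetical logits for one three-class example.
teacher_logits = [4.0, 1.5, -0.5]
student_logits = [2.0, 1.0, 0.0]
loss = distillation_loss(student_logits, teacher_logits, hard_label=0)
print(f"combined loss: {loss:.4f}")
```

In a real training loop the same computation would run over batches inside an autodiff framework, with gradients flowing only into the student.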

Practical Example: Distilling a Sentiment Classifier

Imagine you have a large BERT-large model (340 million parameters) that classifies customer reviews as positive, neutral, or negative with 94% accuracy. It takes 80 milliseconds per review on your server. You need sub-10 millisecond latency for a real-time dashboard.

You decide to distill it into a smaller 4-layer transformer with 66 million parameters. Here is what the process looks like in practice.

First, you run all 100,000 training reviews through BERT-large and soften its outputs at temperature T=4. For a strongly positive review, the teacher's standard (T=1) output might be: positive 91%, neutral 8%, negative 1%. At T=4, this becomes approximately positive 54%, neutral 29%, negative 17%. The neutral and negative signals, invisible in the hard label "positive," are now clearly visible to the student.
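
The softened values for this review can be checked with a few lines of Python. Dividing logits by T before the softmax is equivalent to raising each T=1 probability to the power 1/T and renormalising, which is what the sketch below does. The helper name `soften` is an assumption of this example.

```python
def soften(probs, temperature):
    """Re-temper a T=1 distribution: p_i ** (1/T), then renormalise."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [x / total for x in powered]

teacher_t1 = [0.91, 0.08, 0.01]            # positive, neutral, negative
teacher_t4 = soften(teacher_t1, 4.0)
print([round(p, 2) for p in teacher_t4])   # [0.54, 0.29, 0.17]
```

The negative class rises from 1% to roughly 17%, which is the kind of signal a hard label cannot carry.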

The student then trains with 90% weight on these soft labels and 10% weight on the original hard labels. After 3 epochs, the student reaches 91% accuracy on the test set, compared to 94% for the teacher. But the student runs in 7 milliseconds, well within the latency budget, uses one-fifth the memory, and costs a fraction as much to serve.

The 3-percentage-point accuracy gap is the cost of the compression. Whether that trade-off is acceptable depends on your specific product requirements. For many applications, 91% accuracy with inference more than 10x faster (80 milliseconds down to 7) is the right answer.

Advantages

Better Accuracy Than Training from Scratch at the Same Size

A small model trained from scratch on your dataset is limited by what the dataset can teach it. Distillation lets the student access the teacher's implicit knowledge about relationships, ambiguities, and uncertainty, which is not present in the raw labels. The student can therefore achieve accuracy that a from-scratch trained model of the same size could not.

Faster Inference at Deployment

The primary reason to distill is deployment economics. A student that is 3x to 10x smaller is typically much faster and cheaper to serve, although the speedup is not always exactly proportional to the size reduction. For applications where the teacher's full capability is not needed on every query, this is a straightforward win.

Works Across Modalities

Distillation is not specific to text classification. It has been applied to image classification, object detection, speech recognition, code generation, and large language model fine-tuning. The core mechanism, training a student on the teacher's output distributions, applies anywhere the teacher produces probability distributions.

Preserves Model Interpretability Options

Because the student is a standard neural network of your choosing, you can select an architecture that supports interpretability methods. You could distill a black-box ensemble into a smaller model with attention mechanisms that are easier to audit.

Enables On-Device and Edge Deployment

Models that would never fit in a smartphone's memory or a browser's WebAssembly environment can be distilled into versions that do. Apple uses distillation extensively to build on-device models for features like Siri and autocorrect that run without a network connection.

Limitations and Trade-offs

Performance Gap Below Teacher

Distillation narrows the gap between a small and large model, but it does not close it entirely. The student will almost always be somewhat less accurate than the teacher. If your task requires the absolute maximum performance and you have the infrastructure to serve a large model, distillation may not be the right choice.

Requires Access to Teacher Outputs

Standard distillation requires that you can run inference on the teacher model and collect its output probabilities. If your teacher is a closed model accessible only through an API, you may not have access to full probability distributions. Some APIs return only the top prediction or a confidence score, not the full softmax distribution, which limits what you can extract.

Training Cost Is Not Zero

Distillation requires running your entire training dataset through the teacher model (which may be expensive via API) and then training the student. For very large datasets and very large teachers, the teacher inference pass alone can be costly. You are trading training cost for deployment cost savings, and the payoff requires sufficient deployment volume to justify the upfront expense.
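
The payoff condition can be made concrete with a break-even calculation. The sketch below is only an illustration: the function name and every dollar figure are hypothetical placeholders, to be replaced with your own measurements.

```python
def break_even_queries(teacher_labelling_cost, student_training_cost,
                       teacher_cost_per_query, student_cost_per_query):
    """Number of served queries after which distillation has paid for itself."""
    upfront = teacher_labelling_cost + student_training_cost
    saving_per_query = teacher_cost_per_query - student_cost_per_query
    if saving_per_query <= 0:
        raise ValueError("student must be cheaper per query than the teacher")
    return upfront / saving_per_query

# Hypothetical figures, in dollars.
queries = break_even_queries(
    teacher_labelling_cost=500.0,
    student_training_cost=1500.0,
    teacher_cost_per_query=0.002,
    student_cost_per_query=0.0002,
)
print(f"break-even after about {queries:,.0f} queries")
```

If expected lifetime traffic is well below the break-even point, serving the teacher directly may be the cheaper option.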

Distribution Shift Sensitivity

If the teacher was trained on data from a different distribution than your deployment data, the soft labels it produces may not generalise well to your use case. Distilling a general-purpose language model into a domain-specific student works best when the teacher has at least some competence on the target domain.

Hyperparameter Sensitivity

The temperature T and the weighting between soft and hard label losses are significant hyperparameters that require tuning. The optimal values vary substantially across tasks and architectures. Getting distillation to work well requires experimentation, which adds time to your development cycle.

Common Mistakes

Using Temperature 1 for the Soft Labels

At temperature 1, the teacher's output distribution for a correctly predicted example is already very peaked, with nearly all probability mass on the correct class. The soft label is almost identical to the hard label, and the student gains almost no additional information. Always experiment with temperatures above 1, typically between 2 and 10, to reveal the dark knowledge in the distribution.
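
A small experiment shows how peaked a confident teacher is at T=1 and how the distribution opens up as T rises. The logits below are hypothetical. Entropy is used here as a rough measure of how much information the non-top classes carry.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a softer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident_logits = [9.0, 2.0, 1.0, 0.0]  # hypothetical teacher logits
for t in (1, 2, 4, 8):
    probs = softmax(confident_logits, t)
    print(f"T={t}: top={probs[0]:.3f} entropy={entropy(probs):.3f}")
```

At T=1 the top class takes more than 99% of the mass, so the soft label is nearly a hard label. Each increase in T lowers the top probability and raises the entropy.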

Choosing a Student Architecture That Is Too Small

Distillation cannot create something from nothing. A student with a fraction of a percent of the teacher's capacity will hit a capacity wall regardless of the quality of the training signal. The student must be large enough to represent the behaviour the teacher is demonstrating. A rule of thumb is to start with a student that is 20% to 50% the size of the teacher and compress further only if initial results are acceptable.

Distilling to a Completely Different Architecture Without Feature Matching

Output-only distillation works well when student and teacher share a similar architectural family. When they are very different (distilling a large transformer into a convolutional network, for example), output-only distillation often struggles. In these cases, adding intermediate feature matching (training the student to also match the teacher's internal representations) often improves outcomes significantly.

Skipping Evaluation on Task-Specific Metrics

Distillation is often evaluated on benchmark accuracy, but your actual task may care about precision, recall, F1, calibration, or latency at a specific percentile. A student that matches the teacher on accuracy may perform very differently on these metrics. Always evaluate against what actually matters in your deployment.
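
The sketch below illustrates how two models with identical accuracy can differ on a per-class metric, and how to read a tail-latency percentile. All labels, predictions, and latencies are made-up toy data, and the helper names are assumptions of this example.

```python
import math

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def percentile(values, q):
    """Nearest-rank percentile, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Toy data: both models score 75% accuracy.
y_true  = ["pos", "pos", "pos", "neu", "neu", "neg", "neg", "neg"]
teacher = ["pos", "pos", "neu", "neu", "pos", "neg", "neg", "neg"]
student = ["pos", "pos", "pos", "neu", "neu", "neg", "neu", "neu"]
latencies_ms = [5, 6, 6, 7, 7, 7, 8, 9, 12, 30]

for name, preds in (("teacher", teacher), ("student", student)):
    _, recall, _ = precision_recall_f1(y_true, preds, "neg")
    print(name, accuracy(y_true, preds), round(recall, 2))
print("p95 latency (ms):", percentile(latencies_ms, 95))
```

In this toy data the student misses two of the three negative reviews despite matching the teacher's accuracy. The median latency looks healthy while the 95th percentile does not.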

Assuming Distillation Fixes a Bad Teacher

Distillation transfers what the teacher knows, including its biases, failure modes, and calibration errors. If the teacher is poorly calibrated or biased on certain subpopulations, the student will inherit these problems. Distillation is not a model improvement technique; it is a model compression technique.

Best Practices

Start with Output Distillation, Add Feature Distillation If Needed

Output distillation (matching only the final softmax distributions) is simpler to implement and often sufficient. Start there. If the performance gap between student and teacher is larger than acceptable, add intermediate layer matching: pick one or two internal layers in the teacher and train the student to match their activations via an adapter projection.
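
Intermediate layer matching with an adapter projection can be sketched as follows. This is a minimal illustration under stated assumptions: the adapter is a plain linear map from the student's hidden width to the teacher's, the loss is mean squared error, and all vectors and weights are hypothetical. In practice the adapter would be learned jointly with the student.

```python
def project(vector, weights):
    """Linear adapter; weights has shape [teacher_dim][student_dim]."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def feature_matching_loss(student_hidden, teacher_hidden, adapter):
    """MSE between the projected student activations and the teacher's."""
    projected = project(student_hidden, adapter)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(teacher_hidden)

student_hidden = [0.5, -1.0]                  # student width 2 (hypothetical)
teacher_hidden = [0.2, 0.1, -0.4, 0.3]        # teacher width 4 (hypothetical)
adapter = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]

loss = feature_matching_loss(student_hidden, teacher_hidden, adapter)
print(f"feature loss: {loss:.6f}")
```

This term is usually added to the output distillation loss with its own weight, which becomes one more hyperparameter to tune.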

Tune Temperature with a Validation Set

Run a quick sweep over temperatures (2, 4, 8, 16) and measure validation accuracy for each. The optimal temperature varies significantly by task. Higher temperatures work better when the teacher is very confident on most examples. Lower temperatures work better when the teacher's distributions are already soft.
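
A sweep of this kind needs very little scaffolding. In the sketch below, `train_and_evaluate` is a hypothetical callable that trains a student at one temperature and returns its validation accuracy. The accuracy figures are placeholders so that the example runs, not measured results.

```python
def sweep_temperature(train_and_evaluate, temperatures=(2, 4, 8, 16)):
    """Train one student per temperature and return the best setting."""
    scores = {t: train_and_evaluate(t) for t in temperatures}
    best = max(scores, key=scores.get)
    return best, scores

# Stand-in for real training runs: placeholder validation accuracies.
placeholder_results = {2: 0.894, 4: 0.910, 8: 0.903, 16: 0.887}
best_t, scores = sweep_temperature(placeholder_results.get)
print("best temperature:", best_t)
```

Keeping the student architecture, data, and loss weighting fixed across the sweep makes the comparison between temperatures meaningful.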

Use the Teacher for Data Augmentation

Generate additional synthetic training examples by prompting the teacher on edge cases, out-of-distribution inputs, or augmented versions of your data. Label these with the teacher's soft outputs. This is particularly effective for language tasks where you can generate varied phrasings of the same underlying query.

Consider Progressive Distillation for Very Large Compression Ratios

If you need to compress a model by more than 10x, distilling directly to the final size often leaves too large a performance gap. Consider distilling in stages: first from the teacher to a medium intermediate model, then from the intermediate model to the final small student. Each step is a more tractable compression ratio.
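
One way to plan the stages is to space model sizes geometrically so that no single step exceeds a chosen compression ratio. The geometric spacing and the function name are assumptions of this sketch. The 10x cap mirrors the threshold mentioned above.

```python
import math

def plan_stages(teacher_params, student_params, max_ratio=10.0):
    """Model sizes from teacher to student, each step at most max_ratio."""
    total_ratio = teacher_params / student_params
    stages = max(1, math.ceil(math.log(total_ratio) / math.log(max_ratio)))
    step = total_ratio ** (1.0 / stages)  # equal ratio at every stage
    return [teacher_params / step ** i for i in range(stages + 1)]

# Hypothetical sizes: a 3.5B teacher compressed 50x into a 70M student.
sizes = plan_stages(3.5e9, 70e6)
print([f"{s / 1e6:.0f}M" for s in sizes])
```

Here a 50x compression becomes two steps of roughly 7x each, with an intermediate model of about 495 million parameters.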

Comparison: Model Compression Approaches

Method	How It Works	Best For	Key Trade-off
Knowledge Distillation	Train a smaller student model to mimic a larger teacher's output distributions	Achieving near-teacher accuracy in a smaller model; any modality	Requires a teacher inference pass and a full training run for the student
Quantization	Reduce the numerical precision of model weights from 32-bit floats to 8-bit or 4-bit integers	Reducing memory and speeding up inference on the same model	Some accuracy loss; may require calibration data; hardware dependent
Pruning	Remove individual weights, neurons, or attention heads that contribute little to model output	Creating sparse models; reducing parameter count without retraining from scratch	Irregular sparsity is hard to accelerate; structured pruning loses more accuracy
Architecture Search (NAS)	Automatically find a smaller architecture that achieves good performance on your task	Finding the most efficient architecture for a given accuracy target	Very computationally expensive to run; requires task-specific search
LoRA / Adapter Fine-tuning	Add small trainable modules to a frozen large model instead of updating all parameters	Efficient fine-tuning of large models; does not reduce deployment model size	Does not reduce inference cost; base model must still be served

Frequently Asked Questions

Does distillation always make a worse model than the teacher?

Almost always, yes, in the sense that the student will have somewhat lower performance than the teacher on the teacher's original training distribution. However, a distilled student often outperforms a model of the same size trained from scratch, because the teacher's soft labels provide a richer training signal than raw dataset labels. In other words, the student is usually better than a model of its size trained without distillation, even if it does not surpass the teacher.

What is "dark knowledge" and why does it matter?

Dark knowledge is Geoffrey Hinton's term for the information encoded in the non-top probabilities of a teacher's softmax distribution. When a teacher classifies an image of a BMW as "automobile" with 98% confidence, it also assigns small probabilities to "truck" (1.5%) and "van" (0.3%). These small values encode the teacher's learned understanding that automobiles, trucks, and vans share visual features. A student trained only on the hard label "automobile" never sees this relationship. Dark knowledge transfers structural understanding of the problem, not just which answer is correct.

Can you distill from a model you do not have weights for, like GPT-4?

Yes, but with limitations. If you only have API access, you can use the model's text outputs as training data for a smaller model. This approach, often called black-box or sequence-level distillation, generates a large set of (prompt, response) pairs using the teacher and trains the student on them directly. You lose the soft label signal (you only get generated text, not probability distributions), but you still benefit from the teacher's knowledge being encoded in the generated outputs. This is how many smaller instruction-following models are trained. Check the provider's terms of use first, since some restrict using outputs to train other models.
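
Collecting the (prompt, response) pairs is mostly bookkeeping. In the sketch below, `query_teacher` is a hypothetical callable standing in for whatever API client you use. No real API is called, and the JSONL layout is just one common choice for fine-tuning data.

```python
import json

def build_distillation_set(prompts, query_teacher):
    """Collect (prompt, response) pairs from a text-only teacher."""
    records = []
    for prompt in prompts:
        response = query_teacher(prompt)
        if response and response.strip():  # drop empty generations
            records.append({"prompt": prompt, "response": response.strip()})
    return records

def to_jsonl(records):
    """Serialise records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Stand-in teacher so the sketch runs offline.
def fake_teacher(prompt):
    return f"Answer to: {prompt}"

records = build_distillation_set(
    ["What is distillation?", "Define logits."], fake_teacher
)
print(to_jsonl(records))
```

The student is then fine-tuned on these pairs with an ordinary next-token objective, and the quality of the prompt set largely determines what it learns.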

How is Phi-3 related to distillation?

Microsoft's Phi models are a prominent example of what the Phi team calls "textbook quality" data. Rather than distilling softmax distributions, the approach combines heavily filtered web data with a large corpus of synthetic training data generated by larger language models, then trains a small model (3.8B parameters in the case of Phi-3 Mini) on this curated dataset. The result is a model that performs remarkably well on reasoning and coding benchmarks despite its small size, in part because its synthetic training data reflects the knowledge of the larger models that generated it. It is a form of data-level distillation rather than logit-level distillation.

When should I use distillation versus quantization?

These are not mutually exclusive and are often combined. Quantization is faster to apply (no retraining required, just post-processing) and works well when you need to reduce memory usage of an existing model. Distillation requires retraining but produces a smaller model that is more portable and can be further quantized afterward. Use quantization when you have a trained model and want to reduce its footprint with minimal engineering effort. Use distillation when you have flexibility in the student architecture and want to maximise performance at a target size. Use both when you need the deepest compression ratio.

References

Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531. The foundational paper introducing the temperature-scaled soft label framework.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Demonstrated that BERT could be compressed 40% with only a 3% performance drop.
Abnar, S., and Zuidema, W. (2020). Quantifying Attention Flow in Transformers. ACL 2020. Background on how information flows through transformer attention layers, relevant when reasoning about intermediate representations.
Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. (2023). Textbooks Are All You Need. arXiv preprint arXiv:2306.11644. Describes the Phi-1 approach to data-quality-driven small model training.
Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. (2014). FitNets: Hints for Thin Deep Nets. arXiv preprint arXiv:1412.6550. Introduced intermediate feature matching as an extension to output-only distillation.

Key Takeaways

Knowledge distillation trains a small student model to mimic a large teacher's output distributions, not just match hard labels. This richer training signal allows the student to exceed what its size alone would normally allow.
Temperature scaling is the key mechanism that makes soft labels informative. Higher temperature spreads probability mass across all classes, revealing the teacher's beliefs about similarity and ambiguity.
The student's loss is a weighted combination of distillation loss (matching the teacher's soft outputs) and standard cross-entropy loss (matching hard labels). The distillation component typically receives most of the weight.
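Distillation does not require the student to have the same architecture as the teacher. Any architecture that produces probability distributions over the same output classes can be the student, although very different architectures often benefit from intermediate feature matching.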
Distillation is not a replacement for architecture search, quantization, or pruning. It is most effective when combined with those methods, applied at different stages of the compression pipeline.
The main practical decision is access to teacher outputs. Full probability distributions give the best training signal. If only API text outputs are available, training on the teacher's generated text (black-box distillation) is still substantially better than training on unfiltered data alone.
