(PRO) LLM Knowledge Distillation, Simply Explained

AI, But Simple Issue #78

 

This week’s PRO issue is accompanied by a Colab notebook with a simple example for knowledge distillation. As always, the code in this issue can be executed on a normal home PC and does not require a large amount of VRAM.

Imagine a restaurant kitchen with one head chef and his many apprentices. The head chef has spent decades perfecting hundreds of his own recipes.

This involved fine-tuning the ratios of his ingredients, testing subtle variations, and experimenting with different spices and techniques until every recipe had a clear formula to be followed.

The apprentices, on the other hand, are young, quick, and eager, but they don’t have the technical expertise or the breadth of knowledge that the head chef possesses.

Over time, each apprentice learns a few specific dishes from the head chef. They observe the head chef’s most common patterns and behaviors in making these dishes and pick up his intuition as well.

Eventually, each apprentice can make their subset of dishes to a quality similar to the head chef’s, but none of them covers the vast array of recipes and techniques that the head chef knows.

This is how knowledge distillation works for large language models (LLMs).

Rather than running an entire powerful but expensive LLM, we can train a smaller model to “mimic” its behavior. Typically, knowledge distillation is used for specific use cases rather than for general applicability.

Knowledge distillation was introduced in the 2015 paper “Distilling the Knowledge in a Neural Network” by Geoffrey Hinton and his team, but it has only recently started to catch on for LLM applications.

Knowledge distillation can be used to cut compute costs, improve inference speed, and shrink model size. It is one way of performing a common machine learning technique called transfer learning.

An important point to note is that this smaller model does not just learn the outputs of the main LLM; it learns the actual patterns, the internal reasoning, and the structure of the output distributions as well.

This means that, for specific purposes, it offers performance similar to that of the main LLM itself.

This ability to capture the behavior of a larger model and transfer it to a smaller model is what makes knowledge distillation so powerful.

Why Use Knowledge Distillation?

LLMs are trained on trillions of tokens, absorbing information and identifying patterns across natural language processing (NLP), coding, mathematics, and general knowledge.

The cost of running these models is high. They require hundreds or thousands of high-end GPUs, significant memory bandwidth, and expensive computation for both training and deployment.

For real-world deployment, such as for mobile apps, virtual assistants, or any environment where resources are scarce, it is impractical to run a full-sized LLM.

By using knowledge distillation, the main LLM can be used as the head chef from our earlier analogy to “teach” a smaller model (the apprentice) its most useful skills.

  • The head chef is given the title of “teacher,” and the apprentice is given the title of “student.” This is how the models in knowledge distillation are commonly referred to.

The student model will be able to mimic the main (teacher) LLM’s behaviors for specific tasks, such as summarizing text, writing code, or having general conversation.

The main LLM generates high-quality outputs for the specific tasks, and the student model learns to imitate them with an impressive degree of accuracy.

How Knowledge Distillation Actually Works

The intuition behind how this imitation works is straightforward.

Student models are trained on the full probability distribution produced by the teacher LLM. These distributions form what we call “soft labels,” and they are a major reason for knowledge distillation’s strong performance.

  • Learning from the probability distribution means the student is not just learning from the teacher LLM’s outputs, but is also taking into account the confidence levels behind its decisions.

For instance, if the main LLM is 98% confident in the correct token but assigns 1% to an alternate output, that uncertainty is also fed into the student model.

These soft labels contain way more information than a simple output (e.g., cat or dog) and help the student model approximate where the teacher LLM’s decision boundaries are when reaching a conclusion.
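To make this concrete, here is a minimal sketch of a Hinton-style distillation loss, assuming PyTorch and a pair of already-computed logit tensors (the tensors and hyperparameters below are illustrative, not taken from this issue’s notebook). A temperature softens both distributions so the teacher’s small probabilities carry more signal, and the KL divergence between the two is mixed with an ordinary cross-entropy on the hard labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Hinton-style knowledge distillation loss (sketch).

    student_logits, teacher_logits: (batch, num_classes) raw scores
    hard_labels: (batch,) ground-truth class indices
    temperature: softens the distributions so small probabilities carry more signal
    alpha: weight between the soft (KL) term and the hard (CE) term
    """
    # Soft labels: the teacher's probabilities at temperature T
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    # Student log-probabilities at the same temperature
    log_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the teacher and student distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(log_student, soft_targets,
                         reduction="batchmean") * (temperature ** 2)

    # Ordinary cross-entropy against the hard labels
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy usage with random logits (illustration only)
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```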

The student models, due to this training, become very good at the specific tasks they were trained to do. They won’t be as powerful as the teacher model due to size constraints, but the performance for specific tasks will be about the same with a lower computational cost.

The size reduction achieved by knowledge distillation is feasible because typical LLMs tend to be overparameterized, meaning they contain more capacity than is strictly necessary to perform many tasks.

This extra capacity helps generalization during training but is less useful at inference time, when the model is being used for practical purposes.

However, a student model trained to specialize for specific tasks will not face this problem.

Distilling a large model into a small one can also produce more consistent behavior. Because the student inherits high-quality traits from the teacher, distilled models often appear more aligned and more consistent than smaller models trained from scratch.

Real-World Application: DistilBERT

A well-known real-world example of knowledge distillation is DistilBERT, a smaller version of BERT created by distilling the larger BERT-base model.

BERT-base contains around 110 million parameters, making it powerful but expensive to run. DistilBERT reduces this size by about 40%, to roughly 66 million parameters, and runs about 60% faster during inference.

It still achieves more than 97% of BERT’s performance on the GLUE benchmark, which is a widely used collection of language understanding tasks that measure how well a model can properly interpret text.

This was one of the first demonstrations that a carefully distilled model can preserve almost all of the capability of a much larger one.
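To see that size gap concretely, here is a short sketch that loads both checkpoints and counts their parameters. It assumes the Hugging Face transformers library (and PyTorch) is installed; the weights are downloaded on first run.

```python
# pip install transformers torch
from transformers import AutoModel

def count_parameters(model):
    # Total number of trainable parameters in the model
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

bert = AutoModel.from_pretrained("bert-base-uncased")
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")

bert_params = count_parameters(bert)
distil_params = count_parameters(distilbert)

print(f"BERT-base:  {bert_params / 1e6:.1f}M parameters")
print(f"DistilBERT: {distil_params / 1e6:.1f}M parameters")
print(f"Reduction:  {100 * (1 - distil_params / bert_params):.0f}%")
```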

The way DistilBERT was trained shows how the distillation process actually works. The soft labels serve as a more complete signal that helps the smaller model understand the teacher’s behavior more closely than simple labels ever could.

The student is also trained on the same masked language modelling (MLM) objective used in the original BERT, so it retains strong general language skills rather than becoming too narrow or specialized.

Finally, part of the training encourages the student’s internal representations to line up with the teacher’s, allowing it to mirror how the teacher reaches certain outputs.
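Putting those pieces together, a DistilBERT-style objective combines three terms: the soft-label distillation loss, the MLM loss, and a cosine loss that aligns the student’s hidden states with the teacher’s. The sketch below assumes PyTorch tensors for the logits and hidden states; the weighting of the terms is illustrative, not the exact recipe from the DistilBERT paper.

```python
import torch
import torch.nn.functional as F

def distilbert_style_loss(student_logits, teacher_logits,
                          student_hidden, teacher_hidden,
                          mlm_labels, temperature=2.0,
                          w_soft=1.0, w_mlm=1.0, w_cos=1.0):
    """Sketch of a DistilBERT-style objective: soft-label distillation,
    masked language modelling, and cosine alignment of hidden states.
    The term weights here are illustrative placeholders."""
    # 1) Soft-label term: match the teacher's output distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # 2) MLM term: predict the masked tokens (labels of -100 are ignored)
    mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # 3) Cosine term: pull the student's hidden states toward the teacher's
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1),
                        device=student_hidden.device)
    cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    return w_soft * soft + w_mlm * mlm + w_cos * cos
```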

Knowledge Distillation for the Future

Taken together, this shows that knowledge distillation is a reliable and practical way to compress large models into smaller ones without losing much performance.

It is the same principle that now allows modern frontier models to train compact, efficient versions of themselves that can be deployed almost anywhere.

In the LLM ecosystem today, we see the rise of models that seem to be increasingly powerful while simultaneously being significantly smaller in size (e.g., small language models) compared to models from only a couple of years ago.

This is why so many companies have been able to introduce powerful and specialized AI-powered assistants to phones, offline applications, and specific browsers. The process of knowledge distillation has paved the way for the future of efficient AI model development: train big, deploy small.

Large frontier models, such as GPT-5, Claude, and Gemini, exist not only as powerful general-purpose AI systems but also as teacher LLMs that distill knowledge into smaller student models.

These student models are the ones that end up embedded in products, customer workflows, or any environments in which LLMs would be too impractical to deploy.

Chain-of-Thought (CoT) Distillation

One of the more exciting ways this process will likely be used more often is chain-of-thought (CoT) distillation. Here, the main LLM provides both the final output and the intermediate reasoning steps that led to its decision.

This allows the distilled student model to learn multi-step reasoning more efficiently, capturing the main LLM’s logical structure. Current research suggests that smaller CoT-distilled models can even outperform larger models trained without these reasoning traces.
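In practice, CoT distillation is often done at the sequence level: the teacher is prompted to write out its reasoning, and the student is then fine-tuned on the resulting (prompt, reasoning, answer) text. The sketch below shows only the data-preparation step; teacher_generate and questions are hypothetical placeholders standing in for a real teacher API and dataset, not part of any specific library.

```python
# Sketch of building a chain-of-thought distillation dataset.
# `teacher_generate` is a hypothetical stand-in for whatever API or
# local call produces text from the teacher model.

COT_PROMPT = (
    "Question: {question}\n"
    "Think step by step, then give the final answer after 'Answer:'.\n"
)

def build_cot_examples(questions, teacher_generate):
    """Collect (prompt, target) pairs where the target contains the
    teacher's reasoning trace followed by its final answer."""
    examples = []
    for question in questions:
        prompt = COT_PROMPT.format(question=question)
        # The teacher returns its intermediate reasoning and final answer
        teacher_output = teacher_generate(prompt)
        examples.append({"prompt": prompt, "target": teacher_output})
    return examples

# The student is then fine-tuned with ordinary next-token prediction on
# `prompt + target`, so it learns to reproduce the reasoning steps as
# well as the final answer.
```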

As all these techniques continue to mature and develop, the shift to smaller, more effective models will likely affect the general industry landscape of AI implementation.

Large models will remain vital, but more for the purpose of distilling knowledge into smaller student models than for regular use.

Here’s a special thanks to our biggest supporters:

  • Sushant Waidande

  • Sai Krishna Pashapu

  • Bob Boncella

If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!

Want to reach 50,000+ ML engineers? Let’s work together:

Feedback, inquiries? Send us an email at [email protected].

If you like, you can also donate to our team to push out better newsletters every week!

That’s it for this week’s issue of AI, but simple. See you next week!

—AI, but simple team