
LLM KV Caching, Simply Explained (PRO)
AI, But Simple Issue #67
This week’s PRO issue is accompanied by a Colab notebook that includes a walkthrough of KV caching and a speed comparison between generation with and without the cache. As always, the code in this issue can be executed on a normal home PC and does not require a large amount of VRAM.
A challenge that transformer-based Large Language Models (LLMs) face is efficiency: efficiency in generating outputs and in using memory.
Key-Value (KV) Caching is an optimization technique used in transformers that stores intermediate results of the self-attention mechanism (the key and value vectors) in memory, allowing LLMs to generate text faster.
The power of transformer-based models has increased greatly in the past three years and will continue to grow in the next few. With it comes an increased need for models to produce viable outputs with both precision and speed.
If you had asked ChatGPT two years ago to generate a long piece of text or analyze long passages, it would have struggled to do so.
The reason boils down to the fact that earlier transformer models had a shorter context window and an inference runtime that scaled quadratically with the number of input tokens.
As the input grew, the runtime grew far faster than the number of tokens, a quadratic cost noted in the original transformer paper.

The Basics of Self-Attention
Self-attention is the core of decoder-style (autoregressive) transformer models, which generate text by predicting the next word/token based on what they’ve already seen.
Something important to highlight is that autoregressive models do not have access to future context as they generate the output, meaning that they are causally masked.

The example below shows causal masking with the phrase “The cat sat on the mat”:
To predict "cat" → model can only see "The"
To predict "sat" → model can only see "The cat"
To predict "on" → model can only see "The cat sat"
To predict "the" → model can only see "The cat sat on"
To predict "mat" → model can only see "The cat sat on the"
The masking ensures that the transformer can only look at past tokens for context. It uses past tokens to generate new ones, which is how LLMs generate large bodies of text!
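To make the pattern concrete, here is a minimal Python sketch of a causal mask for the same toy sentence; the token list and variable names are purely illustrative, not from the article.

```python
import numpy as np

# Toy sentence from the example above (token names are illustrative).
tokens = ["The", "cat", "sat", "on", "the", "mat"]
n = len(tokens)

# Causal mask: position i may only attend to positions j <= i.
mask = np.tril(np.ones((n, n), dtype=bool))

for i, tok in enumerate(tokens):
    visible = [tokens[j] for j in range(n) if mask[i, j]]
    print(f"{tok!r} can attend to: {visible}")
```

Each token attends only to itself and the tokens before it, matching the prediction pattern listed above.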

Mathematically, what takes place here is a series of vector operations.
Self-attention converts the input token vectors into output vectors using learned weight matrices, which influence the generation of new tokens.
In the process, the model learns meaning and context within the input; it can look back at any earlier token to derive the meaning of the current token.
Self-attention is mathematically computed as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where Q, K, and V are the query, key, and value matrices and d_k is the dimension of the key vectors.
Q is made up of query vectors derived from the tokens. You can interpret a query vector as asking, “What information do I need?”
K, in turn, is made up of key vectors which, under the same interpretation, ask, “What information can I provide?”
In reality, these vectors are all just numbers derived from linear transformations of the token embeddings. They don’t actually have semantic roles, such as asking about information, but the metaphor is a helpful one.
Understanding the inner workings of models at this level is the central question of interpretability in AI.
The query and key vectors are combined through matrix multiplication to produce attention scores, which measure how much attention each token should pay to every other token.

As an example, take the phrase “The red dog barks” and highlight “dog.” The query vector for “dog” would ask, “What type of dog am I?” The token “red” would supply the most important key vector, giving the query for “dog” its context.
The attention score generated here would be very high, as “dog” pays close attention to the word “red.”
The key insight here is that every token does this with every other token:
"red" → looks at "dog" (what am I describing?)
"barks" → looks at "dog" (who is barking?)
"The" → looks at "dog" (what am I introducing?)
Along with these attention scores, the actual information in each token is passed on through value vectors, which carry meaning in vector space. Multiplying the attention weights with these value vectors produces the output used in generation.
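As a sketch of the full computation, here is a small, self-contained PyTorch example of scaled dot-product self-attention with a causal mask; the dimensions and random weight matrices are toy values chosen only for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, d_k = 4, 8, 8          # toy dimensions

x = torch.randn(seq_len, d_model)        # token embeddings
W_q = torch.randn(d_model, d_k)          # learned projection matrices
W_k = torch.randn(d_model, d_k)          # (random here, for illustration)
W_v = torch.randn(d_model, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v      # queries, keys, values

# Scaled dot-product scores, then a causal mask so each token
# attends only to itself and earlier positions.
scores = (Q @ K.T) / d_k ** 0.5
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))

attn_weights = F.softmax(scores, dim=-1)  # softmax(QKᵀ / √d_k)
output = attn_weights @ V                 # weighted sum of value vectors
print(output.shape)                       # torch.Size([4, 8])
```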
The Concept Behind Key-Value Caching
Now that we understand the self-attention mechanism and the calculations behind the scenes, let’s discuss the technique of KV caching.
As mentioned before, autoregressive models produce one token at a time, only allowing themselves to view tokens they’ve already generated. This process is known as inference, as the model is predicting or inferring the output.
During inference, the model performs the self-attention matrix calculations each time a new word is generated.
However, it doesn’t just do one additional matrix calculation for the new word; it recalculates the key and value vectors for all the previously generated words, along with the new one.
Example: I run fast.
After tokenizing the sentence, we would get “I”, “run”, and “fast”.

Notice that by the time the “fast” token is processed, the key and value vectors for “I” have been computed three times and those for “run” twice. Three of the nine attention values are masked; they are still calculated but are never shown to the model.
Each time a new token is introduced, the key and value vectors for all of the previous tokens in the attention pattern must be recomputed.
The issue is very clear: in practice, the compute power and time needed for long text generations would skyrocket, with unnecessary calculations repeated for every word generated.
This is the quadratic scaling effect mentioned at the beginning. In this example, nine attention scores were computed for just three tokens; now imagine 1,000 tokens, or even 1,000,000.
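A quick back-of-the-envelope check of this scaling argument, counting how many times key and value vectors get computed with and without a cache (a toy calculation, not from the article):

```python
def kv_computations_without_cache(n):
    # Step t recomputes key/value vectors for all t tokens seen so far:
    # 1 + 2 + ... + n = n * (n + 1) / 2
    return n * (n + 1) // 2

def kv_computations_with_cache(n):
    # Each token's key/value vectors are computed exactly once.
    return n

for n in (3, 1_000, 1_000_000):
    print(f"{n:>9} tokens: {kv_computations_without_cache(n):,} "
          f"recomputations without a cache vs {kv_computations_with_cache(n):,} with one")
```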

This is exactly where KV caching comes in. KV caching mitigates this issue by storing each token’s key and value vectors in memory, and the cache is updated as new keys and values are generated.
By storing this information, the number of matrix calculations drops dramatically, allowing computation to stay efficient and accurate even for longer texts. The cost of each generation step drops by one order, from quadratic to linear in the number of tokens processed so far.
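Here is a minimal sketch of the idea in PyTorch: each step computes query, key, and value vectors only for the newest token, appends the key and value to the cache, and attends over everything stored so far. All names and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_k = 8, 8

# Learned projections (random here, purely illustrative).
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

k_cache, v_cache = [], []   # the KV cache: one key/value vector per past token

def attend_with_cache(x_new):
    """Process ONE new token: compute its q/k/v, extend the cache, and
    attend over all cached keys/values (work grows linearly with cache size)."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    K = torch.stack(k_cache)            # (cache_len, d_k)
    V = torch.stack(v_cache)            # (cache_len, d_k)
    scores = (K @ q) / d_k ** 0.5       # only the new token's query is needed
    weights = F.softmax(scores, dim=-1)
    return weights @ V                  # attention output for the new token

# Simulate generating three tokens, one at a time.
for step in range(3):
    out = attend_with_cache(torch.randn(d_model))
    print(f"step {step}: cache holds {len(k_cache)} key/value pairs")
```

Causality comes for free here: the cache only ever contains past tokens, so no explicit mask is needed during generation.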

However, the tradeoff is that the KV cache takes up memory (often a lot of memory), and extra work is needed to manage the cache and its overhead. In large models with long contexts, the KV cache can even grow larger than the model weights themselves.
If the memory cost of caching becomes too substantial, the cache can be shrunk, though typically at the cost of some model accuracy.
It comes down to striking a balance between the two, but with KV caching in place, generation speeds up substantially.
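For a rough sense of the speedup in practice, a sketch along the lines of the accompanying notebook (assuming the Hugging Face transformers library and the small GPT-2 checkpoint; this is not the notebook’s exact code) might look like:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("The cat sat on", return_tensors="pt")

def timed_generate(use_cache):
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False,
            use_cache=use_cache,                 # toggle the KV cache
            pad_token_id=tokenizer.eos_token_id,
        )
    return time.perf_counter() - start

print(f"with KV cache:    {timed_generate(True):.2f} s")
print(f"without KV cache: {timed_generate(False):.2f} s")
```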
Key-Value Caching in Practice
Preserving the model’s context is important to LLM applications where tasks are more than just simple question answering. Chatbots, assistants, and other tools need to preserve as much context as possible to support the user.
However, if we keep the cached keys and values for every generation step, how does this remain more efficient than regular inference once the cache has filled the available memory?
The answer lies in cleverly deciding when to clear the cache, freeing up space for further generation to occur.
Take the scenario of a customer service bot: you could clear the cache once the user exits the interface, a reasonable choice considering the interaction is over once the customer leaves.

When the cache fills up, clearing it restores the model’s “state of mind,” allowing it to serve the customer fully. Of course, practical issues may still arise, but those come down to how much memory the model is given for its cache.
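One way this could look in code (a hypothetical sketch, assuming a Hugging Face-style model whose forward pass accepts and returns past_key_values; the session bookkeeping is entirely made up for illustration):

```python
session_caches = {}   # session_id -> past_key_values for that conversation

def handle_message(session_id, new_input_ids, model):
    # Reuse this session's cache if it exists; pass only the NEW tokens,
    # since the cached keys/values already cover the earlier ones.
    past = session_caches.get(session_id)
    out = model(input_ids=new_input_ids, past_key_values=past, use_cache=True)
    session_caches[session_id] = out.past_key_values
    return out.logits

def end_session(session_id):
    # Clear the cache once the user exits, freeing memory for other sessions.
    session_caches.pop(session_id, None)
```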
Another significant aspect of KV cache management to consider is cache reuse. In settings where the same data may be requested at different times, cache generated at any point can be kept around, since it might be requested again in the near future.
This and other cache techniques are explored in the PagedAttention paper, released by UC Berkeley researchers in 2023.

The needed key and value pairs are identified by analyzing the incoming user input and matching it against previously stored cache entries.
This allows the model to produce quicker responses at a moment’s notice, which is useful in applications where assistants support high-pressure settings, such as a medical office keeping track of inventory.
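A very simplified, hypothetical sketch of cache reuse (real systems such as vLLM’s PagedAttention manage this at the level of fixed-size blocks; the string-prefix lookup here is only meant to convey the idea):

```python
prefix_cache = {}   # prompt prefix (text) -> previously computed past_key_values

def lookup_reusable_cache(prompt):
    # If a new request starts with a prefix we have already processed,
    # its keys and values do not need to be recomputed.
    for prefix, past in prefix_cache.items():
        if prompt.startswith(prefix):
            return prefix, past
    return "", None          # no match: start from an empty cache

def store_prefix_cache(prefix, past_key_values):
    prefix_cache[prefix] = past_key_values
```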
The Bottom Line
KV caching is an important technique for LLM optimization and efficiency. By storing the key and value vectors from previous computations, it speeds up otherwise slow inference by reducing its time complexity.
The trade-off is straightforward: memory in exchange for speed and consistency. By preserving attention computations, KV caching allows models to maintain their performance on longer inputs, such as extended conversations or complex reasoning tasks.
As LLMs become increasingly integrated into real-world applications, KV caching allows them to output tokens much faster. This reduces the latency to the end user and enables more possibilities.
KV caching has now become a necessary architectural technique, used in virtually all recent LLMs such as DeepSeek, ChatGPT, and Gemini; it is what makes decoder-only inference practical at scale.
Here’s a special thanks to our biggest supporters:
Sushant Waidande
Sai Krishna Pashapu
If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!
Want to reach 50,000+ ML engineers? Let’s work together:
Feedback, inquiries? Send us an email at [email protected].
If you like, you can also donate to our team to push out better newsletters every week!
That’s it for this week’s issue of AI, but simple. See you next week!
—AI, but simple team