LLM KV Caching, Simply Explained

AI, But Simple Issue #67

Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.

Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!

A challenge that transformer-based Large Language Models (LLMs) face is efficiency: both in how quickly they generate outputs and in how much memory they use while doing so.

Key-Value (KV) Caching is an optimization technique used in transformers that stores the key and value vectors computed by the self-attention mechanism in memory, so they do not have to be recomputed for every new token, allowing LLMs to generate text faster.

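To make that concrete, here is a minimal sketch of a KV cache during autoregressive decoding, written in plain Python with NumPy. The class and function names (KVCache, decode_step) are illustrative and not taken from any particular library; real implementations work per layer and per attention head, but the idea is the same.

```python
import numpy as np

class KVCache:
    """Stores key and value vectors for past tokens so they are not
    recomputed at every decoding step (illustrative sketch)."""

    def __init__(self, d_model):
        self.keys = np.empty((0, d_model))    # (num_cached_tokens, d_model)
        self.values = np.empty((0, d_model))  # (num_cached_tokens, d_model)

    def append(self, k, v):
        # k, v: (1, d_model) projections for the newly generated token
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])


def decode_step(x_new, W_q, W_k, W_v, cache):
    """One autoregressive step: project only the newest token, then
    attend over the cached keys/values plus the new ones."""
    q = x_new @ W_q                      # (1, d_model)
    k = x_new @ W_k                      # (1, d_model)
    v = x_new @ W_v                      # (1, d_model)
    cache.append(k, v)                   # reuse past K/V instead of recomputing them

    scores = q @ cache.keys.T / np.sqrt(q.shape[-1])   # (1, num_tokens)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over cached positions
    return weights @ cache.values        # (1, d_model) attention output


# Usage: each new token costs work proportional to the current sequence
# length, rather than recomputing attention for the whole sequence.
d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
cache = KVCache(d)
for _ in range(5):                       # generate 5 tokens
    x_new = rng.normal(size=(1, d))      # embedding of the latest token
    out = decode_step(x_new, W_q, W_k, W_v, cache)
```
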
The power of transformer-based models has increased greatly in the past three years and will continue to grow in the next few. As their capabilities grow, so does the need for them to produce useful outputs both accurately and quickly.

If you had asked ChatGPT two years ago to generate a long piece of text or analyze long passages, it would have struggled to do so.

The reason boils down to the fact that earlier transformer models had shorter context windows and an inference runtime that scaled quadratically with the number of input tokens.

  • As the number of input tokens grew, the runtime grew far faster than linearly: doubling the input roughly quadrupled the attention computation, as noted in the original transformer paper. The sketch after this list illustrates that scaling.

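As a rough illustration of that quadratic growth, the snippet below counts the attention-score entries computed for an input of n tokens in a single head of a single layer. The token counts are just example context lengths, not figures from the paper.

```python
# Self-attention compares every token with every other token,
# so the score matrix has n * n entries per head, per layer.
for n in (512, 1024, 2048, 4096, 8192):
    print(f"{n:>5} tokens -> {n * n:>12,} attention scores")

# Doubling the input length roughly quadruples the work:
# 4096 tokens -> ~16.8M scores, 8192 tokens -> ~67.1M scores.
```
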