Quantization for LLMs, Simply Explained

AI, But Simple Issue #61

Hello from the AI, But Simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.

Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!

Model quantization is a technique used to reduce the size of large neural networks, especially Large Language Models (LLMs), by lowering the precision of their weights.

An LLM's size is determined by the number of parameters it has, and most LLMs live up to their name: they are, in fact, large. You might've heard that GPT-3 has 175 billion parameters.

To hold this sheer number of parameters, LLMs generally need many GPUs with huge amounts of VRAM, and serving them at scale requires even more of these GPUs.

Parameters are simply numbers stored in memory, and each parameter takes up a fixed amount of space. That amount is determined by the number's precision.

There are different levels of numerical precision, such as 32-bit floats (4 bytes each), 16-bit floats (2 bytes each), and 8-bit values (1 byte each), with 8-bit values taking up the least space in memory.
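To make that concrete, here is a rough back-of-the-envelope sketch of the weight memory for a GPT-3-scale model at each precision. It counts only the parameters themselves, not activations, optimizer states, or other runtime overhead:

```python
# Approximate memory needed just to store a model's weights
# at different numerical precisions.

PARAMS = 175e9  # GPT-3-scale parameter count

bytes_per_param = {
    "FP32 (32-bit float)": 4,
    "FP16 (16-bit float)": 2,
    "8-bit (INT8/FP8)": 1,
}

for fmt, nbytes in bytes_per_param.items():
    gigabytes = PARAMS * nbytes / 1e9
    print(f"{fmt}: ~{gigabytes:,.0f} GB")

# FP32 (32-bit float): ~700 GB
# FP16 (16-bit float): ~350 GB
# 8-bit (INT8/FP8): ~175 GB
```

Halving the precision halves the weight memory, which is exactly what makes quantization attractive.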

This is the core idea: if we decrease the precision, and thus the size, of the numbers storing the parameters, we shrink the model overall and therefore need less memory to store it and less compute to run it.
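As a simple illustration of what "decreasing precision" means in practice, here is a minimal sketch of symmetric (absmax) 8-bit quantization in NumPy. This is just one common scheme, not the specific method used by any particular LLM, and the function names are our own:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric (absmax) quantization: map floats to int8 in [-127, 127]."""
    scale = np.abs(weights).max() / 127.0    # one scale factor for the tensor
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 values."""
    return q.astype(np.float32) * scale

# Example: a small weight tensor stored in 32-bit floats
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)

print("original bytes: ", w.nbytes)   # 64 (4 bytes per value)
print("quantized bytes:", q.nbytes)   # 16 (1 byte per value)
print("max error:", np.abs(w - dequantize(q, scale)).max())
```

Each weight now occupies one byte instead of four, at the cost of a small rounding error that dequantization cannot fully undo.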
