Google Advances KV Cache Quantization

Why this is here: The KV cache, which speeds up inference in large language models, grows with context length and, unlike model weights, is not quantized automatically, so it can quickly become the largest memory consumer.

Google researchers in the United States are developing TurboQuant, a method for reducing the memory footprint of the KV cache in large language models. The KV cache speeds up generation by storing the attention keys and values computed for earlier tokens so that they are not recomputed at every step, but it grows with context length and can become very large for long contexts. While model weights are often quantized to 4 bits, KV caches typically remain at 8 bits because outlier values can cause accuracy loss at lower precision.
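To make the memory pressure concrete, the following back-of-the-envelope sketch estimates KV cache size at different bit widths. The model shape (32 layers, 8 KV heads, head dimension 128) and the 131072-token context are hypothetical values chosen only for illustration; they are not taken from the TurboQuant work, and the estimate ignores the small overhead of quantization scales and other metadata.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, bits=16):
    """Approximate KV cache size in bytes (ignores quantization scales/metadata)."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits // 8


# Hypothetical model shape, assumed purely for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for bits in (16, 8, 4):
    size = kv_cache_bytes(seq_len=131072, bits=bits, **cfg)
    print(f"{bits:>2}-bit KV cache at 131072 tokens: {size / 2**30:.1f} GiB")
```

Under these assumptions the cache size scales linearly with both the context length and the bit width, which is why halving the bits per element roughly halves the VRAM the cache needs.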

TurboQuant builds on existing techniques such as KVQuant and KIVI and aims to quantize the KV cache aggressively. The VRAM saved can be used to run more powerful models or to process longer inputs, such as extensive documents. This addresses a key limitation: the KV cache is often the largest memory consumer and, unlike model weights, is not quantized automatically.
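The sketch below illustrates why outliers make low-bit quantization difficult. It is a generic uniform quantizer written for this note, not the TurboQuant, KVQuant, or KIVI algorithm. The synthetic data, the single injected outlier, and the group size of 32 are all assumptions made for illustration. It compares using one scale for the whole vector with using a separate scale per small group of values.

```python
import random


def quantize_dequantize(values, bits):
    """Asymmetric uniform quantization of one group, then dequantization."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return [lo + c * scale for c in codes]


def mean_abs_error(values, bits, group_size):
    """Mean reconstruction error when each group gets its own scale."""
    total = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        restored = quantize_dequantize(group, bits)
        total += sum(abs(a - b) for a, b in zip(group, restored))
    return total / len(values)


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(1024)]  # synthetic stand-in for cache values
x[100] = 60.0  # a single synthetic outlier

for bits in (8, 4):
    whole = mean_abs_error(x, bits, group_size=len(x))
    grouped = mean_abs_error(x, bits, group_size=32)
    print(f"{bits}-bit: one scale for all values {whole:.3f}, per-group scales {grouped:.3f}")
```

With a single scale, the outlier stretches the quantization range, so ordinary values are rounded much more coarsely, and the effect is far more pronounced at 4 bits than at 8. Confining the outlier's influence to one small group is one common way to limit that damage, at the cost of storing more scale values.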

Hardware support for 4-bit floating-point formats, such as NVFP4 and MXFP4 on new GPUs, is also gaining traction. Further research is needed to balance the memory savings of extreme quantization against its potential impact on model accuracy and performance.