LLM Weights Compress to 4.5-6 Bits with New Method

Why this is here: The new method achieves near-lossless fidelity at 4.5 to 6 bits per weight, potentially reducing storage needs for large language models.

Researchers propose Scaled Outer Product, a post-training quantization method for large language models that targets hardware with per-layer lookup-table (LUT) decode. For each layer, the technique searches for a pair of codebooks, one fixed and one dynamic. It then compresses the weights block by block, using per-block selection bits and signed scales.
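The block-level idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the two 16-entry codebooks are placeholders (not NF4 or DD4 values), the selection bit is assumed to choose between the two codebooks of the pair, the scale is assumed to be the block's peak magnitude with either sign, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Placeholder 4-bit codebooks for illustration only (NOT the paper's tables).
# CODEBOOK_A is a uniform grid; CODEBOOK_B is asymmetric and contains an exact zero.
CODEBOOK_A: List[float] = [-1.0 + 2.0 * i / 15 for i in range(16)]
CODEBOOK_B: List[float] = [-1.0, -0.7, -0.5, -0.35, -0.25, -0.15, -0.07, 0.0,
                           0.05, 0.1, 0.18, 0.27, 0.38, 0.52, 0.72, 1.0]
CODEBOOK_PAIR = (CODEBOOK_A, CODEBOOK_B)


def nearest_index(codebook: Sequence[float], x: float) -> int:
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))


def quantize_block(block: Sequence[float],
                   codebooks: Sequence[Sequence[float]] = CODEBOOK_PAIR
                   ) -> Tuple[int, float, List[int]]:
    """Try every (codebook, scale sign) combination and keep the one with
    the lowest squared reconstruction error for this block."""
    peak = max((abs(w) for w in block), default=0.0)
    # An all-zero block has no meaningful scale; fall back to 1.0.
    scales = (1.0,) if peak == 0.0 else (peak, -peak)
    best = None
    for select, cb in enumerate(codebooks):
        for scale in scales:
            idx = [nearest_index(cb, w / scale) for w in block]
            err = sum((w - scale * cb[i]) ** 2 for w, i in zip(block, idx))
            if best is None or err < best[0]:
                best = (err, select, scale, idx)
    _, select, scale, idx = best
    return select, scale, idx


def dequantize_block(select: int, scale: float, idx: Sequence[int],
                     codebooks: Sequence[Sequence[float]] = CODEBOOK_PAIR
                     ) -> List[float]:
    """LUT-style decode: look up each index in the selected codebook, then scale."""
    cb = codebooks[select]
    return [scale * cb[i] for i in idx]


block = [0.5, -0.25, 0.1, 0.0]
select, scale, idx = quantize_block(block)
recon = dequantize_block(select, scale, idx)
```

In this sketch a negative scale mirrors an asymmetric codebook, which is one plausible reason to store the scale with a sign. The search in the actual method may differ.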

The method combines several techniques, including activation-weighted cosine selection and outlier correction. The fixed codebooks tested include NF4, BOF4, Split87, and SH4. Optimized codebooks, called DD4, reside in lookup-table SRAM.
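One way to read "activation-weighted cosine selection" is sketched below. This is an interpretation, not the paper's definition: it assumes each input channel has a non-negative importance weight (for example, a mean squared activation from calibration data) and that the candidate reconstruction with the highest weighted cosine similarity to the original weights is chosen. The function names and the sample numbers are hypothetical.

```python
import math
from typing import Sequence


def weighted_cosine(w: Sequence[float], q: Sequence[float],
                    a: Sequence[float]) -> float:
    """Cosine similarity between w and q under per-channel weights a."""
    dot = sum(ai * wi * qi for ai, wi, qi in zip(a, w, q))
    norm_w = sum(ai * wi * wi for ai, wi in zip(a, w))
    norm_q = sum(ai * qi * qi for ai, qi in zip(a, q))
    if norm_w == 0.0 or norm_q == 0.0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / math.sqrt(norm_w * norm_q)


def select_candidate(w: Sequence[float],
                     candidates: Sequence[Sequence[float]],
                     a: Sequence[float]) -> int:
    """Index of the candidate reconstruction most aligned with w under a."""
    scores = [weighted_cosine(w, q, a) for q in candidates]
    return max(range(len(candidates)), key=lambda k: scores[k])


weights = [0.8, -0.6]
candidates = [[1.0, 0.0],    # captures only channel 0
              [0.0, -1.0]]   # captures only channel 1

pick_when_ch0_matters = select_candidate(weights, candidates, [1.0, 0.01])
pick_when_ch1_matters = select_candidate(weights, candidates, [0.01, 1.0])
```

Because cosine similarity ignores magnitude, a selection rule of this kind compares the shape of each candidate rather than its scale, which would leave the scale to be fitted separately. Whether the paper separates the two steps this way is not stated in the summary above.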

Tests across six open model families show that the recommended FP6 operating point achieves lower weight reconstruction error than the conventional FP8 baseline. This compression lowers storage cost by roughly 1.5 bits per weight. A companion paper details the full evaluation and the layer promotion strategies.