VPTQ: Extreme low-bit Quantization for real LLMs
SMRTR summary
Vector Post-Training Quantization (VPTQ) is a method for compressing the weights of large language models to 1-2 bits without retraining, while largely preserving accuracy. It scales to models of up to 405 billion parameters, and quantizing the largest takes about 17 hours. The compressed models need substantially less memory and run inference faster, which makes very large language models easier to deploy. The work was accepted at EMNLP 2024, and its open-source code is available on GitHub for researchers to use and extend.
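To give a rough sense of the memory savings mentioned above, here is a back-of-the-envelope sketch in Python. It is not taken from the VPTQ code. It assumes a 16-bit baseline (the summary does not state the original precision), counts only weight storage, and ignores any overhead such as codebooks, indices, or activations. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10^9 bytes).

    Illustrative only: ignores quantization overhead such as codebooks.
    """
    return num_params * bits_per_weight / 8 / 1e9


params = 405e9  # largest model size mentioned in the summary

# Assumed 16-bit baseline versus the 2-bit and 1-bit targets.
for bits in (16, 2, 1):
    print(f"{bits:>2} bits: {weight_memory_gb(params, bits):8.2f} GB")
```

Under these assumptions, the weights of a 405-billion-parameter model shrink from about 810 GB at 16 bits to about 101 GB at 2 bits. Real footprints will be somewhat larger once quantization metadata is included.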
SMRTR provides this summary for quick context. The original article belongs to Hacker News.