Key Points
1. The paper evaluates instruction-tuned Large Language Models (LLMs), ranging from 7B to 405B parameters, under various quantization methods. Using 13 benchmarks, it assesses performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue.
2. A larger LLM quantized to roughly the size of a smaller FP16 LLM generally performs better across most benchmarks, with the exceptions of hallucination detection and instruction following.
3. Performance varies significantly with the quantization method, model size, and bit-width; weight-only quantization often preserves accuracy better in larger models, most notably the 405B model (see the sketch after this list), and the appropriate bit precision depends on both the dataset and the model size.
4. Task difficulty shows no significant relationship with accuracy degradation from quantization: quantized models stay close to their full-precision counterparts on both easy and challenging tasks, so quantization does not disproportionately affect complex tasks.
5. Larger models suffer greater accuracy loss from quantization, especially in the second turn of multi-turn evaluations.
6. MT-Bench has limited discriminatory power among recent high-performing LLMs, whose performance approaches GPT-4; it remains most useful for dialogue evaluation with chat-oriented models, and instruction tuning affects the quality of free-form text generation.
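To make the distinction between weight-only and weight-and-activation quantization concrete, here is a minimal, hypothetical PyTorch sketch of naive per-channel, round-to-nearest weight-only quantization. It is not the paper's code; methods such as GPTQ and AWQ build on this idea with error compensation and activation-aware scaling, and real 4-bit kernels pack two values per byte rather than storing int8.

```python
import torch

def quantize_weight_only(weight: torch.Tensor, n_bits: int = 4):
    """Symmetric, per-output-channel round-to-nearest quantization of a 2-D weight."""
    qmax = 2 ** (n_bits - 1) - 1                        # e.g. 7 for 4-bit
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)                       # guard against all-zero rows
    q = torch.round(weight / scale).clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a floating-point approximation of the original weights."""
    return q.float() * scale

if __name__ == "__main__":
    w = torch.randn(4096, 4096)                         # one Linear layer's weight
    q, scale = quantize_weight_only(w, n_bits=4)
    w_hat = dequantize(q, scale)
    # Only the weights are stored in low precision; activations stay in FP16/FP32.
    # This is the "weight-only" setting the paper contrasts with SmoothQuant-style
    # joint weight-and-activation quantization.
    rel_err = (w - w_hat).abs().mean() / w.abs().mean()
    print(f"mean relative reconstruction error: {rel_err:.4f}")
```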
Summary
The research paper "A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B" evaluates the performance of instruction-tuned language models (LLMs) across various quantization methods, ranging from 7B to 405B models. The study employs 13 benchmarks across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. The key findings include that quantizing larger LLMs to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following. Additionally, the study reveals that performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models. It's also found that task difficulty does not significantly impact accuracy degradation due to quantization, and the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
The paper highlights the limitations of prior research on evaluating quantized LLMs, which has relied primarily on limited metrics such as perplexity and on outdated datasets, leaving cutting-edge LLMs insufficiently understood. To address these limitations, the study comprehensively evaluates the performance of instruction-tuned LLMs under quantization, using 13 benchmarks, including five datasets not covered in previous studies, and models ranging from 7B to 405B parameters. The quantization methods considered are GPTQ, AWQ, SmoothQuant, and FP8.
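As an illustration of how one of these methods is typically applied in practice, the snippet below follows the usage pattern documented for the AutoAWQ library (4-bit AWQ weight-only quantization). The model path, output directory, and quantization settings are illustrative choices, not the paper's exact configuration, and argument names may differ across AutoAWQ versions.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Illustrative model and output paths -- not necessarily those used in the paper.
model_path = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-instruct-awq"

# Typical 4-bit AWQ settings: group size 128, zero-point enabled.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate on AutoAWQ's default calibration data and quantize the weights.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```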
The results show that quantized LLMs generally outperform smaller models on most benchmarks, except for hallucination detection and instruction-following tasks. Performance varies with the quantization method and precision level, with weight-only methods preserving accuracy better than joint weight-and-activation quantization in larger models. The study also finds that task difficulty has little impact on the accuracy degradation caused by quantization, and that the MT-Bench evaluation method has limited discriminatory power among recent LLMs.
The paper underscores the importance of evaluating the impact of quantization on models with as many as 405B parameters, particularly the accuracy degradation observed on recent datasets with minimal risk of data contamination. The study meticulously examines the application of different quantization pipelines, namely AutoGPTQ (GPTQ), AutoAWQ (AWQ), and llmcompressor with SmoothQuant and with FP8, across various LLM sizes and benchmarks. Overall, the findings provide critical insights into the performance of quantized instruction-tuned models and extend the understanding of their applicability in real-world scenarios.
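For readers unfamiliar with SmoothQuant, the sketch below illustrates its core scale-migration step in plain PyTorch (a conceptual toy, not the llmcompressor implementation the paper uses): per-input-channel activation outliers are partially shifted into the weights so that both weights and activations become easier to quantize to 8 bits, while the layer's output is mathematically unchanged.

```python
import torch

def smoothquant_scales(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), one scale per input channel j."""
    w_absmax = weight.abs().amax(dim=0)                 # per-input-channel weight max
    return (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy activations with strong per-channel outliers (typical of large LLMs) and a Linear weight.
    x = torch.randn(16, 512) * torch.linspace(0.1, 20.0, 512)
    w = torch.randn(1024, 512)                          # shape (out_features, in_features)

    s = smoothquant_scales(x.abs().amax(dim=0), w, alpha=0.5)
    x_smooth, w_smooth = x / s, w * s                   # x @ w.T == x_smooth @ w_smooth.T

    # The per-channel dynamic range of the activations shrinks, which is what
    # makes W8A8 (joint weight-and-activation) quantization viable for large models.
    spread = lambda t: (t.abs().amax(dim=0).max() / t.abs().amax(dim=0).min()).item()
    print(f"activation range spread before: {spread(x):.1f}, after: {spread(x_smooth):.1f}")
```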
Reference: https://arxiv.org/abs/2409.11055