Key Points
1. The clipping range [α, β] determines the scaling factor and zero-point (offset) used to map FP32 values to low-bit representations in language model compression. α and β must be calibrated carefully so that the quantized values faithfully represent the original real-valued inputs.
2. Uniform quantization methods are divided into affine quantization and scale quantization according to whether the clipping range is symmetric. Affine quantization represents the original values more precisely, whereas scale quantization is preferred when minimizing computational cost matters most (a sketch of both appears after this list).
3. Non-uniform quantization has drawn attention because it represents the original parameters more accurately than uniform quantization. It includes low-bit floating-point, log-scaled, binary-coded, and codebook-based methods (a log-scaled sketch appears after this list).
4. The precision and datatype of the quantized outputs are important characteristics of quantization algorithms. Extreme quantization techniques push bit precision below 4 bits but are vulnerable to accuracy degradation, which motivates mixed-precision quantization algorithms.
5. Quantization cost depends on whether an algorithm requires retraining or fine-tuning after quantization. High-cost methods train the quantized model, entirely or partially, to recover from accuracy degradation, whereas low-cost methods require no such training.
6. Low-rank approximation reduces model size by decomposing a high-dimensional matrix or tensor into lower-dimensional factors (an SVD sketch appears after this list). It has achieved the highest parameter compression rates in high-cost compression scenarios.
7. Parameter sharing reuses the same parameters in different parts of a model, reducing the parameter count and thereby promoting better generalization and faster convergence (a weight-sharing sketch appears after this list).
8. Efficient architecture design techniques either design a Transformer layer with an efficient structure or automatically search, starting from a pre-trained model, for an architecture that meets given constraints.
9. Promising research areas for language model compression include low-cost iterative algorithms, directly optimizing the compression objective, integrating PEFT with high-cost algorithms, accurate pruning algorithms for LLMs, and unifying compression algorithms.
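The following is a minimal sketch, assuming 8-bit quantization and naive min/max calibration, of how a clipping range [α, β] yields the scaling factor and zero-point from points 1 and 2, and how affine (asymmetric) and scale (symmetric) quantization differ. The function names and bit-width are illustrative, not taken from the paper.

```python
import numpy as np

def affine_quantize(x, alpha, beta, bits=8):
    """Asymmetric (affine) quantization: uses both a scale and a zero-point."""
    qmin, qmax = 0, 2 ** bits - 1
    scale = (beta - alpha) / (qmax - qmin)              # FP32 scaling factor
    zero_point = int(round(qmin - alpha / scale))       # integer offset
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def scale_quantize(x, alpha, beta, bits=8):
    """Symmetric (scale) quantization: symmetric clipping range, no zero-point."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(alpha), abs(beta)) / qmax           # symmetric range [-m, m]
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q.astype(np.int8), scale

x = np.random.randn(4, 4).astype(np.float32)
alpha, beta = float(x.min()), float(x.max())            # naive min/max calibration
q, scale, zp = affine_quantize(x, alpha, beta)
x_hat = (q.astype(np.float32) - zp) * scale             # dequantized approximation
```

Dropping the zero-point, as in scale_quantize, saves an addition per element at inference time, which is why symmetric quantization is preferred when computational cost dominates.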
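Next, a hedged sketch of one non-uniform scheme from point 3, log-scaled quantization: only the sign and a rounded power-of-two exponent are stored, so quantization levels are denser near zero where parameters concentrate. The bit-width and epsilon below are illustrative assumptions.

```python
import numpy as np

def log2_quantize(x, bits=4):
    """Quantize to signed powers of two: x is approximated by sign(x) * 2**e."""
    sign = np.sign(x)
    exponent = np.round(np.log2(np.abs(x) + 1e-12))     # nearest power of two
    exponent = np.clip(exponent, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return sign.astype(np.int8), exponent.astype(np.int8)

def log2_dequantize(sign, exponent):
    return sign * np.power(2.0, exponent.astype(np.float32))
```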
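For point 6, a minimal sketch of low-rank approximation via truncated SVD, assuming a single weight matrix and a hand-picked rank; in practice the rank is tuned and the factors are typically fine-tuned afterwards.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (d_out x d_in) by A @ B with A (d_out x r) and B (r x d_in)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]      # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

W = np.random.randn(1024, 1024)
A, B = low_rank_factorize(W, rank=64)
params_before = W.size              # 1,048,576 parameters
params_after = A.size + B.size      # 131,072 parameters, an 8x reduction
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```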
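For point 7, a minimal PyTorch-style sketch of cross-layer parameter sharing in the spirit of ALBERT: a single Transformer block is reused for every layer, so its weights are stored and updated only once. The class name and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Cross-layer sharing: one Transformer block applied num_layers times."""
    def __init__(self, d_model=512, n_heads=8, num_layers=12):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)   # parameters stored once
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):          # same parameters on every pass
            x = self.shared_block(x)
        return x

encoder = SharedLayerEncoder()
out = encoder(torch.randn(2, 16, 512))            # (batch, seq_len, d_model)
```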
Summary
Overview of Compression Algorithms for Language Models
The paper provides an extensive survey of compression algorithms for language models, focusing on the transition from high-cost to low-cost compression algorithms applicable to large language models (LLMs). The authors compare various compression algorithms, including pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient neural architecture design. They summarize the overall trends of compression algorithms, select representative algorithms for in-depth analysis, and provide discussion and future research directions.
Necessity of Compressing Language Models
The paper highlights the necessity of compressing language models, citing challenges such as increased carbon emissions and high maintenance costs that accompany their remarkable advances. The authors conduct a comprehensive survey of compression algorithms, including low-cost algorithms applicable to LLMs. They cover diverse types of compression algorithms, compare their performance, and provide in-depth analyses of representative algorithms. The paper discusses the contribution of each field of compression and introduces the desired properties of successful low-cost compression algorithms for LLMs. Promising future research topics are proposed based on this discussion.
The manuscript formally defines the pretrained language model (PLM) compression problem and presents the preliminaries needed to address it, including the Transformer architecture that dominates PLMs. The authors summarize numerous compression algorithms and provide detailed explanations of representative algorithms for pruning, quantization, and the other compression categories. They close by discussing current compression algorithms and promising research areas, with future research directions presented in Section 6.
Reference: https://arxiv.org/abs/2401.15347