Key Points

- Large language models have become integral to natural language processing, but they come with substantial costs in terms of compute and memory resources.

- Sparsification offers a way to alleviate these resource constraints, and recent work has shown that trained models can be sparsified post hoc.

- Existing sparsification techniques are limited in practice: they require additional data structures and offer only constrained speedups on current hardware.

- The paper presents SliceGPT, a new post-training sparsification scheme that replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network (a short code sketch of this operation follows the list).

- Through extensive experimentation, the paper shows that SliceGPT can remove up to 25% of the model parameters for the LLAMA-2 70B, OPT 66B, and Phi-2 models while largely maintaining the zero-shot task performance of the dense models.

- The sliced models run on fewer GPUs and run faster without any additional code optimization, reducing the total compute for inference on LLAMA-2 70B to 64% of that of the dense model on 24GB consumer GPUs and to 66% on 40GB A100 GPUs.

- The paper introduces the idea of computational invariance in transformer networks, which enables SliceGPT and may inspire future avenues to reduce memory and computation demands for pre-trained models.

- The paper also provides an extensive analysis of the impact of calibration set size and sequence length on model performance, a spectrum analysis of the LLAMA-2 and OPT models, and a benchmark of SliceGPT inference time against SparseGPT.
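
The core slicing operation referenced above can be sketched in a few lines of PyTorch. This is a simplified illustration under assumptions of my own (a single weight matrix in isolation, a PCA basis computed from calibration activations, hypothetical function and variable names), not the paper's full algorithm, which applies a shared orthogonal transform per block and folds normalization parameters into adjacent matrices:

```python
import torch

def slice_weight(W: torch.Tensor, X_calib: torch.Tensor, keep_frac: float = 0.75):
    """Sketch of post-training slicing: rotate a weight matrix into the PCA
    basis of calibration activations and drop the trailing directions.

    W         : (d_in, d_out) weight applied as  y = x @ W
    X_calib   : (n_samples, d_in) calibration activations that feed W
    keep_frac : fraction of the embedding dimension to keep
                (0.75 corresponds to slicing away 25%)
    """
    d_in = W.shape[0]
    k = int(keep_frac * d_in)

    # Eigenvectors of the activation covariance give the PCA basis Q.
    cov = X_calib.T @ X_calib / X_calib.shape[0]
    eigvals, Q = torch.linalg.eigh(cov)          # eigenvalues in ascending order
    Q = Q[:, eigvals.argsort(descending=True)]   # principal directions first

    # Rotate and slice: the resulting weight is still dense, just smaller.
    W_sliced = (Q.T @ W)[:k, :]                  # shape (k, d_out)
    return W_sliced, Q[:, :k]                    # Q[:, :k] rotates/slices the input
```

At inference time the input to this matrix is rotated and sliced the same way, so `y = x @ W` is replaced by the strictly smaller dense matmul `y ≈ (x @ Q_k) @ W_sliced`.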

Summary

Computational Requirements of Large Language Models (LLMs)
The paper addresses the computational requirements of large language models (LLMs) through a post-training sparsification scheme called SliceGPT. Models such as LLAMA-2 and OPT are expensive to deploy because of their size and autoregressive inference. Existing model compression techniques, including distillation, tensor decomposition, pruning, and quantization, are effective but can degrade performance or add computational overhead of their own.

Introduction of SliceGPT
The paper introduces SliceGPT, a new post-training sparsification technique that reduces the embedding dimension of the network by replacing weight matrices with smaller dense matrices. Through extensive experimentation, the authors demonstrate that SliceGPT can remove up to 25% of the model parameters while maintaining model performance. Their findings show that SliceGPT enables compressed models to run faster and require fewer GPUs for inference. The paper also introduces the idea of computational invariance in transformer networks, which underpins SliceGPT and may inspire future approaches to reducing the memory and compute demands of large language models.
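
The computational-invariance property can be checked numerically. The sketch below uses my own toy setup (RMSNorm without a learned scale, which the paper reaches by folding LayerNorm's affine parameters into neighbouring weight matrices, and arbitrary dimensions): rotating the residual stream by an orthogonal matrix Q and folding Q into the weights that read from it leaves the output unchanged.

```python
import torch

def rmsnorm(x, eps=1e-6):
    # RMSNorm with no learned scale: x / rms(x)
    return x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

torch.manual_seed(0)
d, h, n = 64, 256, 8
x    = torch.randn(n, d)                       # residual-stream activations
W_in = torch.randn(d, h)                       # weight reading from the residual stream
Q, _ = torch.linalg.qr(torch.randn(d, d))      # random orthogonal matrix

# Original computation entering a block.
y_orig = rmsnorm(x) @ W_in

# Transformed network: rotate the residual stream by Q and fold Q^T into W_in.
# Because Q is orthogonal, RMSNorm(x @ Q) == RMSNorm(x) @ Q, so the result is
# identical -- this is the computational invariance that SliceGPT exploits.
y_rot = rmsnorm(x @ Q) @ (Q.T @ W_in)

print(torch.allclose(y_orig, y_rot, atol=1e-4))   # True
```

SliceGPT chooses Q from a PCA of calibration activations and then deletes the least significant directions, turning an exact invariance into a low-error compression.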

Potential Impact and Analysis of SliceGPT
The authors also discuss how SliceGPT addresses the computational demands of LLMs when generating text, and how structured sparsity reduces the computational complexity of the compressed models. They provide a comprehensive analysis of computational requirements and inference times, benchmarking SliceGPT's throughput against existing sparsity methods. Additionally, the paper examines the influence of calibration set size and sequence length on model performance and provides spectrum analyses of the LLAMA-2 and OPT models to better understand the impact of sparsification on model parameters.
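
One reason the sliced models speed up without custom kernels is that the compressed weights stay dense, so standard matrix-multiply routines simply operate on smaller matrices; unstructured sparsity, by contrast, needs extra index structures and specialized kernels. The rough benchmark below is my own illustration with hypothetical sizes and a simple CPU timer, not the paper's benchmark harness: it compares an MLP up-projection at the full embedding dimension against one whose embedding side has been sliced by 25%.

```python
import time
import torch

def bench(x, W, iters=10):
    # Crude wall-clock timing of y = x @ W; on a GPU, use CUDA events and
    # torch.cuda.synchronize() instead.
    t0 = time.perf_counter()
    for _ in range(iters):
        _ = x @ W
    return (time.perf_counter() - t0) / iters

d, seq   = 2048, 256            # hypothetical embedding dim and sequence length
d_sliced = int(0.75 * d)        # embedding dimension after slicing away 25%

x_dense  = torch.randn(seq, d)
W_dense  = torch.randn(d, 4 * d)           # full MLP up-projection
x_sliced = torch.randn(seq, d_sliced)
W_sliced = torch.randn(d_sliced, 4 * d)    # same projection, sliced input dim

print(f"dense : {bench(x_dense,  W_dense):.4f} s/iter")
print(f"sliced: {bench(x_sliced, W_sliced):.4f} s/iter")
```

Because only the embedding dimension is sliced, the smaller matmul is a drop-in replacement that benefits from ordinary dense GEMM performance, which is what allows the reported speedups without additional code optimization.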

Overall, the paper introduces SliceGPT as a promising post-training sparsification technique for reducing the computational demands of large language models, preserving most of the dense model's performance while delivering speedups and reduced memory requirements. The study offers observations and findings that contribute to ongoing efforts to improve the efficiency of deep learning models and may inspire new theoretical insights.

Reference: https://arxiv.org/abs/2401.15024v1