Key Points

1. Real-world applications demand high throughput and low latency from large language models (LLMs), but their large memory footprint has been a major bottleneck to achieving large batch sizes and high-throughput generation.

2. The key-value (KV) cache for the attention mechanism in the transformer architecture consumes a significant amount of memory, since every layer maintains its own cache and the cost therefore grows with the depth of the model.

3. The paper proposes a novel method that computes and caches the KVs of only a small number of layers, significantly reducing memory consumption and improving inference throughput.

4. The proposed method achieves up to 26× higher throughput than standard transformers while remaining competitive in language modeling and downstream tasks. It is also orthogonal to existing transformer memory-saving techniques and can be combined with them for further gains in inference efficiency.

5. The proposed method shrinks the KV cache by dramatically reducing the number of cached layers, and a novel approximate training procedure keeps parallel training feasible despite the sequential dependencies this design introduces.

6. Integrated with StreamingLLM, the method achieves lower latency and memory consumption than the original StreamingLLM across different cache sizes, demonstrating that it composes well with other memory-saving techniques (the first sketch after this list illustrates StreamingLLM's cache-eviction idea).

7. Empirical analysis shows that the placement of warmup layers in the proposed method significantly impacts language-modeling performance, with the sandwich-style placement (top and bottom layers serving as warmup layers) performing best (see the second sketch after this list).

8. Compared with standard transformers, the proposed method delivers significant memory reduction and throughput improvement while remaining competitive in language modeling and downstream tasks, underscoring its effectiveness for efficient LLM inference.

9. Although the training process of the proposed method is more complicated than that of standard transformers because of the sequential dependencies, the method achieves competitive pre-training results with only a minor slowdown in training.
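
For context on point 6: StreamingLLM's core idea is to keep a few initial "attention sink" tokens plus a sliding window of the most recent tokens in the KV cache, evicting everything in between. The snippet below is a minimal sketch of that eviction policy in plain Python; the function and parameter names are illustrative and are not taken from either codebase.

```python
def evict_streaming_cache(cached_positions, n_sink=4, window=1020):
    """Keep the first `n_sink` cached positions (the "attention sinks") plus
    the most recent `window` positions, evicting everything in between."""
    if len(cached_positions) <= n_sink + window:
        return cached_positions
    return cached_positions[:n_sink] + cached_positions[-window:]


# Example: after caching 2048 token positions with a 1024-entry budget,
# positions 0-3 and the latest 1020 positions remain.
kept = evict_streaming_cache(list(range(2048)))
print(len(kept), kept[:5], kept[-2:])
```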

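Point 7's "warmup" layers are the layers that keep standard attention over their own KVs, while the remaining layers share the top layer's cache. The sketch below shows one way a sandwich-style placement could be expressed; the even bottom/top split and the example sizes are illustrative assumptions, not the paper's exact configuration.

```python
def sandwich_warmup_layers(num_layers, num_warmup):
    """Return the indices of warmup layers under a sandwich-style placement:
    half at the bottom of the stack and half at the top, with the middle
    layers relying on the top layer's KV cache instead of their own."""
    bottom = num_warmup // 2
    top = num_warmup - bottom
    return sorted(set(range(bottom)) | set(range(num_layers - top, num_layers)))


# Example: a 24-layer model with 4 warmup layers keeps standard attention
# in layers [0, 1, 22, 23].
print(sandwich_warmup_layers(24, 4))
```
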
Summary

The research paper introduces a novel method for reducing memory consumption in high-throughput large language models (LLMs) by targeting the key-value (KV) cache. The KV cache stores the keys and values of every transformer layer during generation so that they do not have to be re-computed for each new token. The authors note that the large memory footprint of LLMs, and of the KV cache in particular, has been a major bottleneck for deploying high-throughput models in real-world applications.
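
To make the scale of the problem concrete, the sketch below estimates the size of a standard KV cache. The model configuration is an illustrative assumption (roughly a 7B-scale model stored in 16-bit precision), not a figure reported in the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Memory needed to cache keys and values (hence the factor of 2) for
    every layer, token, and sequence in a batch, assuming 16-bit storage."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical 32-layer model with 32 KV heads of dimension 128,
# serving a batch of 16 sequences of 4096 tokens:
total = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=4096, batch_size=16)
print(f"{total / 2**30:.0f} GiB")  # 32 GiB for the KV cache alone
```

Caching the keys and values of only one layer (or a handful of layers) instead of all 32 shrinks this figure roughly in proportion to the number of cached layers, which is what enables much larger batch sizes.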

Proposed Method
The proposed method computes and caches the KVs of only a small number of layers, thereby significantly reducing memory consumption and enhancing inference throughput. The paper reports that the method achieves up to 26× higher throughput than standard transformers while maintaining competitive performance in language modeling and downstream tasks. Additionally, the method is compatible with existing memory-saving techniques, allowing the two to be combined for further improvements in inference efficiency.

Memory Consumption Reduction Approach
The approach pairs the queries of all layers with the keys and values of just the top layer, so that only one layer's keys and values need to be computed and cached, compared with tens of layers in a typical LLM. The authors draw inspiration from the view of a transformer's stacked layers as an iterative process of refining token representations, in which the top-layer representation is the most informative.
In experiments on large language models, the method achieves up to 32× larger batch sizes and up to 26× higher throughput than standard transformers, and it is shown to integrate effectively with other memory-saving techniques. The authors also release their code, making the method practical to adopt.
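
To illustrate the idea concretely, the following PyTorch-style sketch decodes tokens one at a time while the queries of every layer attend to the keys and values produced by the top layer, so only one layer's KVs are ever cached. It is a simplified illustration under assumed shapes: warmup layers, multi-head attention, layer normalization, and feed-forward sublayers are omitted, and it should not be read as the authors' implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, num_layers = 64, 8

# Per-layer query/output projections; only the *top* layer's key/value
# projections exist, because only its KVs are computed and cached.
Wq = [torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(num_layers)]
Wo = [torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(num_layers)]
Wk_top = torch.randn(d_model, d_model) / d_model ** 0.5
Wv_top = torch.randn(d_model, d_model) / d_model ** 0.5

def decode_step(x, keys, values):
    """One generation step for a new token embedding `x` (shape [d_model]).

    At every layer, the token's query attends to the cached *top-layer*
    keys/values of the previous tokens; afterwards, the new token's own
    top-layer K/V are appended to the cache."""
    h = x
    for layer in range(num_layers):
        if keys:                              # nothing cached for the first token
            q = h @ Wq[layer]
            K, V = torch.stack(keys), torch.stack(values)
            attn = F.softmax(q @ K.T / d_model ** 0.5, dim=-1)
            h = h + (attn @ V) @ Wo[layer]
    keys.append(h @ Wk_top)                   # the only layer whose KVs are cached
    values.append(h @ Wv_top)
    return h

keys, values = [], []
for _ in range(5):                            # decode five tokens
    decode_step(torch.randn(d_model), keys, values)
print(f"cache holds {len(keys)} tokens × 1 layer (vs. {num_layers} layers in a standard cache)")
```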

Addressing Model Challenges
The paper also addresses the main challenge the design introduces for training: because every layer attends to the top layer's KVs, and a token's top-layer KVs are only available after its full forward pass, a naive implementation would have to process tokens sequentially. The authors derive an approximate training method that removes this bottleneck, so that parallel training remains feasible for the proposed model.
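
The authors' exact training procedure is detailed in the paper; purely as a hedged illustration, the self-contained toy below shows one way such an approximation could be structured: the forward pass is repeated a few times over the whole sequence in parallel, each pass attending to the top-layer keys and values produced by the previous pass, with early passes detached from the gradient. The single-projection model, the iteration count, and the toy loss are assumptions made for the sketch, not the paper's settings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, seq_len, num_iterations = 16, 8, 3

# Toy parameters standing in for a model whose layers attend to the top
# layer's keys/values; a single attention block is used here for brevity.
Wq = torch.nn.Parameter(torch.randn(d_model, d_model) / d_model ** 0.5)
Wk = torch.nn.Parameter(torch.randn(d_model, d_model) / d_model ** 0.5)
Wv = torch.nn.Parameter(torch.randn(d_model, d_model) / d_model ** 0.5)
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

def model_forward(x, prev_kv):
    """Process all positions in parallel. In place of a sequentially built
    cache, attention uses the top-layer K/V from the previous iteration;
    the first iteration has nothing to attend to."""
    if prev_kv is None:
        h = x
    else:
        K, V = prev_kv
        scores = (x @ Wq) @ K.T / d_model ** 0.5
        scores = scores.masked_fill(causal_mask, float("-inf"))
        h = x + F.softmax(scores, dim=-1) @ V
    return h, (h @ Wk, h @ Wv)                # hidden states + this pass's top K/V

x = torch.randn(seq_len, d_model)             # stand-in for token embeddings
target = torch.randn(seq_len, d_model)        # stand-in for an LM objective

prev_kv = None
for i in range(num_iterations):
    h, top_kv = model_forward(x, prev_kv)
    # K/V from early iterations are detached so gradients flow only through
    # the final passes, keeping the computation graph and memory in check.
    keep_grad = i >= num_iterations - 2
    prev_kv = top_kv if keep_grad else tuple(t.detach() for t in top_kv)

loss = F.mse_loss(h, target)                  # toy loss in place of cross-entropy
loss.backward()
print(all(p.grad is not None for p in (Wq, Wk, Wv)))  # gradients reach all weights
```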

Experimental Evaluation
Furthermore, the paper presents comprehensive experiments comparing the proposed method with standard transformers and other memory-saving techniques, measuring memory consumption, inference throughput, and performance in language modeling and downstream tasks. The findings confirm that the method reduces memory consumption and improves inference efficiency.

In conclusion, the research paper proposes a novel method for reducing memory consumption in high-throughput large language models, presenting detailed insights into the rationale, approach, experimental findings, and compatibility with existing memory-saving techniques. The experimental results and the released code underscore the method's practical applicability and effectiveness.

Reference: https://arxiv.org/abs/2405.10637