Key Points
- The paper introduces YOCO, a decoder-decoder architecture for large language models that caches key-value (KV) pairs only once, reducing GPU memory demands and improving inference efficiency.
- The YOCO architecture consists of a self-decoder and a cross-decoder: the self-decoder efficiently encodes a global KV cache that the cross-decoder reuses via cross-attention, so the model still behaves like a decoder-only Transformer for autoregressive tasks such as language modeling (a minimal sketch follows this list).
- Experimental results show that YOCO achieves performance favorable to Transformers across settings that scale up model size and the number of training tokens.
- YOCO scales to a 1M-token context length with near-perfect needle-retrieval accuracy, showing strong long-context modeling capability.
- The profiling results demonstrate that YOCO improves inference memory, prefill latency, and throughput by orders of magnitude across context lengths and model sizes.
- Overall, YOCO is designed for autoregressive modeling and, compared with standard Transformer architectures, reduces GPU memory demands and prefilling time while scaling well with training tokens, model size, and context length.
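To make the decoder-decoder flow concrete, below is a minimal sketch in PyTorch-style Python. It is not the paper's implementation: it assumes single-head attention, made-up weight shapes, and plain causal attention in the self-decoder (where the paper uses efficient attention such as gated retention or sliding-window attention). The point it illustrates is that the KV pairs are produced once, from the self-decoder output, and shared by every cross-decoder layer.

```python
# Minimal YOCO-style decoder-decoder sketch (illustrative only).
import torch
import torch.nn.functional as F


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    n = q.shape[0]
    scores = q @ k.T / k.shape[-1] ** 0.5
    mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
    return F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v


def yoco_forward(x, self_layers, cross_layers, w_k, w_v):
    # Self-decoder: builds contextual representations of the input.
    h = x
    for w_q, w_key, w_val, w_o in self_layers:
        h = h + causal_attention(h @ w_q, h @ w_key, h @ w_val) @ w_o

    # KV pairs are produced ONCE from the self-decoder output ...
    k_global, v_global = h @ w_k, h @ w_v

    # ... and shared by every cross-decoder layer via cross-attention, so
    # the cache does not grow with depth as in a standard decoder-only model.
    for w_q, w_o in cross_layers:
        h = h + causal_attention(h @ w_q, k_global, v_global) @ w_o
    return h


if __name__ == "__main__":
    d, n = 64, 16  # hypothetical hidden size and sequence length

    def rand():
        return torch.randn(d, d) / d ** 0.5

    self_layers = [(rand(), rand(), rand(), rand()) for _ in range(2)]
    cross_layers = [(rand(), rand()) for _ in range(2)]
    out = yoco_forward(torch.randn(n, d), self_layers, cross_layers, rand(), rand())
    print(out.shape)  # torch.Size([16, 64])
```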
Summary
The paper introduces YOCO, a decoder-decoder architecture for large language models that caches key-value (KV) pairs only once. YOCO comprises a self-decoder and a cross-decoder: the self-decoder efficiently encodes a global KV cache that the cross-decoder reuses via cross-attention, substantially reducing GPU memory demands while retaining global attention capability. The computation flow also allows prefilling to exit early without changing the final output, significantly speeding up the prefill stage (as sketched below). Experimentally, YOCO achieves favorable performance compared with Transformers when scaling up model size and the number of training tokens, and it extends to a 1M-token context length with near-perfect needle retrieval accuracy. Profiling shows that YOCO improves inference memory, prefill latency, and throughput by orders of magnitude across context lengths and model sizes. Overall, YOCO notably reduces GPU memory consumption, enabling better deployment of long-sequence models.
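The early-exit property can be sketched in the same illustrative style. In the hypothetical `yoco_prefill` below (names and weight layout are made up for the example, not the paper's API), the self-decoder processes the full prompt to build the global KV cache, while the cross-decoder runs only on the last position, whose output is all that is needed to start generating.

```python
# Illustrative early-exit prefill for a YOCO-style model (not the paper's code).
import torch
import torch.nn.functional as F


def yoco_prefill(prompt_h, self_layers, cross_layers, w_k, w_v):
    """Prefill: full self-decoder pass, cross-decoder on the last token only."""
    h = prompt_h
    n = h.shape[0]
    mask = torch.triu(torch.ones(n, n), diagonal=1).bool()

    # 1) The self-decoder must see the whole prompt to build the global cache.
    for w_q, w_key, w_val, w_o in self_layers:
        scores = (h @ w_q) @ (h @ w_key).T / h.shape[-1] ** 0.5
        attn = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        h = h + (attn @ (h @ w_val)) @ w_o
    k_cache, v_cache = h @ w_k, h @ w_v  # cached once for the whole model

    # 2) Early exit: only the final position passes through the cross-decoder,
    #    since its output predicts the first generated token.
    q = h[-1:]
    for w_q, w_o in cross_layers:
        attn = F.softmax((q @ w_q) @ k_cache.T / k_cache.shape[-1] ** 0.5, dim=-1)
        q = q + (attn @ v_cache) @ w_o

    return (k_cache, v_cache), q  # cache plus last hidden state for decoding
```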
Improvements and Scalability of YOCO
The paper also details the improvements YOCO delivers across several dimensions. YOCO reduces inference memory by orders of magnitude, and its prefill latency is significantly lower across context lengths, which improves overall inference throughput. The paper further presents experimental evidence of YOCO's scalability: models trained on trillions of tokens perform on par with prominent Transformer language models, and YOCO extends to a 1M-token context length with near-perfect needle retrieval accuracy, demonstrating strong long-context modeling capability.
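As a back-of-the-envelope illustration (hypothetical configuration, not the paper's measured numbers), the snippet below compares the size of a standard per-layer KV cache with a single global cache of the same per-token shape.

```python
# Rough KV-cache size comparison with illustrative settings.
def kv_cache_gib(seq_len, n_layers, n_kv_heads, head_dim, bytes_per=2, cached_layers=None):
    layers = n_layers if cached_layers is None else cached_layers
    # 2 tensors (K and V) * layers * tokens * heads * head_dim * bytes per element
    return 2 * layers * seq_len * n_kv_heads * head_dim * bytes_per / 1024 ** 3


seq, layers, heads, dim = 1_000_000, 32, 32, 128  # hypothetical 1M-token prompt
print(f"Per-layer (Transformer-style) cache: {kv_cache_gib(seq, layers, heads, dim):.1f} GiB")
print(f"Single global (YOCO-style) cache:    {kv_cache_gib(seq, layers, heads, dim, cached_layers=1):.1f} GiB")
```

With these illustrative settings, caching KV in every layer costs roughly `n_layers` times more memory than caching once, which is the intuition behind the reported reductions; the paper's actual savings also depend on head counts, precision, and the self-decoder's constant-size state.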
YOCO's GPU Memory Footprint, Prefilling Latency, and Future Advancements
The paper also highlights YOCO's gains in GPU memory footprint, prefilling latency, and throughput: it significantly reduces memory overhead and increases serving capacity while improving inference efficiency. As future directions, the authors suggest combining YOCO with BitNet and Groq for further efficiency gains, applying it to multimodal large language models, and designing optimized mechanisms for the KV cache module. The paper also notes the use of an internal GPU cluster and a Triton kernel for gated retention. Overall, the results position YOCO as a strong candidate architecture for future large language models with native long-sequence support.
Summary and Implications
Taken together, the findings highlight the novel contributions of the YOCO architecture and its potential to improve the efficiency and performance of large language models across a range of context lengths and applications.
Reference: https://arxiv.org/abs/2405.052...