Key Points

1. The paper introduces an efficient method for scaling Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation, built around a new attention technique called Infini-attention.

2. Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block (a minimal sketch of the compressive memory follows this list).

3. The approach enables Transformer LLMs to effectively process infinitely long inputs with a bounded memory footprint and computation, allowing for fast streaming inference for LLMs.

4. The paper shows that the proposed approach outperforms baseline models on long-context language modeling benchmarks, achieving a 114x compression ratio in terms of memory size.

5. The Infini-Transformer, which operates on a sequence of segments, demonstrates improved performance compared to Transformer-XL in long-context language modeling and book summarization tasks.

6. The Infini-attention mechanism supports plug-and-play continual pre-training and long-context adaptation, allowing LLMs to scale to infinitely long context in a streaming fashion.

7. The paper presents experimental results demonstrating that Infini-Transformer models successfully solve tasks with up to 1M context length and achieve a new state-of-the-art result on a 500K length book summarization task after continual pre-training and task fine-tuning.

8. The proposed Infini-attention allows for the computation of both local and global context states and successfully integrates compressive memory systems, enabling LLMs to effectively model both long- and short-range contextual dependencies.

9. The paper highlights the potential of Infini-attention and the Infini-Transformer to process infinitely long contexts with bounded memory and compute resources, including promising length generalization and more efficient use of memory.
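
To make the compressive-memory idea concrete, here is a minimal NumPy sketch of a fixed-size associative memory of the kind the paper describes: keys and values from each segment are folded into a single matrix via outer products, and queries read it back through a nonlinear feature map. The function names, the ELU+1 feature map, and the update rule reflect one reading of the paper's linear-attention formulation and should be treated as illustrative rather than as a faithful reimplementation.

```python
import numpy as np

def elu_plus_one(x):
    # Feature map sigma(x) = ELU(x) + 1, which keeps activations positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def memory_retrieve(M, z, Q):
    # Global read from the compressive memory: A_mem = sigma(Q) M / (sigma(Q) z).
    sQ = elu_plus_one(Q)                     # (seg_len, d_key)
    return (sQ @ M) / (sQ @ z)[:, None]      # (seg_len, d_value)

def memory_update(M, z, K, V):
    # Fold the segment's key-value bindings into the fixed-size memory:
    # M <- M + sigma(K)^T V,  z <- z + sum_t sigma(K_t).
    sK = elu_plus_one(K)
    return M + sK.T @ V, z + sK.sum(axis=0)

# Toy usage: the memory stays (d_key, d_value) no matter how many segments are seen.
d_key = d_value = 64
seg_len = 8
M, z = np.zeros((d_key, d_value)), np.full(d_key, 1e-6)
rng = np.random.default_rng(0)
K, V, Q = (rng.normal(size=(seg_len, d_key)) for _ in range(3))
M, z = memory_update(M, z, K, V)
A_mem = memory_retrieve(M, z, Q)             # (seg_len, d_value) long-term context read
```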

Summary

The research paper introduces a method for scaling Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. The authors propose a new attention technique, Infini-attention, which incorporates a compressive memory into the vanilla attention mechanism and integrates masked local and long-term linear attention mechanisms in a single Transformer block. They demonstrate the effectiveness of their approach on long-context language modeling benchmarks, passkey context block retrieval, and book summarization tasks using 1B and 8B LLMs. The key outcomes of this approach are minimal bounded memory parameters and fast streaming inference for LLMs.

Challenges of Constrained Context-Dependent Memory
The paper discusses the challenge of constrained context-dependent memory in Transformers and Transformer-based LLMs, which stems from the quadratic complexity of the attention mechanism in memory footprint and computation time. As a more scalable and efficient alternative for processing extremely long sequences, the authors turn to compressive memory, which stores contextual information in a fixed set of parameters rather than in a cache that grows with the input. The proposed approach thereby enables Transformer LLMs to process infinitely long inputs with a bounded memory footprint and computation.
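
As a back-of-the-envelope illustration of this contrast, the snippet below compares the per-head context memory of a standard key-value cache, which grows with sequence length (while attention compute grows quadratically), against a compressive memory whose size is fixed by the key and value dimensions. The dimensions are placeholders, not the paper's model configuration.

```python
# Illustrative per-head context-memory sizes (in float entries), not the paper's
# exact accounting: a standard KV cache stores every past key and value, while a
# compressive memory keeps one (d_key x d_value) matrix plus a normalization
# vector, regardless of how long the context grows.
d_key = d_value = 128

def kv_cache_entries(seq_len):
    return seq_len * (d_key + d_value)       # grows linearly with context length

def compressive_memory_entries():
    return d_key * d_value + d_key           # constant, whatever the context length

for n in (2_048, 32_768, 1_000_000):
    print(f"{n:>9} tokens: KV cache {kv_cache_entries(n):>12,} entries "
          f"vs. compressive memory {compressive_memory_entries():,} entries")
```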

The researchers introduce Infini-attention, a practical and powerful attention mechanism that couples long-term compressive memory with local causal attention to efficiently model both long- and short-range contextual dependencies. They also demonstrate that Infini-attention supports plug-and-play continual pre-training and long-context adaptation by design. By processing extremely long inputs in a streaming fashion, the approach lets Transformer LLMs scale to infinitely long contexts with bounded memory and compute resources.
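
The streaming behavior can be sketched as a loop that consumes an arbitrarily long input one fixed-length segment at a time while carrying only a constant-size memory state between segments. The projection matrices, segment length, and update rule below are illustrative assumptions in the spirit of the linear-attention-style memory sketched earlier, not the authors' code.

```python
import numpy as np

def sigma(x):
    # ELU(x) + 1 feature map (assumed, as in the earlier memory sketch).
    return np.where(x > 0, x + 1.0, np.exp(x))

def stream(segments, W_k, W_v):
    """Consume segments one at a time while keeping a bounded memory state (M, z)."""
    d_key, d_value = W_k.shape[1], W_v.shape[1]
    M = np.zeros((d_key, d_value))
    z = np.full(d_key, 1e-6)
    for x in segments:                  # x: (seg_len, d_model), arriving incrementally
        K, V = x @ W_k, x @ W_v
        M = M + sigma(K).T @ V          # fold the segment into the fixed-size memory
        z = z + sigma(K).sum(axis=0)
        yield M, z                      # state size never depends on how much was read

# Toy usage: a "very long" input emitted as 1,000 segments of 16 tokens each.
d_model, d_key, d_value, seg_len = 32, 16, 16, 16
rng = np.random.default_rng(0)
W_k, W_v = rng.normal(size=(d_model, d_key)), rng.normal(size=(d_model, d_value))
segments = (rng.normal(size=(seg_len, d_model)) for _ in range(1_000))
for M, z in stream(segments, W_k, W_v):
    pass                                # per-segment attention would read from (M, z) here
```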

The paper also presents the detailed architecture of Infini-attention, including the computation of both local and global context states and their combination into the final attention output. The authors compare the Infini-Transformer with other models and demonstrate its performance on long-context language modeling and book summarization tasks. They also quantify the efficiency of the approach, reporting a 114x compression ratio in terms of memory size and a new state-of-the-art result on a 500K length book summarization task.
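
A minimal sketch of that combination, under the same assumptions as the earlier memory sketch: within a segment, causally masked dot-product attention provides the local context, a read from the compressive memory provides the global context, and a sigmoid gate on a learned scalar beta mixes the two into the final attention output. Shapes, names, and the scalar gate are illustrative, not a verified reimplementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def infini_attention_segment(Q, K, V, M, z, beta):
    """Combine masked local attention with a global read from the compressive memory."""
    n, d_key = Q.shape
    # Local context: causally masked dot-product attention within the segment.
    scores = Q @ K.T / np.sqrt(d_key)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    A_dot = softmax(scores) @ V
    # Global context: retrieval from the compressive memory (sigma = ELU + 1).
    sQ = np.where(Q > 0, Q + 1.0, np.exp(Q))
    A_mem = (sQ @ M) / (sQ @ z)[:, None]
    # Learned scalar gate beta trades off long-term memory against local attention.
    g = 1.0 / (1.0 + np.exp(-beta))
    return g * A_mem + (1.0 - g) * A_dot

# Toy usage with a memory state assumed to hold earlier segments.
n, d = 8, 16
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
M, z = rng.normal(size=(d, d)), np.abs(rng.normal(size=d)) + 1.0
out = infini_attention_segment(Q, K, V, M, z, beta=0.0)   # (n, d) combined output
```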

The paper concludes by highlighting the contributions of their work and the potential implications for efficient attention mechanisms in large language models.

Reference: https://arxiv.org/abs/2404.071...