Key Points

1. Recently developed methods for extending the context length of pre-trained large language models (LLMs) often require fine-tuning at the target length (≫ 4K) and struggle to effectively utilize information from the middle part of the context. PE-based methods have emerged as a practical solution for extending the operational range of LLMs on tasks involving larger context windows, offering the advantages of straightforward implementation and rapid adaptation.

2. "Lost-in-the-Middle" problem faced by long-context LLMs is a significant concern, highlighting the necessity to efficiently extend the context window size of pre-trained LLMs while optimizing their effectiveness in processing "in-the-middle" content.

3. The proposed method, Continuity-Relativity indExing with gAussian Middle (CREAM), efficiently extends the context window size of LLMs while preserving continuity and relativity. CREAM introduces two indexing strategies, one for continuity and one for relativity, and uses a truncated Gaussian distribution for middle-segment sampling to encourage the model to prioritize information in the middle positions during positional interpolation.

4. CREAM successfully extends both the Base and Chat versions of Llama2-7B to the target length ("Never Miss A Beat"), outperforming existing baseline methods such as PoSE and showing promising results on long-context evaluation tasks.

5. CREAM-Base alleviates the Lost-in-the-Middle issue, performing well in retrieving information from long contexts of varying lengths. It outperforms PoSE in the “Lost in the Middle” task, indicating its effectiveness in handling middle-focused tasks.

6. CREAM-Chat demonstrates superior long-context understanding, surpassing strong baseline models and making efficient use of the extended context size. It outperforms SkipAlign in context window expansion and shows promising results across various long-context evaluation tasks.

7. CREAM achieves nearly the same NLU performance as the pre-trained Llama2-7B base model, retaining its NLU abilities and remaining stable across different target context lengths.

8. CREAM shows promising results in the ablation study, validating its modeling choices and demonstrating a good balance between continuity and relativity.

9. CREAM is presented as a simple yet effective method for extending the context of large language models, achieving a trade-off between continuity and relativity and effectively mitigating the "lost in the middle" issue. It outperforms other methods and has the potential to significantly enhance the performance of long-context LLMs.

Summary

Introduction of CREAM
The paper introduces a novel method named Continuity-Relativity indExing with gAussian Middle (CREAM) to extend the context length of pre-trained large language models (LLMs) efficiently and effectively. Existing methods for extending the context length of LLMs often require extensive fine-tuning at the target length and struggle to utilize information from the middle part of the context. CREAM addresses these issues by interpolating positional encodings and introducing a truncated Gaussian sampling technique that encourages the model to focus on the middle part of the context during fine-tuning. The method is training-efficient: it requires fine-tuning only at the pre-trained context window size while extending LLMs to a much longer target context length, e.g., 256K tokens.
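As background for the positional-encoding interpolation mentioned above, the following minimal sketch shows linear positional interpolation applied to RoPE: positions from the longer target window are rescaled back into the pre-trained range before the rotation angles are computed. The function names, dimensions, and scaling factor here are illustrative assumptions, not the paper's implementation.

```python
import torch

def rope_angles(position_ids: torch.Tensor, dim: int = 128, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE: one rotation frequency per pair of hidden dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return position_ids.float()[:, None] * inv_freq[None, :]   # shape (seq_len, dim // 2)

def interpolated_positions(seq_len: int, pretrained_len: int = 4096, target_len: int = 32768) -> torch.Tensor:
    # Linear positional interpolation: squeeze target-length positions into the
    # pre-trained range so every rotation angle stays within the trained distribution.
    scale = pretrained_len / target_len                          # e.g. 4096 / 32768 = 1/8
    return torch.arange(seq_len) * scale                         # fractional positions in [0, pretrained_len)

# Positions 0..32767 are mapped into [0, 4096) before computing the RoPE angles.
angles = rope_angles(interpolated_positions(32768))
```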

Truncated Gaussian Distribution Technique
To ensure that the model focuses on information in the middle, the paper introduces a truncated Gaussian distribution for middle-segment sampling. This technique enables the LLM to prioritize information in the middle positions even though positional interpolation is performed entirely within the pre-trained context window size. CREAM combines this sampling with its division of the context window and its continuity- and relativity-oriented positional indexing, and it is shown to extend LLMs to longer context lengths while outperforming other methods.
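To make the sampling step concrete, the sketch below builds a position-index sequence of pre-trained length in which the head and tail keep continuous indices at the two ends of the much longer target window, while the starting index of the middle chunk is drawn from a truncated Gaussian centred on the middle of the target range. The segment sizes, Gaussian width, and function name are assumptions chosen for illustration, not the paper's exact recipe.

```python
import numpy as np
from scipy.stats import truncnorm

def middle_focused_position_ids(pretrained_len=4096, target_len=32768,
                                head=1365, tail=1365, sigma_frac=0.25, rng=None):
    # Hypothetical sketch: head/tail segments keep continuous indices at the two
    # ends of the target window; the middle chunk's start is Gaussian-sampled so
    # that mid-window positions are seen often during fine-tuning.
    rng = rng or np.random.default_rng()
    mid_len = pretrained_len - head - tail              # tokens whose indices get relocated
    lo, hi = head, target_len - tail - mid_len          # valid range for the middle start
    mu, sigma = (lo + hi) / 2, sigma_frac * (hi - lo)   # centre the Gaussian on the middle
    a, b = (lo - mu) / sigma, (hi - mu) / sigma         # truncation bounds in sigma units
    mid_start = int(truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng))

    return np.concatenate([
        np.arange(0, head),                              # continuous head: 0 .. head-1
        np.arange(mid_start, mid_start + mid_len),       # middle chunk with sampled start
        np.arange(target_len - tail, target_len),        # continuous tail at the far end
    ])

# Each fine-tuning step sees pretrained_len tokens whose indices span the full
# 0 .. target_len-1 range, with the middle of that range sampled most often.
position_ids = middle_focused_position_ids()
```

Because the Gaussian is centred on the middle of the target range, middle positions are over-represented during fine-tuning, which is the intuition the paper gives for mitigating "lost in the middle".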

Experimental Results and Evaluation
The experimental results demonstrate the effectiveness of CREAM in extending LLMs to the target length for both the Base and Chat versions of Llama2-7B ("Never Miss A Beat"). CREAM is further evaluated on tasks such as LongChat-Lines and "Lost in the Middle," where it consistently outperforms other methods. Furthermore, CREAM-Chat shows superior performance on tasks such as Needle in A Haystack and LongBench, even after instruction-tuning for only 100 steps. The method also maintains the NLU abilities of the pre-trained base model and demonstrates stability across different target context lengths, exemplified by extending the context length to 256K tokens.

The paper concludes by emphasizing that CREAM achieves a trade-off between continuity and relativity, thereby effectively mitigating the issue of "lost in the middle" and outperforming other methods on both Base and Chat models.

Reference: https://arxiv.org/abs/2406.07138