Key Points
- The paper introduces YaRN (Yet another RoPE extensioN method) as a compute-efficient method to extend the context window of transformer-based language models, allowing them to utilize and extrapolate to context lengths far longer than those seen during pre-training. YaRN requires 10x fewer tokens and 2.5x fewer training steps than previous methods.
- The study discusses the limitations of position encodings in transformer-based language models in generalizing past the context window seen during training.
- Several position interpolation methods, including "NTK-aware" and "NTK-by-parts" interpolation, are proposed to extend the context length of models trained with Rotary Position Embeddings (RoPE); see the sketch following this list.
- "Dynamic NTK" interpolation method, which allows for more than 2x context window extension without any fine-tuning, is introduced, along with the "Dynamic Scaling" inference-time technique.
- The YaRN method, which combines attention scaling with "NTK-by-parts" interpolation, surpasses previous methods in both fine-tuned and non-fine-tuned scenarios, achieving context window extension with only 400 training steps, representing approximately 0.1% of the model's original pre-training corpus.
- Evaluations of the YaRN method demonstrate its successful context window extension, with strong performance across the entire targeted context size and minimal performance degradation compared to the Llama 2 baselines.
- The research provides a thorough evaluation of the YaRN method on various benchmarks and datasets, including long-sequence language modeling performance, the passkey retrieval task, and the Hugging Face Open LLM benchmark suite.
- The study concludes that YaRN improves upon all existing RoPE interpolation methods, preserving the models' abilities on multiple benchmarks while attending to very large context sizes. The paper also provides the code used for training YaRN and implementing various extension methods for reproducibility purposes.
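To make the interpolation schemes concrete, the sketch below (Python, not taken from the paper's repository) illustrates how they modify RoPE's per-dimension rotation frequencies. The hyperparameters (base 10000, head dimension 128, original context 4096, ramp bounds α = 1 and β = 32) follow the Llama-style setup discussed in the paper, and the function names are made up for this example.

```python
import math
import numpy as np

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies: theta_d = base^(-2d/|D|)."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def ntk_aware_inv_freq(head_dim: int, scale: float, base: float = 10000.0) -> np.ndarray:
    """'NTK-aware' interpolation: change the base, b' = b * s^(|D|/(|D|-2)),
    so high-frequency dimensions are stretched less than low-frequency ones."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_inv_freq(head_dim, new_base)

def ntk_by_parts_inv_freq(head_dim: int, scale: float, original_ctx: int = 4096,
                          base: float = 10000.0,
                          alpha: float = 1.0, beta: float = 32.0) -> np.ndarray:
    """'NTK-by-parts' interpolation: blend between full position interpolation
    (theta_d / s) and no interpolation (theta_d), depending on how many full
    rotations a dimension completes within the original context window."""
    inv_freq = rope_inv_freq(head_dim, base)
    wavelength = 2 * math.pi / inv_freq
    r = original_ctx / wavelength                        # rotations per context window
    gamma = np.clip((r - alpha) / (beta - alpha), 0, 1)  # 0 -> interpolate, 1 -> keep as-is
    return (1 - gamma) * inv_freq / scale + gamma * inv_freq
```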
Summary
The paper discusses the limitations of transformer-based language models in generalizing past the sequence length they were trained on and presents a new method called YaRN (Yet another RoPE extensioN method) to extend the context window of such models. YaRN offers significant efficiency improvements, requiring 10x fewer tokens and 2.5x fewer training steps than previous methods. The study demonstrates that using YaRN, language models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, surpassing the state of the art in context window extension.
Furthermore, the YaRN method exhibits the capability to extrapolate beyond the limited context of a fine-tuning dataset. The paper introduces various techniques, such as "NTK-aware" interpolation, "NTK-by-parts" interpolation, Dynamic Scaling, and the YaRN method, designed to address the limitations of existing positional encoding schemes in transformer models for context window extension. The proposed YaRN method surpasses all previous methods in both fine-tuned and non-fine-tuned scenarios, achieving context window extension with minimal training steps and data, making it highly compute-efficient. The study also includes evaluations of YaRN using various benchmarks and datasets, demonstrating strong performance across extended context window sizes and on different tasks, including language modeling and passkey retrieval.
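The remaining two ingredients can be sketched in a few lines, continuing the assumptions of the earlier snippet. The temperature formula sqrt(1/t) = 0.1 ln(s) + 1 is the value the paper recommends for Llama models; the helper names below are hypothetical, not from the reference implementation.

```python
import math

def dynamic_scale(current_len: int, original_ctx: int = 4096) -> float:
    """'Dynamic Scaling': recompute the scale factor s at inference time from
    the current sequence length, so sequences within the original window
    are left unmodified."""
    return max(1.0, current_len / original_ctx)

def yarn_attention_scale(scale: float) -> float:
    """YaRN's attention temperature: queries and keys are multiplied by
    sqrt(1/t) = 0.1 * ln(s) + 1, which is equivalent to dividing the
    attention logits by t with no extra inference-time cost."""
    return 0.1 * math.log(scale) + 1.0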
The paper concludes by highlighting the effectiveness of YaRN as a drop-in replacement for existing methods, preserving original abilities on multiple benchmarks while being able to attend to a very large context size. Overall, the findings suggest that YaRN represents a significant advancement in extending the context window of language models, with implications for improved performance and efficiency in various natural language processing tasks.
Reference: https://arxiv.org/abs/2309.00071