Summary
This paper explores methods for extending the context length of large language models (LLMs) that use attention mechanisms. The authors survey existing context length extrapolation methods and propose new techniques of their own, including a truncated basis for position encodings. They test these methods on three new evaluation tasks as well as on perplexity. They find that linear scaling is the most effective method for extending context length, and they report promising extrapolation results with the truncated basis approach. They release three new long-context models, called Giraffe, along with the code to replicate their results.
The paper opens with the dominance of transformers in natural language modeling and the importance of giving LLMs positional information. The authors argue that extrapolating to longer context lengths is crucial for tasks such as reading long documents, holding longer conversations with chatbots, and working with larger codebases. They divide context length extrapolation into two categories: finetuned extrapolation, where a model pretrained on shorter contexts has its weights updated on longer contexts, and zero-shot extrapolation, where a model pretrained on short contexts is evaluated on longer contexts without further training. The paper focuses on zero-shot extrapolation and evaluates the different methods on the new evaluation tasks.
Beyond identifying linear scaling as the best method for extending context length, the authors show that using longer scales at evaluation time can further improve performance. They release the weights of the Giraffe models and three new evaluation datasets for assessing long-context performance. They argue that perplexity alone is not a sufficient measure of long-context performance and that the evaluation tasks they introduce give a more accurate assessment.
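As a rough illustration of what linear scaling means, the sketch below assumes rotary position embeddings (RoPE), which this summary does not name explicitly. The function names, `head_dim`, `base`, and the scale factor of 4 are illustrative choices, not the authors' code. Under that assumption, linear scaling simply divides each token position by a constant before the rotation angles are computed, so positions in a longer input map back into the range seen during training.

```python
import math


def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position under RoPE (assumed encoding).

    scale > 1 applies linear scaling: the position is divided by `scale`
    before the angles are computed.
    """
    scaled_position = position / scale
    return [
        scaled_position * base ** (-2.0 * i / head_dim)
        for i in range(head_dim // 2)
    ]


def rotate(vector, angles):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the given angles."""
    out = []
    for i, angle in enumerate(angles):
        x, y = vector[2 * i], vector[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


# Hypothetical numbers: with scale=4, position 8192 receives the same
# angles that unscaled position 2048 would have received.
scaled = rope_angles(8192, head_dim=8, scale=4.0)
unscaled = rope_angles(2048, head_dim=8)
query = rotate([1.0, 0.0, 0.5, -0.5, 2.0, 1.0, 0.0, 3.0], scaled)
```

In this sketch, raising `scale` at evaluation time compresses positions further, which is one way to read the summary's remark about using longer scales at evaluation time.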
Overall, the paper offers a broad survey of context length extrapolation methods and compares how the different techniques perform on the new evaluation tasks and on perplexity.
Key points
1. Modern large language models (LLMs) with attention mechanisms are typically trained with fixed context lengths, which cap the input length they can handle at evaluation time.
2. Techniques for extending context length include modifying positional encodings.
3. Of the methods surveyed, linear scaling is the best for extending context length.
4. Longer scales at evaluation time can further improve performance.
5. Truncation of the basis for position encoding shows promising extrapolation capabilities.
6. Long-context models perform better on the three new evaluation tasks: FreeFormQA, AlteredNumericQA, and LongChat-Lines.
7. Perplexity is a less fine-grained measure of long-context performance compared to the evaluation tasks.
8. The linear scaling method can be applied after finetuning with the truncated basis for additional performance gains.
9. Open questions for future research include why accuracy degrades as context length increases and what the limits and possibilities of different positional encoding methods are.
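To make the point about perplexity in the key points above more concrete, here is a minimal sketch of how perplexity is commonly computed: the exponential of the mean negative log-likelihood per token. The probabilities are made-up numbers, and the comparison is only an illustration of why a single averaged score can be coarse, not the authors' own argument.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical case A: every one of 8 tokens is predicted with probability 0.5.
even = [math.log(0.5)] * 8

# Hypothetical case B: 7 tokens are predicted well, but one token (say, the
# answer that depends on distant context) gets probability 0.01. The value of
# p is chosen so that the total likelihood matches case A.
p = (0.5 ** 8 / 0.01) ** (1.0 / 7.0)
one_miss = [math.log(p)] * 7 + [math.log(0.01)]

ppl_even = perplexity(even)
ppl_one_miss = perplexity(one_miss)
```

Both cases yield the same perplexity up to floating-point error, even though case B almost entirely misses one token. A task-level accuracy check on that token would tell the two cases apart.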