Key Points

1. The paper introduces a series of large language models (LLMs) with effective context windows of up to 32,768 tokens, achieved through continual pretraining from Llama 2 with longer training sequences and upsampling long texts in the dataset.

2. The authors extensively evaluate the models using language modeling, synthetic tasks, and a wide range of real-world benchmarks, showing consistent improvements on most regular tasks and significant improvements on long-context tasks compared to Llama 2.

3. The paper provides an in-depth analysis of the individual components of the method, delving into Llama's position encodings and their limitations in modeling long dependencies, as well as the impact of various design choices in the pretraining process.

4. The models demonstrate clear power-law scaling behavior with respect to context length, indicating that they consistently benefit from longer contexts and suggesting context length as an important axis for scaling LLMs.

5. The paper highlights the importance of maintaining strong performance on standard short-context tasks and demonstrates that the models remain robust in these scenarios.

6. The research identifies a key limitation of Llama 2's positional encoding that hinders attention aggregation over distant tokens and proposes a modification to the RoPE positional encoding, which outperforms a concurrent approach for extending Llama's context length (see the sketch after this list).

7. The study shows that data quality plays a more critical role than text length for long-context continual pretraining and demonstrates the efficiency of continual pretraining compared to training from scratch with long sequences.

8. The authors explore various strategies for instruction finetuning the pre-trained long-context model without requiring any supervised long data, showcasing considerable performance improvements in downstream tasks.

9. The paper evaluates the safety of the instruction-finetuned model using standard academic benchmarks, showing that it maintains safety performance comparable to Llama 2 Chat and is relatively safe compared to other open-source LLMs.
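To make point 6 concrete, below is a minimal NumPy sketch of rotary position embeddings (RoPE) with a configurable base frequency. The paper's modification adjusts how quickly rotation angles grow with position; the function name, tensor shapes, and the specific base values shown here are illustrative assumptions rather than the paper's exact implementation, and the precise base used for the long-context models should be checked against the paper itself.

```python
import numpy as np

def rope_rotation(x, positions, base=10_000.0):
    """Apply rotary position embeddings (RoPE) to a tensor of shape
    (seq_len, head_dim). `base` controls the rotation frequencies:
    a larger base shrinks the rotation angles, so attention between
    far-apart tokens is attenuated less."""
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # Per-dimension inverse frequencies: theta_i = base^(-2i / head_dim)
    inv_freq = base ** (-np.arange(half) * 2.0 / head_dim)
    # Rotation angle for each (position, dimension) pair
    angles = np.outer(positions, inv_freq)          # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each pair of dimensions by its position-dependent angle
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Illustrative comparison: Llama 2's default base versus a much larger base
# (the exact long-context value is an assumption here, not taken from this summary).
q = np.random.randn(32_768, 128)
positions = np.arange(32_768)
q_default_base = rope_rotation(q, positions, base=10_000.0)
q_larger_base = rope_rotation(q, positions, base=500_000.0)
```

Raising the base stretches the wavelengths of the rotary frequencies, so distant token pairs are rotated out of alignment more slowly, which is the kind of behavior needed for useful attention over 32,768-token contexts.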

Summary

The paper presents a series of long-context Large Language Models (LLMs) that support context windows of up to 32,768 tokens. The models are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. The paper extensively evaluates the models on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. The evaluations show consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. The models also surpass the overall performance of gpt-3.5-turbo-16k on long-context benchmarks.
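On the data side, the recipe upsamples long documents in the continual-pretraining mix. The snippet below is only a hypothetical illustration of length-based upsampling; the threshold, boost factor, and field names are assumptions and do not reflect the paper's actual mix ratios.

```python
import random

def upsample_long_documents(documents, length_threshold=8_192, boost=4):
    """Return a sampling pool in which documents longer than
    `length_threshold` tokens appear `boost` times as often.
    Hypothetical illustration; the paper's actual ratios differ."""
    pool = []
    for doc in documents:
        copies = boost if doc["num_tokens"] >= length_threshold else 1
        pool.extend([doc] * copies)
    return pool

# Example: sample a training batch from the reweighted pool.
documents = [
    {"id": "a", "num_tokens": 1_024},
    {"id": "b", "num_tokens": 16_384},
    {"id": "c", "num_tokens": 40_000},
]
pool = upsample_long_documents(documents)
batch = random.choices(pool, k=8)
```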

The methodology combines a continual pretraining approach with a lightweight instruction tuning procedure. The paper also explores the models' scaling behavior with respect to context length, their performance relative to Llama 2 on research benchmarks and to gpt-3.5-turbo-16k on long-context benchmarks, and the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths.
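For the scaling analysis mentioned above, validation loss is related to context length via a power-law-style fit. The sketch below shows the general shape of such a fit using SciPy with a power-law-plus-constant form; the data points, initial guesses, and fitted constants are placeholders, and the exact functional form used in the paper may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(c, alpha, beta, gamma):
    """loss(c) = alpha * c**beta + gamma; beta < 0 means longer
    contexts keep reducing the loss, with diminishing returns."""
    return alpha * np.power(c, beta) + gamma

# Placeholder measurements of validation loss at several context lengths
# (illustrative values only, not results from the paper).
context_lengths = np.array([2_048, 4_096, 8_192, 16_384, 32_768], dtype=float)
losses = np.array([2.05, 1.98, 1.93, 1.90, 1.88])

params, _ = curve_fit(power_law, context_lengths, losses, p0=[5.0, -0.3, 1.5])
alpha, beta, gamma = params
print(f"fitted exponent beta = {beta:.3f}")
```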

Model Evaluation and Safety Capability

The models are evaluated on language modeling, synthetic tasks, and real-world benchmarks covering both long- and short-context tasks. Additionally, the paper assesses the safety of the instruction-finetuned models using benchmarks such as TruthfulQA, ToxiGen, and BOLD to evaluate model safety and bias. The research provides an in-depth analysis of the method's components, including positional encodings, data mix, and training curriculum, and evaluates their contributions to model performance.

Overall, the study demonstrates that the resulting models achieve strong long-context performance, surpass existing open-source alternatives, and address safety and bias concerns. The paper offers insights for further improving long-context LLMs and aims to make them more accessible for future advancements in the field.

Reference: https://arxiv.org/abs/2309.16039