Key Points
1. Current long context large language models (LLMs) can process inputs up to 100,000 tokens, but struggle to generate outputs exceeding 2,000 words.
2. The model's output length limitation is due to the scarcity of long-output examples in existing supervised fine-tuning (SFT) datasets, rather than the model's inherent capacity.
3. The authors introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words.
4. The authors construct LongWriter-6k, a dataset of 6,000 SFT examples with output lengths ranging from 2,000 to 32,000 words.
5. By incorporating LongWriter-6k into model training, the authors successfully scale the output length of existing models to over 10,000 words while maintaining output quality.
6. The authors develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities.
7. The authors' 9B parameter model, further improved through DPO, achieves state-of-the-art performance on the LongBench-Write benchmark, surpassing even much larger proprietary models.
8. The authors demonstrate that existing long context LLMs already possess the potential for a larger output window, and only need data with extended output during model alignment to unlock this capability.
9. The authors make their code and models publicly available at https://github.com/THUDM/LongW....
Summary
This research paper addresses the limitations of current large language models (LLMs) in generating long-form outputs. The key findings and methodological approach are as follows:
Research Objectives
The researchers aimed to understand why current long context LLMs, despite processing inputs of up to 100,000 tokens, struggle to generate outputs exceeding even 2,000 words.
Proposed Solution (AgentWrite)
To overcome this limitation, the researchers introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks. This enables off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. A minimal sketch of this plan-then-write idea is shown below.
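The following sketch illustrates the plan-then-write decomposition described above. It assumes a hypothetical `call_llm(prompt)` helper standing in for any chat-completion API; the prompts and orchestration are illustrative, not the authors' implementation.

```python
# Illustrative sketch of an AgentWrite-style plan-then-write pipeline.
# `call_llm` is a hypothetical placeholder for an off-the-shelf LLM call.

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to an off-the-shelf LLM."""
    raise NotImplementedError

def agentwrite(instruction: str, num_sections: int = 10) -> str:
    # Step 1 (plan): ask the model for an outline with a target length per section.
    plan_prompt = (
        f"Break the following writing task into about {num_sections} sections. "
        "For each section, give a one-line description and a target word count.\n\n"
        f"Task: {instruction}"
    )
    outline = [line for line in call_llm(plan_prompt).splitlines() if line.strip()]

    # Step 2 (write): generate each section in turn, conditioning on what has
    # already been written so the sections stay coherent.
    written = []
    for section in outline:
        write_prompt = (
            f"Task: {instruction}\n"
            f"Text written so far (truncated):\n{''.join(written)[-4000:]}\n\n"
            f"Now write only the following section, at its target length:\n{section}"
        )
        written.append(call_llm(write_prompt) + "\n\n")

    # Concatenating the sections yields a single long output (20k+ words in the paper).
    return "".join(written)
```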
LongWriter-6k Dataset
Leveraging AgentWrite, the researchers construct LongWriter-6k, a dataset containing 6,000 supervised fine-tuning (SFT) examples with output lengths ranging from 2,000 to 32,000 words.
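As a hedged sketch, each SFT example could be assembled by pairing a long-form writing prompt with the output of an AgentWrite run; the `messages`/`role`/`content` layout below follows a common chat-SFT convention and is an assumption, not the documented LongWriter-6k schema.

```python
import json

def make_sft_record(instruction: str, long_output: str) -> dict:
    """Assemble one SFT example in a common chat format (assumed layout,
    not the official LongWriter-6k schema)."""
    return {
        "messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": long_output},
        ]
    }

# Example: pair a long-form writing prompt with an AgentWrite-generated answer.
record = make_sft_record(
    "Write a 10,000-word report on the history of renewable energy.",
    "...",  # ultra-long output produced by the AgentWrite pipeline
)
print(json.dumps(record, indent=2)[:200])
```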
Evaluation
The researchers develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. Their 9B parameter model, further improved through direct preference optimization (DPO), achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models. The key contributions of this work are:
1. An analysis showing that the primary factor limiting current LLMs' output length is the output-length constraint in the SFT data.
2. The AgentWrite pipeline, which overcomes this limitation by automatically constructing SFT data with ultra-long outputs.
3. Scaling the output window of existing models to 10,000+ words by incorporating the LongWriter-6k dataset into training, further enhanced by DPO.
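For intuition, the toy scoring function below shows the kind of length-adherence check an evaluation like LongBench-Write needs: full marks when the output matches the required length, decaying with relative deviation. This formula is illustrative only and is not the benchmark's exact metric, which also judges output quality separately.

```python
def length_score(required_words: int, produced_words: int) -> float:
    """Toy length-adherence score in [0, 100]: full marks when the output
    matches the required length, decaying linearly with relative deviation.
    Illustrative only; not the exact LongBench-Write formula."""
    if produced_words <= 0:
        return 0.0
    deviation = abs(produced_words - required_words) / required_words
    return 100.0 * max(0.0, 1.0 - deviation)

# e.g. asking for 10,000 words and getting 8,000 back:
print(length_score(10_000, 8_000))  # -> 80.0
```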
Reference: https://arxiv.org/abs/2408.07055