Key Points
1. The authors present NeedleBench, a framework for evaluating the long-context capabilities of large language models (LLMs). The framework consists of progressively more challenging bilingual tasks spanning multiple length intervals and depth ranges, designed to rigorously test models' retrieval and reasoning capabilities in diverse contexts. The authors also propose the Ancestral Trace Challenge (ATC) to mimic the complexity of the logical reasoning challenges found in real-world long-context tasks.
2. The importance of LLMs being able to handle long texts is emphasized, and the development of LLMs with long-context capabilities, such as GPT-4 Turbo, Claude 2.1, and Gemini 1.5, is discussed. Existing datasets such as LongBench provide a benchmark for long-text comprehension, but accurately assessing the performance of LLMs at the 1M-token level remains a challenge.
3. The paper compares existing evaluation methods for long-context LLMs, such as passkey testing, InfiniteBench, and the Needle In A Haystack (NIAH) test, stressing the need for both accurate information retrieval and strong reasoning capabilities.
4. NeedleBench comprises tasks for evaluating the bilingual long-context capabilities of LLMs, including the Single-Needle Retrieval Task (S-RT), Multi-Needle Retrieval Task (M-RT), and Multi-Needle Reasoning Task (M-RS), which together assess models' abilities to extract and analyze information within long texts. The ATC test is introduced as a simplified proxy for measuring multi-step logical reasoning.
5. The Ancestral Trace Challenge (ATC) experiment is detailed: a problem is constructed from simple first-order logical inferences that form an information chain which the LLM must follow in full to answer the question (a minimal construction sketch appears after this list). The method can be extended to more challenging logical relationships, allowing the multi-step reasoning capabilities of LLMs to be stress-tested.
6. The paper describes the design of the Chinese and English haystacks; the Chinese haystack is built from the ChineseDomainModelingEval dataset and covers a wide range of topics, from finance to technology. The design of the needles in the retrieval and reasoning tasks is also described in detail (see the placement sketch after this list), ensuring a high-quality dataset for evaluating the models' bilingual retrieval and reasoning capabilities.
7. The performance of mainstream open-source LLMs on NeedleBench is evaluated at token lengths of 4K, 8K, 32K, and 200K. The Levenshtein distance is used to measure the similarity between predictions and references for specific tasks (see the scoring sketch after this list), and main results are presented for the 32K and 200K context lengths.
8. A detailed experimental setting is described, including the recall accuracy of needles placed at different positions as the metric for evaluating model performance, and the impact of the question prompt's placement on the results.
9. The main findings are presented, including the strong Single-Retrieval performance of InternLM2-7B-200K, the advantage that Qwen-1.5-72B-vLLM's substantially larger parameter count gives it in Multi-Reasoning, and the general trend that models with larger parameter counts achieve better overall performance.
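The following is a minimal Python sketch of how an ATC-style question (point 5) can be assembled from single-step relational facts. The name list, relation template, and helper function are illustrative placeholders, not the authors' actual data or code:

```python
import random

# Hypothetical pool of names and a single first-order relation template;
# the paper's actual templates and name lists may differ.
NAMES = ["Alice", "Bob", "Carol", "David", "Eve", "Frank", "Grace", "Heidi"]
TEMPLATE = "{child}'s father is {parent}."

def build_atc_question(chain_length: int, seed: int = 0):
    """Build a chain of single-step relations plus a question that can only
    be answered by following every link in the chain."""
    rng = random.Random(seed)
    people = rng.sample(NAMES, chain_length + 1)
    # Each statement links person i to person i + 1.
    statements = [
        TEMPLATE.format(child=people[i], parent=people[i + 1])
        for i in range(chain_length)
    ]
    rng.shuffle(statements)  # shuffle so the chain is not presented in order
    question = (
        f"Given the statements above, who is the earliest ancestor of "
        f"{people[0]} that can be traced?"
    )
    prompt = "\n".join(statements) + "\n\n" + question
    answer = people[-1]  # the last link in the chain
    return prompt, answer

prompt, answer = build_atc_question(chain_length=5)
print(prompt)
print("Expected answer:", answer)
```

Lengthening the chain (or swapping in more complex relations) directly increases the number of reasoning steps required, which is how the stress test scales.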
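The retrieval tasks (point 6) hide "needles" at controlled depths within the haystack text. Below is a rough character-level illustration of that placement logic; the helper is hypothetical, and a real implementation would typically work in tokens and respect sentence boundaries:

```python
def insert_needle(haystack: str, needle: str, depth_percent: float, context_length: int) -> str:
    """Truncate the haystack to roughly `context_length` characters and place
    the needle at `depth_percent` of the way through (0 = start, 100 = end)."""
    body = haystack[: max(context_length - len(needle), 0)]
    position = int(len(body) * depth_percent / 100)
    return body[:position] + needle + body[position:]

# Example: hide a fact 25% of the way into a ~4,000-character context.
haystack = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 200
needle = "The hidden passkey is 7421. "
context = insert_needle(haystack, needle, depth_percent=25, context_length=4000)
print(context.find(needle) / len(context))  # roughly 0.25
```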
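Point 7 notes that Levenshtein distance is used to score predictions against references. A common way to convert edit distance into a 0-100 similarity score is sketched below; the normalization is illustrative and may differ from the exact formula used in the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]

def similarity_score(prediction: str, reference: str) -> float:
    """Map edit distance to a 0-100 score: identical strings score 100."""
    if not prediction and not reference:
        return 100.0
    distance = levenshtein(prediction, reference)
    return 100.0 * (1 - distance / max(len(prediction), len(reference)))

print(similarity_score("The passkey is 7421", "The passkey is 7421."))
```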
Summary
The paper introduces the NeedleBench framework to evaluate the long-context capabilities of large language models (LLMs) in bilingual long texts and assess their ability to identify relevant information and apply it to reasoning. The framework comprises progressively challenging tasks spanning different length intervals and depth ranges to rigorously test the retrieval and reasoning capabilities of models in diverse contexts. The authors use the NeedleBench framework to assess leading open-source models and propose the Ancestral Trace Challenge (ATC) to evaluate LLMs in complex long-context situations. The study emphasizes the importance of LLMs' ability to process long texts in applications such as legal document retrieval, academic research, and business intelligence analysis. To meet this need, recent LLMs support longer context windows, with some models accommodating text lengths of up to millions of tokens. Existing datasets, such as the LongBench dataset, provide a benchmark for evaluating LLMs' comprehension of long texts. However, accurately evaluating LLMs' performance, especially at the 1M token level, remains a significant challenge.
The NeedleBench Framework: Subtasks and the Ancestral Trace Challenge
The NeedleBench framework consists of three subtasks: Single-Needle Retrieval Task (S-RT), Multi-Needle Retrieval Task (M-RT), and Multi-Needle Reasoning Task (M-RS). These tasks assess LLMs' abilities to recall single or multiple pieces of information and engage in complex reasoning across long texts. Additionally, the Ancestral Trace Challenge (ATC) is introduced to test LLMs' multi-step logical reasoning capabilities, demonstrating that current LLMs struggle with complex reasoning challenges in real-world long-context tasks, even with texts shorter than 2K tokens.
Evaluation of LLM Performance on NeedleBench
The study evaluates the performance of mainstream open-source LLMs on NeedleBench at token lengths of 4K, 8K, 32K, and 200K, and includes leading API models in the ATC experiment. The findings indicate that models with larger parameter counts tend to achieve higher average scores, but that even these models struggle with the reasoning tasks, leaving considerable room for improvement in practical long-context applications. The authors provide a comprehensive evaluation and analysis of how well mainstream models identify key question-relevant information and reason over it.
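As a rough illustration of how per-position recall scores can be rolled up into the averages reported per context length, consider the sketch below. All scores here are made-up placeholders, and the paper's exact weighting may differ:

```python
from statistics import mean

# Hypothetical per-cell scores: scores[context_length][depth_percent] in [0, 100].
scores = {
    4000:  {0: 100.0, 25: 98.0, 50: 95.0, 75: 90.0, 100: 99.0},
    8000:  {0: 99.0,  25: 94.0, 50: 88.0, 75: 85.0, 100: 97.0},
    32000: {0: 95.0,  25: 80.0, 50: 70.0, 75: 72.0, 100: 93.0},
}

# Average over depths first, then over context lengths, so every length
# interval contributes equally to the overall score.
per_length = {length: mean(depths.values()) for length, depths in scores.items()}
overall = mean(per_length.values())
print(per_length, overall)
```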
Additionally, the paper introduces the ATC as a simplified proxy for measuring multi-step logical reasoning, demonstrating that current LLMs struggle with reasoning in complex long-context scenarios. The authors make all reproducible scripts, code, and datasets available for further research and evaluation.
Reference: https://arxiv.org/abs/2407.11963