Key Points

1. Recent progress in efficient attention mechanisms has expanded the context length of large language models, enabling them to process much longer sequences of text.

2. Retrieval-Augmented Generation (RAG) has emerged as an alternative to long-context LLMs, using a retriever to dynamically select relevant context for the generator.

3. Evaluating the output quality of long-context LLMs and RAG systems remains challenging, as tasks like Needle-in-a-Haystack lack complexity.

4. The authors propose to use summarization as a testbed for evaluating long-context models and RAG systems, as it requires reasoning over long contexts and understanding the relative importance of content.

5. The authors designed a procedure to synthetically generate "Haystacks" of documents, ensuring that specific insights repeat across documents, and created the "Summary of a Haystack" (SummHay) task.

6. The SummHay task requires systems to process the Haystack and generate a summary that identifies the relevant insights and precisely cites the source documents.

7. The authors implemented an automatic evaluation protocol to score summaries on Coverage (presence of expected insights) and Citation (quality of document attribution).

8. The authors' large-scale evaluation of 10 LLMs and 50 RAG systems shows that SummHay is an open challenge, with even systems given an oracle signal of document relevance lagging human performance.

9. The authors hope that future systems can equal and surpass human performance on SummHay, providing more reliable and trustworthy answer engines.

Summary

The paper introduces a new task called "Summary of a Haystack" (SummHay) to assess the performance of large language models (LLMs) and retrieval-augmented generation (RAG) systems on long-context tasks. The goal of the SummHay task is for a system to process a large "Haystack" of documents (typically around 100 documents totaling 100k tokens) and generate a summary that identifies the relevant insights across the documents and precisely cites the source documents.

The researchers developed a careful procedure to synthesize these Haystacks, ensuring that specific insights repeat across the documents. By precisely controlling the distribution of insights across the documents, the researchers can automatically evaluate how well a system's summary covers the expected insights and how accurately it cites the source documents.
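
As a rough illustration of that evaluation protocol, the sketch below scores a summary against the gold insights, assuming the matching of summary bullets to gold insights has already been done (the paper uses an LLM-based judge for that step). The GoldInsight and MatchedBullet structures and the exact joint formula are illustrative, not the paper's implementation.

```python
# Minimal sketch of a SummHay-style scoring routine. The gold data (insights
# and the documents each insight was inserted into) comes from the controlled
# Haystack generation; bullet-to-insight matching is assumed to be given.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GoldInsight:
    insight_id: str
    source_docs: set[str]          # documents the insight was inserted into

@dataclass
class MatchedBullet:
    cited_docs: set[str]           # documents the summary bullet cites

def citation_f1(cited: set[str], gold: set[str]) -> float:
    """Precision/recall F1 of cited documents against the gold source documents."""
    if not cited or not gold:
        return 0.0
    tp = len(cited & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(cited), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def score_summary(gold: list[GoldInsight],
                  matches: dict[str, Optional[MatchedBullet]]) -> dict[str, float]:
    """Coverage: fraction of gold insights matched by some bullet.
    Citation: mean citation F1 over covered insights.
    Joint: per-insight coverage weighted by citation quality, averaged over all insights."""
    covered, cite_scores, joint_scores = 0, [], []
    for ins in gold:
        bullet = matches.get(ins.insight_id)
        if bullet is None:
            joint_scores.append(0.0)
            continue
        covered += 1
        f1 = citation_f1(bullet.cited_docs, ins.source_docs)
        cite_scores.append(f1)
        joint_scores.append(f1)
    n = len(gold)
    return {
        "coverage": covered / n,
        "citation": sum(cite_scores) / len(cite_scores) if cite_scores else 0.0,
        "joint": sum(joint_scores) / n,
    }
```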

SummHay Task Results
The paper reports the results of evaluating 10 LLMs and 50 RAG systems on the SummHay task. The findings indicate that SummHay is an open challenge for current systems. Even when provided with an oracle signal of document relevance, the top-performing models lag the researchers' estimate of human performance by over 10 points on the joint coverage and citation score.

System Performance Evaluation
Without a retriever, long-context LLMs like GPT-4o and Claude 3 Opus score below 20% on the joint metric. RAG systems improve citation quality compared to the long-context LLMs, but at the cost of insight coverage. Using more advanced RAG components, such as Cohere's Rerank3 model, leads to performance boosts, confirming SummHay as a viable option for holistic RAG evaluation.
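
The sketch below illustrates the retrieve-then-summarize setup that such RAG systems follow: rank the documents against the query, keep the top-k, and ask the generator to summarize with document-level citations. The word-overlap scorer and the call_llm placeholder are stand-ins for whatever retriever, reranker, and generator a specific system uses; this does not reproduce any particular system from the paper.

```python
# Minimal retrieve-then-summarize sketch for a SummHay-style query.
# `haystack` maps document IDs to document text.

def score_relevance(query: str, document: str) -> float:
    """Toy relevance score (word overlap); a real system would use an
    embedding retriever and possibly a reranker here."""
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / max(len(q), 1)

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the generator LLM of choice."""
    raise NotImplementedError("plug in an LLM client here")

def rag_summarize(query: str, haystack: dict[str, str], top_k: int = 15) -> str:
    # Rank documents by relevance and keep only the top-k for the generator.
    ranked = sorted(haystack.items(),
                    key=lambda kv: score_relevance(query, kv[1]),
                    reverse=True)[:top_k]
    context = "\n\n".join(f"[{doc_id}]\n{text}" for doc_id, text in ranked)
    prompt = (
        f"Query: {query}\n\n{context}\n\n"
        "Write a bulleted summary of the insights relevant to the query. "
        "Cite the supporting documents for each bullet using their [doc_id]."
    )
    return call_llm(prompt)
```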

The researchers also demonstrate that SummHay can be used to study biases in long-context models, showing that most LLMs exhibit a position bias, favoring information at the top or bottom of the context window over the middle.
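
One simple way to probe such a bias with SummHay's controlled insight placement is to bucket gold insights by where their content sits in the context window and compare coverage across buckets. The thirds-based bucketing below is an illustrative choice and need not match the paper's exact analysis.

```python
# Minimal position-bias probe: coverage rate per context-position bucket,
# assuming each insight's relative position in [0, 1] is known from the
# controlled document ordering.
from collections import defaultdict

def coverage_by_position(insight_positions: dict[str, float],
                         covered: set[str]) -> dict[str, float]:
    """insight_positions maps insight_id -> relative position
    (0 = start of context, 1 = end); covered holds the ids the summary hit."""
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [covered_count, total]
    for ins_id, pos in insight_positions.items():
        bucket = "top" if pos < 1/3 else "middle" if pos < 2/3 else "bottom"
        buckets[bucket][1] += 1
        buckets[bucket][0] += ins_id in covered
    return {b: c / t for b, (c, t) in buckets.items() if t}
```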

Overall, the paper argues that summarization can play a central role in evaluating the capabilities of long-context LLMs and RAG systems. The SummHay benchmark, with its precise control over information distribution and automated evaluation, provides a robust framework for driving progress towards systems that can match or surpass human performance on long-context summarization tasks.

Reference: https://arxiv.org/abs/2407.01370