Key Points

1. The traditional RAG framework operates on short retrieval units, forcing the retriever to search a large corpus for the relevant piece. LongRAG instead processes the entire Wikipedia corpus into 4K-token units, reducing the total number of units from 22M to 600K and markedly improving the retrieval score. LongRAG also achieves an EM of 62.7% on NQ and 64.3% on HotpotQA without requiring any training.

2. LongRAG pairs a "long retriever" with a "long reader" and significantly reduces the corpus size, lowering the burden on the retriever and improving answer recall. Each long retrieval unit amalgamates comprehensive information from related documents and can be used directly to answer multi-hop questions without iterative retrieval.

3. The study demonstrates that long-context retrieval substantially alleviates the burden on the retriever model, enhances top-1 answer recall, and requires far fewer retrieval units to achieve comparable results.

4. The long retriever identifies coarse-grained relevant information for a given query by searching over all long retrieval units in the corpus. Only the top 4 to 8 retrieval units (without re-ranking) are passed to the next step. The long reader then extracts the answer from the concatenation of the retrieved units, which is typically around 30K tokens.

5. LongRAG's retrieval performance is measured using Answer Recall (AR) and Recall (R), and it achieves notable improvements in both metrics. It is evaluated on two Wikipedia-based question-answering datasets, Natural Questions and HotpotQA, showing significant gains in recall and end-to-end question-answering performance.

6. Different settings of LongRAG were compared on the NQ and HotpotQA datasets; the results indicate that the most suitable context length to feed the reader is around 30K tokens. Semantic integrity also matters when comparing different retrieval unit selections, highlighting the advantage of longer, more complete retrieval units.

7. LongRAG was compared against several groups of strong previous models as baselines and achieved high exact match rates on both the NQ and HotpotQA datasets, performing on par with the strongest fully trained RAG models.

8. The study highlights the limitations of the proposed LongRAG framework, including the need for stronger long embedding models, a reader that supports long input, and a more general grouping method beyond hyperlinks in the Wikipedia corpus.

9. The study compared the performance of different readers within the LongRAG framework, demonstrating that GPT-4o is the most effective long reader within the framework due to its superior ability to process and comprehend lengthy contexts.
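The retrieve-then-read flow described in points 4 and 9 can be sketched as below. This is a minimal illustration, not the paper's implementation: the dot-product scoring, the placeholder `reader` callable, and the prompt template are all assumptions standing in for the actual embedding model and long-context LLM.

```python
from typing import Callable, List

def retrieve_long_units(
    query_vec: List[float],
    unit_vecs: List[List[float]],  # one embedding per ~4K-token retrieval unit
    units: List[str],
    k: int = 4,                    # the paper uses the top 4-8 units, without re-ranking
) -> List[str]:
    """Score every long retrieval unit against the query and keep the top-k."""
    def dot(a: List[float], b: List[float]) -> float:
        return sum(x * y for x, y in zip(a, b))
    ranked = sorted(range(len(units)),
                    key=lambda i: dot(query_vec, unit_vecs[i]),
                    reverse=True)
    return [units[i] for i in ranked[:k]]

def long_rag_answer(
    query: str,
    query_vec: List[float],
    unit_vecs: List[List[float]],
    units: List[str],
    reader: Callable[[str], str],  # stand-in for a long-context LLM call
    k: int = 4,
) -> str:
    """Concatenate the top-k units (~30K tokens) and let the long reader extract the answer."""
    context = "\n\n".join(retrieve_long_units(query_vec, unit_vecs, units, k))
    prompt = (f"Answer based on the context.\n\nContext:\n{context}\n\n"
              f"Question: {query}\nAnswer:")
    return reader(prompt)
```

Because the heavy lifting is done by retrieval granularity rather than re-ranking, the reader sees a small number of long, semantically complete units instead of many fragments.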

Summary

The paper proposes a novel LongRAG framework aimed at addressing the imbalance between the retriever and reader components. In contrast to the traditional RAG framework, LongRAG introduces a "long retriever" and a "long reader" and processes the entire Wikipedia into 4K-token units, which is 30 times longer than before. By increasing the unit size, the total units are significantly reduced from 22M to 600K, resulting in a remarkable improvement in the retrieval score: answer recall@1=71% on NQ and HotpotQA. The top-k retrieved units (approx. 30K tokens) are fed to an existing long-context LLM to perform zero-shot answer extraction, achieving an EM of 62.7% on NQ and 64.3% on HotpotQA. The study also offers insights into the future roadmap for combining RAG with long-context LLMs.
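The Answer Recall metric mentioned above can be approximated as a string-containment check: did any gold answer appear in the retrieved text? The normalization below is a simplified sketch under that assumption, not necessarily the paper's exact matching procedure.

```python
import re
from typing import List

def _normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace for lenient matching."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def answer_recall(retrieved_units: List[str], gold_answers: List[str]) -> bool:
    """True if any gold answer string occurs in the concatenated retrieved units."""
    blob = _normalize(" ".join(retrieved_units))
    return any(_normalize(ans) in blob for ans in gold_answers)
```

With long 4K-token units, a single retrieved unit is far more likely to contain the answer span, which is why answer recall@1 improves so sharply over short-unit retrieval.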

The LongRAG framework addresses the previous imbalance in traditional RAG by significantly reducing the total corpus size, leading to improved retrieval scores. The paper includes detailed experiments and comparisons that show the effectiveness of LongRAG, along with insights into the methodology's performance and challenges. Additionally, the study explores potential future improvements, such as the need for stronger long embedding models and more general grouping methods to enhance the framework's efficiency and capabilities.
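The grouping step that turns individual documents into long retrieval units can be sketched as a greedy merge under a token budget. The hyperlink-based neighbor map and the whitespace token counter below are loose assumptions for illustration; the paper's actual grouping uses Wikipedia hyperlinks and a real tokenizer.

```python
from typing import Dict, List

def group_documents(
    docs: Dict[str, str],        # doc_id -> document text
    links: Dict[str, List[str]], # doc_id -> related doc_ids (e.g. via hyperlinks)
    budget: int = 4096,          # approximate token budget per retrieval unit
) -> List[str]:
    """Greedily merge each document with linked neighbors until the budget is reached."""
    def n_tokens(text: str) -> int:
        return len(text.split())  # crude whitespace proxy for a tokenizer (assumption)

    units: List[str] = []
    used: set = set()
    for doc_id, text in docs.items():
        if doc_id in used:
            continue
        unit, size = [text], n_tokens(text)
        used.add(doc_id)
        for nbr in links.get(doc_id, []):
            if nbr in used or nbr not in docs:
                continue
            extra = n_tokens(docs[nbr])
            if size + extra > budget:
                break
            unit.append(docs[nbr])
            size += extra
            used.add(nbr)
        units.append("\n\n".join(unit))
    return units
```

Grouping related documents into one unit is what preserves the "semantic integrity" the ablations highlight: the evidence for a multi-hop question often lands in a single retrieval unit.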

Overall, the paper offers researchers a detailed account of the LongRAG framework and its implications for open-domain question answering, demonstrating its potential to rebalance the retriever and reader components and pointing toward future directions for integrating RAG with long-context LLMs.

Reference: https://arxiv.org/abs/2406.153...