Key Points

1. The article compares Retrieval Augmented Generation (RAG) and long-context (LC) Large Language Models (LLMs), focusing on the trade-off between performance and computational cost and on how to combine the strengths of the two approaches.

2. RAG acts as a prior, regulating the attention of LLMs onto retrieved segments, while large-scale pretraining may enable LLMs to develop stronger long-context capabilities.

3. RAG efficiently processes lengthy contexts by retrieving information relevant to the query and prompting an LLM to generate a response conditioned on that retrieved context, at a significantly lower cost than LC (see the pipeline sketch after this list).

4. However, recent LLMs like Gemini 1.5 and GPT-4 have demonstrated exceptional capabilities in understanding long contexts directly, prompting the need for a systematic comparison between RAG and LC LLMs.

5. The study benchmarks RAG and LC across various public datasets using three of the latest LLMs and reveals that, when sufficiently resourced, LC consistently outperforms RAG in terms of average performance. However, RAG's significantly lower cost remains a distinct advantage.

6. Observations show that the majority of queries can be answered efficiently by RAG, with only a small subset requiring the more expensive long-context prediction step, leading to substantial cost reductions without sacrificing overall performance.

7. Failure analysis reveals that RAG lags behind LC for several reasons: queries that require multi-step reasoning, general queries, complex queries, and implicit queries. Addressing these failure modes could improve RAG performance.

8. A simple yet effective method called SELF-ROUTE is proposed: it dynamically routes each query to RAG or LC based on model self-reflection, combining the strengths of both approaches and achieving performance comparable to LC at a significantly reduced cost.

9. The study provides valuable insights for the practical application of long-context LLMs, highlighting the benefits of both RAG and LC and paving the way for future research in optimizing RAG techniques.
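
The retrieve-then-read pipeline described in point 3 can be summarized in a few lines. The following is a minimal sketch, not the paper's implementation: the bag-of-words retriever and the `llm_generate` callable are illustrative stand-ins for a real retriever and a real LLM API.

```python
# A minimal sketch of the retrieve-then-read RAG pipeline from point 3.
# The retriever here is a toy bag-of-words cosine similarity; llm_generate
# is a hypothetical stand-in for any LLM completion call.
import math
from collections import Counter

def cosine_sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = Counter(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: cosine_sim(q, Counter(c.lower().split())),
                    reverse=True)
    return scored[:k]

def rag_answer(query: str, document: str, llm_generate,
               chunk_size: int = 300) -> str:
    """Split the document, retrieve top chunks, and prompt the LLM with only those."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    context = "\n\n".join(retrieve(query, chunks))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # The LLM sees only the retrieved chunks, far fewer input tokens
    # than passing the full document as in LC.
    return llm_generate(prompt)
```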

Summary

This paper presents a comprehensive comparison of Retrieval Augmented Generation (RAG) and long-context (LC) Large Language Models (LLMs) in processing extensive contexts. The key findings are:

1. Overall Performance: When sufficiently resourced, LC consistently outperforms RAG across various public datasets. The performance gap is more significant for the latest LLMs such as Gemini-1.5 and GPT-4, highlighting the exceptional long-context understanding capabilities of these models.

2. Computational Cost: Despite its suboptimal performance, RAG remains a viable option due to its significantly lower computational cost. RAG reduces the input length to LLMs, and API costs typically scale with the number of input tokens.

3. Prediction Overlap: Surprisingly, the predictions from LC and RAG are identical for over 60% of the queries. For these queries, RAG can reduce cost without sacrificing performance.
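
To make findings 2 and 3 concrete, here is a small sketch of both measurements. The `normalize()` rule for comparing answers and the token counts are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Back-of-envelope versions of the two observations above: the fraction of
# queries where RAG and LC predictions agree, and RAG's input-token cost
# relative to LC's (assuming pricing scales with input tokens).
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace before comparing predictions."""
    return " ".join(answer.lower().split())

def overlap_rate(rag_preds: list[str], lc_preds: list[str]) -> float:
    """Fraction of queries where RAG and LC give the same normalized answer."""
    same = sum(normalize(r) == normalize(l)
               for r, l in zip(rag_preds, lc_preds))
    return same / len(rag_preds)

def token_cost_ratio(retrieved_tokens: int, full_context_tokens: int) -> float:
    """RAG cost as a fraction of LC cost under token-based pricing."""
    return retrieved_tokens / full_context_tokens

# e.g. 5 chunks of 300 tokens vs. a 100k-token document:
# RAG uses ~1.5% of LC's input tokens.
print(token_cost_ratio(5 * 300, 100_000))  # 0.015
```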

Proposed Hybrid Approach: SELF-ROUTE

Based on this observation, the paper proposes SELF-ROUTE, a simple yet effective method that routes queries to RAG or LC based on model self-reflection. SELF-ROUTE significantly reduces the computational cost (e.g., a 65% reduction for Gemini-1.5) while maintaining performance comparable to LC.
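
A minimal sketch of this two-step procedure follows, reusing the hypothetical `retrieve()` and `llm_generate()` helpers from the earlier snippet. The routing prompt paraphrases the paper's idea of letting the model declare a query unanswerable from the retrieved chunks; it is not the paper's verbatim prompt.

```python
# A sketch of the SELF-ROUTE two-step procedure. Step 1 (RAG-and-Route)
# answers from retrieved chunks but lets the model decline; step 2 falls
# back to the expensive long-context call only for declined queries.
ROUTE_PROMPT = (
    "Write 'unanswerable' if the question cannot be answered based on the "
    "provided text.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
)

def self_route(query: str, document: str, llm_generate,
               chunk_size: int = 300) -> str:
    # Step 1: RAG-and-Route over the retrieved chunks.
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    context = "\n\n".join(retrieve(query, chunks))
    answer = llm_generate(ROUTE_PROMPT.format(context=context, query=query))
    if "unanswerable" not in answer.lower():
        return answer  # the cheap RAG path handles the majority of queries
    # Step 2: long-context prediction with the full document.
    return llm_generate(f"Context:\n{document}\n\nQuestion: {query}\nAnswer:")
```

Because most queries are resolved in step 1 (consistent with point 6 above), the full-context call in step 2 is only paid for the small subset of queries the model declines, which is where the reported cost reduction comes from.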

Analysis of RAG Failure Patterns

The paper also provides a comprehensive analysis of the common failure patterns of RAG, such as its limitations in multi-step reasoning, handling general queries, and understanding implicit information in the context. These insights can inspire future improvements of RAG techniques.

Conclusion and Contributions

Overall, this work presents a valuable guideline for leveraging the strengths of both RAG and LC in long-context applications of LLMs, highlighting the trade-offs between performance and computational cost, and introducing an effective hybrid approach to combine their advantages.

Reference: https://arxiv.org/abs/2407.16833