Key Points

1. The paper examines how language models utilize external knowledge from retrieval-augmented generation (RAG) versus their own parametric information for factual queries.

2. The authors use causal mediation analysis to show that language models (LLaMa and Phi) minimally rely on their parametric memory when RAG context is available, in contrast to a vanilla setting without RAG.

3. The paper finds that the last token residual stream in the language models derives more enriched information from the attribute token in the RAG context than from the subject token in the original query.

4. This suggests language models exhibit a "shortcut" behavior, prioritizing the external RAG context over internal parametric knowledge when answering factual queries.

5. The authors validate these findings using attention contribution and attention knockout techniques, which further confirm the reduced reliance on subject token information when RAG context is present.

6. This work provides a novel mechanistic understanding of how language models leverage RAG to complement their parametric knowledge for factual reasoning.

7. The authors compare the differences in parametric knowledge utilization between the LLaMa and Phi language model families, highlighting the generalizability of their observations.

8. The study is limited to short RAG contexts due to the computational overhead of causal tracing on longer inputs, and future work will explore the impact of longer contexts.

9. The authors also suggest extending this analysis to instruction-tuned models and those fine-tuned on objectives like RLHF, as well as examining the sensitivity to retriever/ranker quality in practical settings.

Summary

This research paper investigates the mechanisms by which retrieval-augmented generation (RAG) leads models to rely on the provided external context rather than their own parametric memory when generating responses. The authors use causal mediation analysis and attention contribution analysis to uncover this "shortcut" behavior in large language models such as LLaMa and Phi.

Rationale for the Study

The paper starts by noting that RAG has become popular in practical natural language systems because integrating external context can significantly improve the performance of language model applications. However, how this approach works internally is not well understood. The authors aim to analyze and interpret the model's dependency on parametric knowledge versus the retrieved information presented via RAG.

Causal Tracing Findings

Using causal tracing, the authors find that parametric knowledge stored in the model's multilayer perceptrons (MLPs) is minimally used in the presence of retrieved context. Specifically, they observe a 5-fold decrease in the average indirect effect (AIE) of the last subject token in the query when RAG context is available, compared to the vanilla setting without RAG. This indicates that the subject tokens within the query do not elicit much parametric memory when external context is provided.
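The AIE computation behind this finding can be sketched in a few lines. This is a minimal, illustrative example only: the probabilities are invented toy numbers, and in a real causal-tracing run they would come from forward passes of LLaMa or Phi with corrupted hidden states and per-layer restoration.

```python
# Illustrative sketch of the average indirect effect (AIE) used in causal
# tracing. All probabilities below are toy numbers, not results from the paper.

def indirect_effect(p_corrupted: float, p_restored: float) -> float:
    """IE of restoring one hidden state: how much of the clean prediction's
    probability returns when that state is patched back in."""
    return p_restored - p_corrupted

def average_indirect_effect(runs):
    """AIE over a set of (p_corrupted, p_restored) pairs, one per prompt."""
    effects = [indirect_effect(pc, pr) for pc, pr in runs]
    return sum(effects) / len(effects)

# Toy example: restoring the last subject token's MLP output at some layer.
vanilla_runs = [(0.05, 0.60), (0.10, 0.55), (0.08, 0.62)]  # no RAG context
rag_runs     = [(0.05, 0.15), (0.10, 0.18), (0.08, 0.16)]  # RAG context present

aie_vanilla = average_indirect_effect(vanilla_runs)
aie_rag = average_indirect_effect(rag_runs)
print(f"AIE (vanilla):  {aie_vanilla:.3f}")
print(f"AIE (with RAG): {aie_rag:.3f}")  # much smaller, mirroring the ~5-fold drop
```

A low AIE at the subject token with RAG present is exactly the signature reported above: patching the subject token's hidden state back in barely changes the output, so the model is not routing its answer through parametric memory.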

Attention Contribution Analysis

The authors also analyze the attention contribution from the subject token in the query to the last token position. They find that this contribution decreases significantly (by roughly 2-7 times) when RAG context is available, compared to the vanilla setting. In contrast, the attention contribution from the attribute token in the RAG context is 2-5 times higher than that of the subject token.

Further, the authors use attention knockout experiments to show that blocking the attention from the subject token to the last token has a minimal effect (less than a 5% probability drop) on the predicted output when RAG context is present. Knocking out the attribute token, however, leads to a much larger drop (20-25%) in the probability of the originally predicted token.
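The knockout mechanics can be sketched on a single toy attention head. This is not the paper's implementation: real experiments patch attention masks inside LLaMa/Phi, whereas here the score matrix and token positions are invented purely to show how masking a score to negative infinity before the softmax cuts one information pathway.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def knockout(scores, src: int, dst: int):
    """Block the src -> dst attention edge by masking that pre-softmax
    score to -inf, as in attention-knockout analyses."""
    blocked = scores.copy()
    blocked[dst, src] = -np.inf
    return softmax(blocked)

# Toy pre-softmax scores: rows = query positions, cols = key positions.
# Positions (illustrative): 0 = subject token, 1 = attribute token, 2 = last token.
scores = np.array([
    [2.0, 0.5, 0.1],
    [0.3, 2.5, 0.2],
    [0.4, 3.0, 1.0],  # the last token attends mostly to the attribute token
])

baseline = softmax(scores)
no_subject = knockout(scores, src=0, dst=2)    # cut subject -> last token
no_attribute = knockout(scores, src=1, dst=2)  # cut attribute -> last token

print("last-token attention (baseline):", baseline[2].round(3))
print("after subject knockout:         ", no_subject[2].round(3))
print("after attribute knockout:       ", no_attribute[2].round(3))
```

In this toy setup, removing the subject edge barely changes the last token's attention distribution, while removing the attribute edge redistributes most of its mass, which is the qualitative pattern the knockout experiments report.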

Main Takeaway

The main takeaway is that in the presence of retrieved RAG context, language models rely primarily on the context information and minimize the use of their parametric memory to answer factual queries. This "shortcut" behavior is observed consistently across the LLaMa and Phi language models.

Discussion and Future Work

The authors discuss the limitations of the study, such as the computational overhead of handling long RAG contexts, and suggest future work to explore the impact of subject token and attribute token positions, as well as the analysis of instruction-tuned and RLHF-finetuned models. Additionally, they note the need to examine the performance of RAG models with noisy, real-world retrieved outputs, in contrast to the well-controlled synthetic context used in this study.

Reference: https://arxiv.org/abs/2406.12824