Key Points

1. Retrieval-Augmented Language Models (RALMs) have become a critical enhancement for Large Language Models (LLMs): by integrating external knowledge during inference, they mitigate the factual hallucinations inherent in LLMs.

2. Despite the advancements, challenges persist in the implementation of RALMs, particularly concerning their reliability and traceability. Irrelevant document retrieval can lead to unhelpful response generation or even deteriorate the performance of LLMs, while the lack of proper citations in generated outputs complicates efforts to verify the trustworthiness of the models.

3. The authors propose SELF-REASONING, a novel framework that improves the reliability and traceability of RALMs by leveraging reasoning trajectories generated by the LLM itself.

4. The SELF-REASONING framework involves three key processes: a Relevance-Aware Process, an Evidence-Aware Selective Process, and a Trajectory Analysis Process (a minimal sketch of the full pipeline follows this list).

5. The Relevance-Aware Process instructs the LLM to judge the relevance of the retrieved documents to the given question.

6. The Evidence-Aware Selective Process directs the LLM to select and cite relevant documents, automatically extracting snippets of key sentences as evidence.

7. The Trajectory Analysis Process requires the LLM to generate a concise analysis based on the self-reasoning trajectories from the previous processes and provide the final inferred answer.

8. The authors evaluate the SELF-REASONING framework on four public datasets (two short-form QA, one long-form QA, and one fact verification) and demonstrate that it outperforms existing state-of-the-art models while using only 2,000 training samples.

9. The SELF-REASONING framework improves the robustness of RALMs without the need for external models or tools, and enhances the interpretability and traceability of the generated outputs.
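
To make the three processes concrete, below is a minimal sketch of how the pipeline could be orchestrated. The prompt wording, the `generate` stub, and the `Trajectory` fields are illustrative assumptions, not the paper's actual prompts or implementation.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for an LLM call; plug in any chat/completion client.
def generate(prompt: str) -> str:
    raise NotImplementedError("connect an LLM client here")

@dataclass
class Document:
    doc_id: int
    text: str

@dataclass
class Trajectory:
    relevant: bool = False
    reason: str = ""
    evidence: list[str] = field(default_factory=list)  # cited key sentences
    analysis: str = ""
    answer: str = ""

def self_reasoning(question: str, docs: list[Document]) -> Trajectory:
    traj = Trajectory()
    doc_block = "\n".join(f"[{d.doc_id}] {d.text}" for d in docs)

    # 1) Relevance-Aware Process: judge whether the retrieved documents are
    #    relevant to the question, and keep the stated reason.
    rel_out = generate(
        f"Question: {question}\n{doc_block}\n"
        "Are these documents relevant to the question? Answer yes/no, then explain."
    )
    traj.relevant = rel_out.strip().lower().startswith("yes")
    traj.reason = rel_out

    # 2) Evidence-Aware Selective Process: select documents and quote key
    #    sentences as citable evidence (skipped if nothing is relevant).
    if traj.relevant:
        ev_out = generate(
            f"Question: {question}\n{doc_block}\n"
            "Quote the key sentences that support an answer, "
            "prefixing each with its document id."
        )
        traj.evidence = [line for line in ev_out.splitlines() if line.strip()]

    # 3) Trajectory Analysis Process: condense the trajectory so far into a
    #    concise analysis and a final answer.
    traj.analysis = generate(
        f"Question: {question}\n"
        f"Relevance judgment: {traj.reason}\n"
        f"Evidence: {traj.evidence}\n"
        "Write a short analysis, then the final answer on the last line."
    )
    traj.answer = traj.analysis.splitlines()[-1] if traj.analysis else ""
    return traj
```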

Summary

The paper proposes a novel self-reasoning framework aimed at improving the reliability and traceability of Retrieval-Augmented Language Models (RALMs). The authors highlight two limitations of current RALMs: irrelevant retrieved documents can degrade model performance, and the absence of explicit citations in generated outputs makes it difficult to verify the models' trustworthiness. To address these limitations, the proposed framework leverages reasoning trajectories generated by the LLM itself through three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process. The authors evaluate their framework across four public datasets, demonstrating that it outperforms existing state-of-the-art models and achieves performance comparable to GPT-4 while using only 2,000 training samples. An ablation study analyzes the individual contribution of each process, showing that the relevance-aware, evidence-aware selective, and trajectory analysis processes all matter for the performance of RALMs. Additionally, the paper examines the effectiveness of the gradual learning method and the quality control applied to training-data generation, providing detailed results on the impact of each component.
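
The gradual learning method is described here only at a high level. As a hedged illustration, stage-wise training could truncate each self-reasoning trajectory so the model first learns the relevance judgment, then evidence selection, then the full analysis and answer; the field names below mirror the `Trajectory` sketch above and are assumptions, not the paper's actual training format.

```python
# Illustrative staging of training targets for gradual learning: stage 1
# keeps only the relevance judgment, stage 2 adds the cited evidence, and
# stage 3 adds the trajectory analysis and final answer.
def stage_target(traj: dict, stage: int) -> str:
    parts = [f"Relevance judgment: {traj['reason']}"]
    if stage >= 2:
        parts.append("Evidence:\n" + "\n".join(traj["evidence"]))
    if stage >= 3:
        parts.append(f"Analysis: {traj['analysis']}\nAnswer: {traj['answer']}")
    return "\n\n".join(parts)
```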

Evaluation and Analysis
The paper presents an extensive experimental evaluation of the proposed framework on the four public datasets, demonstrating its superiority over existing state-of-the-art models. The authors also illustrate the framework's robustness to noisy retrieved documents and provide a thorough analysis of citation quality through human evaluation, whose results align well with those obtained from automatic evaluation.
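
The paper's concrete citation metrics are not reproduced here; as a generic sketch of how citation quality is often scored automatically, a natural language inference (NLI) model can check whether the documents a statement cites actually entail that statement. The `entails` stub and the data layout below are assumptions for illustration.

```python
# Generic sketch of entailment-based citation recall: the fraction of
# generated statements whose cited documents, taken together, entail them.
def entails(premise: str, hypothesis: str) -> bool:
    raise NotImplementedError("plug in an NLI model here")

def citation_recall(statements: list[tuple[str, list[int]]],
                    corpus: dict[int, str]) -> float:
    """statements: (statement text, ids of documents it cites);
    corpus: document id -> document text."""
    supported = 0
    for text, cited_ids in statements:
        premise = " ".join(corpus[i] for i in cited_ids)
        if cited_ids and entails(premise, text):
            supported += 1
    return supported / len(statements) if statements else 0.0
```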

In conclusion, the paper presents a comprehensive and detailed self-reasoning framework for improving the reliability and traceability of RALMs, with extensive experimental results and insights into its effectiveness and robustness. The authors note that applying the framework to more challenging reasoning tasks remains open for future work.

Reference: https://arxiv.org/abs/2407.19813