Summary

This scientific article proposes FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a novel evaluation framework for the comprehensive, end-to-end evaluation of Large Language Models (LLMs) in Retrieval-Augmented Generation (RAG) scenarios. The FRAMES dataset consists of 824 test samples that probe an LLM's ability to retrieve and reason across multiple documents within a single, unified evaluation. It includes challenging multi-hop questions that require integrating information from multiple sources. The article presents baseline results showing that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval.

However, accuracy improves significantly with the proposed multi-step retrieval pipeline, reaching 0.66 (a relative improvement of more than 50%). The authors hope this work will help bridge evaluation gaps and support the development of more robust and capable RAG systems. The article emphasizes the importance of evaluating RAG systems comprehensively and addresses the limitations of existing benchmarks. It compares FRAMES with existing datasets and highlights its integrated evaluation, which challenges models on factuality, retrieval, and reasoning simultaneously. The paper details the data collection process, including synthetic data generation attempts and human annotation, to ensure the dataset is both reliable and challenging. Human annotators were asked to create questions requiring information from multiple Wikipedia articles, following a structure similar to the synthetic prompts but with greater reliability and accuracy.
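
To make the idea of a multi-step retrieval pipeline concrete, here is a minimal sketch of an iterative retrieve-then-reason loop in the spirit of what the paper describes. The `Retriever` and `LLM` interfaces, the prompt wording, and the step/document counts are all hypothetical placeholders, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Protocol


class Retriever(Protocol):
    # Hypothetical retrieval interface: returns the top-k documents for a query.
    def search(self, query: str, k: int) -> list[str]: ...


class LLM(Protocol):
    # Hypothetical language-model interface: returns a completion for a prompt.
    def generate(self, prompt: str) -> str: ...


@dataclass
class MultiStepRAG:
    retriever: Retriever
    llm: LLM
    max_steps: int = 4       # number of retrieve-then-reason iterations (assumed)
    docs_per_step: int = 4   # documents fetched per iteration (assumed)

    def answer(self, question: str) -> str:
        context: list[str] = []
        query = question
        for _ in range(self.max_steps):
            # Retrieve documents for the current query and accumulate them.
            context.extend(self.retriever.search(query, k=self.docs_per_step))
            # Ask the model either to declare the context sufficient or to
            # propose the next search query that would close the remaining gap.
            query = self.llm.generate(
                "Question: " + question + "\n"
                "Context so far:\n" + "\n".join(context) + "\n"
                "If the context is sufficient to answer, reply DONE; "
                "otherwise write the next search query."
            ).strip()
            if query.upper() == "DONE":
                break
        # Final answer conditioned on everything retrieved across all steps.
        return self.llm.generate(
            "Answer the question using only the context below.\n"
            "Context:\n" + "\n".join(context) + "\n"
            "Question: " + question
        )
```

The loop captures why multi-step retrieval helps on multi-hop questions: each intermediate reasoning step can surface the next document that a single query over the original question would have missed.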

Furthermore, the article describes the quality checks implemented to ensure the dataset is of high quality and effective for evaluating RAG capabilities. Several experiments are outlined, including single-step and multi-step evaluations, which highlight the impact of retrieval on performance and the gains achievable with multi-step retrieval and reasoning strategies. The article also discusses future work and potential limitations of the FRAMES dataset, addressing concerns such as pretraining data contamination, a potential lack of diversity, and the need to explore more sophisticated retrieval and reasoning capabilities.
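
The contrast between the two evaluation settings can be sketched as follows. The single-step baseline issues one retrieval call with the original question, whereas the multi-step pipeline above iterates. The exact-match-style scoring and the sample format (question/answer pairs) are assumptions for illustration, not the paper's official evaluation protocol; the `Retriever` and `LLM` interfaces are reused from the earlier sketch.

```python
def single_step_answer(question: str, retriever: Retriever, llm: LLM, k: int = 4) -> str:
    # Single-step setting: one retrieval call with the original question only.
    docs = retriever.search(question, k=k)
    return llm.generate(
        "Answer the question using only the context below.\n"
        "Context:\n" + "\n".join(docs) + "\n"
        "Question: " + question
    )


def accuracy(samples: list[dict], predict) -> float:
    # `samples` is assumed to be a list of {"question": ..., "answer": ...} dicts;
    # `predict` maps a question string to a model answer string.
    hits = sum(
        1 for s in samples
        if s["answer"].strip().lower() in predict(s["question"]).strip().lower()
    )
    return hits / len(samples)
```

Running `accuracy` once with `single_step_answer` and once with `MultiStepRAG(...).answer` reproduces, in miniature, the kind of single-step versus multi-step comparison the experiments report.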

In summary, the article introduces FRAMES, a novel evaluation framework designed to comprehensively evaluate Retrieval-Augmented Generation systems, and highlights both the limitations of current state-of-the-art LLMs and the room for improvement. The work provides valuable insights into the challenges and opportunities for future research toward more robust and efficient RAG systems.

Reference: https://arxiv.org/abs/2409.12941