Key Points
- RAGEval is introduced as a framework for generating scenario-specific datasets that evaluate the knowledge usage ability of Large Language Models (LLMs) on data from vertical domains such as finance, healthcare, and law.
- Existing retrieval-augmented generation (RAG) benchmarks focus on evaluating factual correctness in open-domain QA tasks, which may not be suitable for assessing RAG models in vertical domains.
- RAGEval proposes a novel framework that summarizes a schema from seed documents, applies configurations to generate diverse documents, and constructs question-answer pairs from the generated documents and configurations, giving a better basis for evaluating the knowledge usage ability of LLMs across scenarios.
- RAGEval introduces three novel metrics, Completeness, Hallucination, and Irrelevance, to evaluate LLM-generated responses and to mitigate confusion about the source of the knowledge used to answer questions.
- The paper discusses the limitations of traditional RAG evaluation, highlights the need for a universal framework to generate scenario-specific evaluation datasets, and introduces the DRAGONBall dataset, which encompasses a wide array of texts and related RAG questions from the finance, law, and medical domains in Chinese and English.
- The paper emphasizes the importance of a closed-domain RAG evaluation dataset, given the prohibitive cost of collecting and annotating vertical-domain documents and the need for comprehensive evaluation.
- RAGEval employs a hybrid approach combining rule-based and LLM-based methods to generate diverse document configurations and focuses on generating virtual texts with rich factual information, logical coherence, and internal consistency.
- The paper introduces a comprehensive evaluation framework whose metrics cover both the retrieval phase (Recall, Expected Information Retrieval (EIR), Signal-to-Noise Ratio (SNR)) and the generation phase (Completeness, Hallucination, Irrelevance); a sketch of how the generation metrics can be scored is given after this list.
- The study presents experimental results for a range of retrieval and generation models, showing that RAGEval addresses the limitations of existing RAG benchmarks and that open-source models have the potential to close the performance gap with further advancements.
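
The generation-side metrics lend themselves to a simple score once each ground-truth key point has been judged against a model's response. The sketch below is a hypothetical reading of how Completeness, Hallucination, and Irrelevance could be computed as the fractions of key points that are covered, contradicted, or left unaddressed; the `KeyPointJudgment` structure and the exclusive three-way labeling are assumptions for illustration, not the authors' exact implementation.

```python
from dataclasses import dataclass

@dataclass
class KeyPointJudgment:
    """Judgment of one ground-truth key point against a model response.
    Assumption: exactly one of the three flags is true for each key point."""
    covered: bool        # response correctly states this key point
    contradicted: bool   # response asserts something that conflicts with it
    omitted: bool        # response neither covers nor contradicts it

def generation_scores(judgments: list[KeyPointJudgment]) -> dict[str, float]:
    """Completeness / Hallucination / Irrelevance as fractions of ground-truth
    key points. A plausible reading of the RAGEval metrics, not the paper's code."""
    n = len(judgments)
    if n == 0:
        return {"completeness": 0.0, "hallucination": 0.0, "irrelevance": 0.0}
    return {
        "completeness": sum(j.covered for j in judgments) / n,
        "hallucination": sum(j.contradicted for j in judgments) / n,
        "irrelevance": sum(j.omitted for j in judgments) / n,
    }

# Example: 3 of 4 key points covered, 1 contradicted.
print(generation_scores([
    KeyPointJudgment(True, False, False),
    KeyPointJudgment(True, False, False),
    KeyPointJudgment(True, False, False),
    KeyPointJudgment(False, True, False),
]))  # {'completeness': 0.75, 'hallucination': 0.25, 'irrelevance': 0.0}
```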
Summary
This paper introduces RAGEval, a framework designed to automatically generate evaluation datasets for assessing the knowledge usage ability of Large Language Models (LLMs) in different scenarios. Existing benchmarks mainly focus on general knowledge and cannot effectively evaluate how Retrieval-Augmented Generation (RAG) systems handle data from vertical domains such as finance, healthcare, and law.
RAGEval addresses this limitation by automating the process of dataset generation. The framework first summarizes a schema from a small set of seed documents to capture the essential domain-specific knowledge. It then generates diverse configurations based on this schema and uses them to produce corresponding documents.
Finally, it constructs question-answer-reference (QAR) triples from the generated documents and configurations.
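
To make the pipeline concrete, here is a minimal sketch of the four stages as described above. All function names and prompts are illustrative placeholders, and `call_llm` stands in for whatever chat-completion client is used; this is an assumed outline, not the authors' released code.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for an LLM client; plug in your own implementation."""
    raise NotImplementedError("connect an LLM client here")

def summarize_schema(seed_documents: list[str]) -> str:
    """Step 1: derive a domain schema (entities, attributes, relations) from a few seed docs."""
    return call_llm("Summarize a reusable schema from these documents:\n" + "\n---\n".join(seed_documents))

def generate_configuration(schema: str) -> str:
    """Step 2: instantiate the schema with concrete, fictitious values (names, dates, figures)."""
    return call_llm(f"Fill this schema with internally consistent, fictitious values:\n{schema}")

def generate_document(configuration: str) -> str:
    """Step 3: write a coherent scenario document grounded in the configuration."""
    return call_llm(f"Write a domain document consistent with this configuration:\n{configuration}")

def generate_qar(document: str, configuration: str) -> str:
    """Step 4: construct question-answer-reference triples tied to facts in the configuration."""
    return call_llm(
        "Generate question-answer-reference triples answerable only from this document:\n"
        f"{document}\n\nConfiguration:\n{configuration}"
    )
```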
Reference: https://arxiv.org/abs/2408.01262