Key Points

1. LongFact, a prompt set of 2,280 fact-seeking prompts spanning 38 topics, is introduced for evaluating long-form factuality and is made publicly available.

2. SAFE uses an LLM to decompose a long-form response into individual facts and then, through a multi-step reasoning process, sends fact-checking queries to Google Search and rates whether each fact is supported by the results.

3. The paper empirically demonstrates that SAFE achieves superhuman rating performance: it agrees with crowdsourced human annotators on 72% of individual facts, is correct on 76% of a random sample of disagreement cases, and is more than 20 times cheaper than human annotators.

4. Thirteen language models across four model families (Gemini, GPT, Claude, and PaLM-2) are benchmarked on LongFact, with larger language models generally achieving better long-form factuality.

5. The paper introduces F1@K as an aggregate metric for long-form factuality, balancing the fraction of facts in a response that are supported (precision) against the number of supported facts relative to a hyperparameter K representing a user's preferred response length (recall); a small sketch of the computation follows this list.

6. The model evaluations indicate that larger language models generally exhibit better long-form factuality, and SAFE significantly outperforms human annotators while being cost-efficient.

7. The work is open-sourced, and the paper encourages further research into measuring and improving language models in long-form domains.
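
For concreteness, here is a minimal Python sketch of the F1@K aggregation described in point 5, following the paper's description of the metric. The function and argument names are illustrative; num_supported and num_not_supported count the facts in a response rated supported and not supported, and k is the user-chosen ideal number of facts.

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """F1@K for one response: precision over the rated facts, recall
    measured against K, the preferred number of supported facts."""
    if num_supported == 0:
        # A response with no supported facts scores 0.
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)  # capped once K facts are supported
    return 2 * precision * recall / (precision + recall)
```

With K = 64, one of the settings reported in the paper, a response with 40 supported and 10 not-supported facts has precision 0.8 and recall 0.625, giving F1@64 of roughly 0.70.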

Summary

The research paper proposes a method called Search-Augmented Factuality Evaluator (SAFE) to evaluate the factuality of long-form content generated by large language models (LLMs). The main contributions include the creation of a prompt set named LongFact, the development of SAFE, and the introduction of F1@K as a metric for long-form factuality. The paper benchmarks thirteen language models and demonstrates that SAFE achieves superhuman rating performance, agreeing with human annotators on 72% of individual facts while being more than 20 times cheaper than human annotation.

SAFE uses an LLM to evaluate long-form factuality autonomously: it breaks a response down into individual facts, revises each fact to be self-contained, checks its relevance to the prompt, and then applies a multi-step reasoning process that sends search queries to Google Search and judges whether the fact is supported by the results. The F1@K metric combines precision and recall for long-form factuality, where recall is measured against K, the human-preferred "ideal" number of facts in a response.
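
As a rough illustration of this pipeline, the sketch below mirrors SAFE's main stages in Python. The helper callables (call_llm, google_search), the prompt wording, and the step budget are placeholders rather than the paper's actual implementation, which is available in the open-sourced repository; the self-containment revision and relevance-filtering stages are omitted for brevity.

```python
# Illustrative sketch of SAFE's stages: split a response into facts, then for
# each fact issue Google Search queries and rate it as supported or not.
from typing import Callable


def split_into_facts(call_llm: Callable[[str], str], response: str) -> list[str]:
    """Ask an LLM to decompose a long-form response into individual facts."""
    out = call_llm(
        "List every individual fact stated in the following text, one per line:\n\n"
        + response
    )
    return [line.strip() for line in out.splitlines() if line.strip()]


def rate_fact(call_llm: Callable[[str], str],
              google_search: Callable[[str], str],
              fact: str,
              max_steps: int = 5) -> str:
    """Multi-step check of one fact: propose search queries, gather results,
    then rate the fact against the accumulated evidence."""
    evidence: list[str] = []
    for _ in range(max_steps):
        query = call_llm(
            f"Fact to verify: {fact}\n"
            f"Evidence so far: {evidence}\n"
            "Propose the next Google Search query."
        )
        evidence.append(google_search(query))
    verdict = call_llm(
        f"Fact: {fact}\nEvidence: {evidence}\n"
        "Answer 'supported' or 'not supported'."
    )
    return verdict.strip().lower()


def evaluate_response(call_llm, google_search, response: str) -> tuple[int, int]:
    """Return (supported, not_supported) fact counts for one response."""
    supported = not_supported = 0
    for fact in split_into_facts(call_llm, response):
        if rate_fact(call_llm, google_search, fact) == "supported":
            supported += 1
        else:
            not_supported += 1
    return supported, not_supported
```

The per-response counts returned by evaluate_response are what the F1@K function shown earlier aggregates into a single factuality score.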

The paper presents detailed results from benchmarking the thirteen language models and shows that larger models generally perform better in long-form factuality. LongFact, the SAFE implementation, and all experimental data are open-sourced for reproducibility and further research.

The paper discusses potential limitations of SAFE, including its reliance on Google Search as the knowledge source and the assumption that a response contains no repeated facts. The authors also analyze the causes of SAFE's rating errors and suggest that using a language model with stronger reasoning abilities could further improve the evaluator.

Overall, the paper provides a novel approach to evaluating long-form factuality in large language models, offers a comprehensive benchmark of language models, and introduces a new metric for measuring long-form factuality. The findings demonstrate the potential of LLMs as scalable auto-raters and provide valuable insights for future research in the field of long-form factuality evaluation.

Reference: "Long-form factuality in large language models" (https://arxiv.org/abs/2403.18802)