Key Points

1. Large language models (LLMs) lack the ability to dynamically adapt to changing world knowledge, leading to factuality issues in generated text.

2. FRESHQA is introduced as a dynamic question-answering benchmark for evaluating the factuality of LLMs, encompassing diverse question types, including questions whose answers rely on fast-changing knowledge and questions built on false premises.

3. Human evaluations involving more than 50k judgments shed light on limitations of LLMs and demonstrate significant room for improvement in factuality.

4. FRESHPROMPT, a few-shot prompting method, substantially boosts LLM performance on FRESHQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt.

5. The number and order of retrieved evidence snippets play a key role in the correctness of LLM-generated answers, and instructing models to generate concise and direct answers helps reduce hallucination.

6. FRESHQA consists of 600 questions spanning diverse difficulty levels; answering many of them correctly requires models to "understand" up-to-date world knowledge.

7. LLMs struggle with fast-changing knowledge, false premises, and multi-hop questions, showing flat scaling curves (accuracy that does not improve with model size) on questions involving fast-changing knowledge.

8. FRESHPROMPT significantly improves LLM performance on FRESHQA over competing search engine-augmented approaches, with the number of retrieved evidence snippets and their order influencing correctness.

9. The evaluation protocol combines human judgments with a simple automatic metric, FRESHEVAL, which uses few-shot in-context learning to judge model responses (sketched below) and shows high inter-rater agreement and reproducibility.
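
This summary includes no code from the paper; the following is a minimal Python sketch of how a FRESHEVAL-style automatic judge could work via few-shot in-context learning. The rubric wording, the exemplars, and the caller-supplied `call_llm` function are illustrative assumptions, not the paper's released evaluation prompts.

```python
from typing import Callable

# Illustrative few-shot exemplars; the actual FRESHEVAL exemplars and
# rubric come from the paper's released prompts and differ in wording.
FEW_SHOT_EXAMPLES = """\
question: Who won the 2022 FIFA World Cup?
correct answer(s): Argentina
model response: France won the 2022 FIFA World Cup.
judgement: incorrect

question: How many moons does Mars have?
correct answer(s): two (Phobos and Deimos)
model response: Mars has two moons, Phobos and Deimos.
judgement: correct
"""


def fresh_eval_judge(question: str,
                     gold_answers: str,
                     model_response: str,
                     call_llm: Callable[[str], str]) -> bool:
    """Judge a model response via few-shot in-context learning.

    `call_llm` is any text-in/text-out LLM call supplied by the caller
    (an assumption here, e.g. a thin wrapper around an API client).
    Returns True if the judge labels the response 'correct'.
    """
    prompt = (
        "Judge whether the model response answers the question correctly, "
        "given the correct answer(s). Reply with 'correct' or 'incorrect'.\n\n"
        + FEW_SHOT_EXAMPLES
        + f"\nquestion: {question}\n"
        + f"correct answer(s): {gold_answers}\n"
        + f"model response: {model_response}\n"
        + "judgement:"
    )
    verdict = call_llm(prompt).strip().lower()
    return verdict.startswith("correct")
```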

Summary

Introduction to FRESHQA Benchmark and Proposed FRESHPROMPT Method
The paper introduces FRESHQA, a benchmark for evaluating the factuality of text generated by large language models when answering questions that test current world knowledge. The benchmark covers diverse question and answer types, including questions whose answers depend on fast-changing world knowledge and questions built on false premises. Benchmarking a range of models on FRESHQA sheds light on their limitations, in particular their difficulty with fast-changing knowledge and with debunking false premises.

A few-shot prompting method called FRESHPROMPT is proposed, which substantially boosts model performance on FRESHQA by incorporating up-to-date information retrieved from a search engine into the prompt. The number and order of the retrieved evidence snippets play a key role in the correctness of model-generated answers, and instructing models to generate concise and direct answers further reduces hallucination. The paper concludes by releasing FRESHQA for future research.
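
As a concrete illustration (not the paper's released implementation), the following minimal Python sketch assembles a FRESHPROMPT-style prompt: retrieved search results are sorted by date so the most recent evidence ends up closest to the question, and the model is instructed to answer concisely. The `Evidence` fields, the snippet formatting, and the instruction wording are assumptions made for this sketch; in practice the evidence would come from a search-engine API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Evidence:
    """One retrieved search result (assumed fields for illustration)."""
    source: str
    published: date
    title: str
    snippet: str


def build_fresh_prompt(question: str,
                       evidences: List[Evidence],
                       today: date,
                       num_evidences: int = 5) -> str:
    """Assemble a FRESHPROMPT-style prompt: retrieved evidence followed by
    the question and an instruction to answer concisely as of today.

    Evidence is sorted oldest-first and truncated to `num_evidences`, so the
    most recent snippets sit closest to the question.
    """
    recent = sorted(evidences, key=lambda e: e.published)[-num_evidences:]
    blocks = [
        f"source: {e.source}\ndate: {e.published:%b %d, %Y}\n"
        f"title: {e.title}\nsnippet: {e.snippet}"
        for e in recent
    ]
    return (
        "\n\n".join(blocks)
        + f"\n\nquestion: {question} (as of {today:%B %d, %Y})\n"
        + "Please answer the question based on the evidence above. "
          "Give a short, direct answer and do not speculate."
    )


if __name__ == "__main__":
    # Toy evidence; real snippets would come from a search-engine API.
    evidences = [
        Evidence("example.com", date(2023, 9, 1), "Older report", "..."),
        Evidence("example.org", date(2023, 10, 2), "Latest update", "..."),
    ]
    print(build_fresh_prompt("Who is the current CEO of X (formerly Twitter)?",
                             evidences, today=date(2023, 10, 5)))
```

Limiting `num_evidences` and keeping the most recent snippet adjacent to the question reflect the paper's finding that both the number and the order of retrieved evidence affect answer correctness.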

Model Performance Evaluation and Insight for Future Research
The study provides a detailed evaluation of model performance on FRESHQA, shedding light on the models' limitations, and proposes FRESHPROMPT, a few-shot prompting method that improves performance. The findings demonstrate the impact of the number and order of retrieved evidence snippets, as well as the effectiveness of instructing models to produce concise and direct answers in reducing hallucination, and thus offer valuable insights for future research.

FRESHQA Benchmark Overview and Proposed FRESHPROMPT Method
The paper introduces the FRESHQA benchmark, which evaluates the factuality of large language model-generated text on questions that test current world knowledge. It identifies limitations of large language models in handling questions involving fast-changing knowledge and false premises, and proposes the FRESHPROMPT method to improve their performance on FRESHQA. It also examines how the number and order of retrieved evidence snippets, and the instruction to generate concise and direct answers, affect that performance.

Additional Findings and Release of FRESHQA Benchmark
Additionally, the paper reports that the models show some awareness of recent knowledge beyond their stated training cutoff of September 2021. It also highlights the importance of using separate prompts for the relaxed and strict evaluation modes, and of incorporating retrieved evidence for improved performance. Finally, the paper discusses the release of the FRESHQA benchmark for future research.
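
To make the relaxed/strict distinction concrete, here is a small illustrative sketch of two judge rubrics that could be swapped into an automatic evaluator such as the FRESHEVAL-style judge sketched earlier; the wording is a paraphrase for illustration, not the paper's released evaluation prompts.

```python
# Illustrative rubric paraphrases; the paper's actual RELAXED/STRICT
# evaluation prompts differ in wording.
RELAXED_RUBRIC = (
    "Mark the response 'correct' if its primary answer is accurate and "
    "up-to-date, even if other parts of the response are imperfect."
)
STRICT_RUBRIC = (
    "Mark the response 'correct' only if the primary answer is accurate and "
    "up-to-date AND every other claim in the response is also accurate, "
    "with no hallucinated or outdated statements."
)


def pick_rubric(mode: str) -> str:
    """Select the rubric to prepend to the judge prompt ('relaxed' or 'strict')."""
    return STRICT_RUBRIC if mode == "strict" else RELAXED_RUBRIC
```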

Conclusion and Future Research Implications
Overall, the paper highlights the challenges of evaluating the factuality of large language model-generated text on questions about current world knowledge, and presents FRESHQA, together with FRESHPROMPT, as a benchmark and method for assessing and improving large language models in this setting.

Reference: https://arxiv.org/abs/2310.03214