Key Points

1. Large language models (LLMs), despite their strong performance on closed-book QA and longform text generation, can still produce factually incorrect information, termed "hallucinations", especially for lesser-known facts.

2. The CoVe method involves four core steps: (a) generating a baseline response, (b) planning verifications by generating a list of fact-checking questions, (c) executing verifications by answering those questions, and (d) generating a final verified response (a minimal sketch of this loop appears after this list).

3. The CoVe method is shown to decrease hallucinations and improve correctness across list-based questions, closed-book MultiSpanQA, and longform text generation tasks.

4. Training-time correction, generation-time correction, and augmentation are the three main categories of methods used to reduce hallucination in LLMs.

5. CoVe yields substantial gains over the original language model response simply by asking the same model to deliberate on and verify its own answers.

6. The factored and 2-step CoVe variants outperform the joint approach in reducing hallucinations.

7. The model answers shortform verification questions more accurately than it states the corresponding facts in its longform generations.

8. Verification questions generated by the LLM itself are of higher quality than those produced by rule-based heuristics.

9. Open-ended verification questions outperform yes/no questions.
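The four CoVe steps in point 2 map naturally onto a simple prompting loop. The sketch below is a minimal illustration under stated assumptions, not the paper's exact implementation: the `llm` helper, the prompt wording, and the line-based question splitting are all placeholders.

```python
# Minimal sketch of the Chain-of-Verification (CoVe) loop.
# `llm(prompt)` stands in for any language model call and is assumed,
# as is the prompt wording; the paper's prompts differ.

def llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., an API request)."""
    raise NotImplementedError

def chain_of_verification(query: str) -> str:
    # 1. Generate a baseline response.
    baseline = llm(query)

    # 2. Plan verifications: ask the model for fact-checking questions
    #    about its own baseline answer.
    plan_prompt = (
        f"Question: {query}\nAnswer: {baseline}\n"
        "List verification questions that would fact-check this answer, one per line."
    )
    questions = [q for q in llm(plan_prompt).splitlines() if q.strip()]

    # 3. Execute verifications: answer each question independently,
    #    without showing the baseline answer (the factored idea).
    verifications = [(q, llm(q)) for q in questions]

    # 4. Generate the final verified response, conditioned on the query,
    #    the baseline, and the verification Q&A pairs.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in verifications)
    final_prompt = (
        f"Original question: {query}\n"
        f"Draft answer: {baseline}\n"
        f"Verification results:\n{evidence}\n"
        "Write a revised answer that is consistent with the verification results."
    )
    return llm(final_prompt)
```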

Summary

The research paper investigates factual hallucinations in large language models (LLMs) and presents the Chain-of-Verification (CoVe) method as a way to reduce hallucinated content in LLM-generated responses. It notes that even very large models still fail, particularly on lesser-known facts, producing factually incorrect generations referred to as hallucinations. To address this, CoVe generates verification questions to fact-check an initial baseline response, answers those questions independently of the baseline, and produces a revised, verified response based on the verification results. The study shows that CoVe decreases hallucinations across multiple tasks, including list-based questions, closed-book MultiSpanQA, and longform text generation.

The paper presents various experimental benchmarks to measure the efficacy of CoVe, revealing that CoVe provides substantial gains in precision on list-based tasks, improves performance on closed-book QA, and increases the FACTSCORE metric in longform generation. The study also introduces variants of CoVe, such as the factored and 2-step approaches, which outperform the joint method. Furthermore, the factor+revise approach, which explicitly cross-checks verification answers against the baseline for inconsistencies, yields further improvements in mitigating hallucinations.
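The difference between the joint and factored variants comes down to how much context each verification answer sees. The sketch below is an illustrative rendering of that contrast under assumptions: the `llm` helper and all prompt wording are placeholders rather than the paper's prompts.

```python
# Illustrative contrast between joint and factored verification execution,
# plus the extra cross-check step used by factor+revise.
# `llm(prompt)` is a placeholder for a language model call.

def llm(prompt: str) -> str:
    raise NotImplementedError  # substitute a real model call

def execute_joint(baseline: str, questions: list[str]) -> list[str]:
    # Joint: all questions are answered in one prompt that also contains the
    # baseline answer, so the model can copy its original hallucinations.
    prompt = f"Draft answer: {baseline}\n" + "\n".join(questions)
    return llm(prompt).splitlines()

def execute_factored(questions: list[str]) -> list[str]:
    # Factored: each question is answered in its own context, with no access
    # to the baseline, which helps avoid repeating earlier errors.
    return [llm(q) for q in questions]

def cross_check(baseline: str, questions: list[str], answers: list[str]) -> list[str]:
    # Factor+revise: explicitly check each verification answer against the
    # baseline and flag inconsistencies before writing the revised response.
    flags = []
    for q, a in zip(questions, answers):
        verdict = llm(
            f"Draft answer: {baseline}\nVerification: {q} -> {a}\n"
            "Is the verification consistent with the draft? Answer yes or no."
        )
        flags.append(verdict)
    return flags
```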

The paper also compares CoVe to other methods for addressing hallucination, indicating that CoVe provides larger gains than those techniques. Additionally, the research discusses the challenge of addressing different forms of hallucination and explores future directions, such as equipping CoVe with tool-use for further improvements. An important finding is the higher accuracy of shortform verification questions compared to longform queries, which supports CoVe's strategy of fact-checking individual facts. Finally, the paper acknowledges that CoVe does not fully remove hallucinations and points to combining it with external tools as a promising extension.

Reference: https://arxiv.org/abs/2309.11495