Key Points
1. The paper "Hallucination of Multimodal Large Language Models: A Survey" presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs). MLLMs have shown significant advancements in multimodal tasks but often generate outputs that are inconsistent with the visual content, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability in real-world applications.
2. Hallucination, a problem inherited from LLMs, is mainly categorized into factuality hallucination and faithfulness hallucination. Factuality hallucination refers to a discrepancy between generated content and verifiable real-world facts, while faithfulness hallucination refers to generated content diverging from user instructions or the context provided by the input. In MLLMs, object hallucination has been empirically categorized into three types: object category, object attribute, and object relation.
3. The survey comprehensively reviews the underlying causes of hallucinations in MLLMs, spanning contributing factors from data and model architecture to the training and inference stages. Because the origins of hallucination in MLLMs differ from those in text-only LLMs, the survey also provides a comprehensive overview of the benchmarks and metrics designed specifically for evaluating hallucinations in MLLMs.
4. The paper surveys a variety of methods to mitigate hallucinations in MLLMs, including scaling up input resolution, using versatile vision encoders, introducing dedicated modules, and employing reinforcement learning. It also covers data-centric approaches such as data calibration and rewriting text captions, as well as auxiliary supervision during training.
5. The research also delves into the use of dedicated modules to mitigate hallucinations, particularly through data filtering strategies and training losses designed from the perspective of the embedding-space distribution.
6. The study provides a holistic overview of MLLM performance across hallucination benchmarks, showing that benchmarks differ in their evaluation dimensions and emphases, which leads to inconsistent performance of MLLMs across benchmarks.
7. The research discusses reinforcement learning and related training objectives for mitigating hallucinations, including automatic metric-based reinforcement learning, visualization-based reinforcement learning, and selective end-of-sequence (EOS) supervision.
8. The paper also explores auxiliary supervision, i.e., additional supervision signals applied during training to strengthen the perception ability of MLLMs and thereby reduce hallucinations.
9. The study presents a detailed comparative analysis of mainstream MLLMs on generative benchmarks (open-ended generation scored with metrics such as CHAIR) and discriminative benchmarks (yes/no queries in the style of POPE), again revealing inconsistencies in performance across benchmark types; a discriminative-style scoring sketch follows this list.
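As a concrete illustration of the discriminative-style evaluation referenced in point 9, the following is a minimal Python sketch of POPE-style scoring: the model answers yes/no object-probing questions such as "Is there a <object> in the image?", and accuracy, precision, recall, F1, and the ratio of "yes" answers are reported. The function name and input format are illustrative assumptions, not the official POPE implementation.

```python
from typing import Dict, List

def pope_style_metrics(predictions: List[str], labels: List[str]) -> Dict[str, float]:
    """Score yes/no answers to object-probing questions (illustrative, not official POPE code)."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)  # object present, model says yes
    fp = sum(p == "yes" and y == "no" for p, y in pairs)   # hallucinated "yes"
    tn = sum(p == "no" and y == "no" for p, y in pairs)
    fn = sum(p == "no" and y == "yes" for p, y in pairs)   # missed object

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs) if pairs else 0.0,  # tendency to answer "yes"
    }

# Example: four probing questions with one hallucinated "yes" and one missed object.
print(pope_style_metrics(["yes", "yes", "no", "no"], ["yes", "no", "no", "yes"]))
```

The "yes" ratio is reported alongside accuracy because a model biased toward answering "yes" can look deceptively good on object-presence questions while hallucinating heavily on absent objects.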
Summary
The paper "Hallucination of Multimodal Large Language Models: A Survey" presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs). These models have demonstrated significant advancements and remarkable abilities in multimodal tasks, but they often generate outputs that are inconsistent with the visual content, which is known as hallucination. The research reviews recent advances in identifying, evaluating, and mitigating hallucinations, offering a detailed overview of the underlying causes, evaluation benchmarks, metrics, and strategies developed to address this issue.
Types of Hallucinations and Causes
The paper traces the problem of hallucination back to LLMs, where it is categorized into factuality and faithfulness types, and then focuses on object hallucination in MLLMs, which it divides into three types: category, attribute, and relation. It extensively analyzes the causes of hallucinations in MLLMs, covering contributing factors from data and model architecture to the training and inference stages. The paper also provides a comprehensive overview of benchmarks and metrics designed specifically for evaluating hallucinations in MLLMs, such as CHAIR, POPE, MME, and CIEM; a sketch of a CHAIR-style computation is given below.
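To make the generative metrics concrete, below is a minimal, hedged sketch of how CHAIR-style scores can be computed for a single image: the instance-level score is the fraction of mentioned object instances that are not in the image annotations, and the sentence-level score is the fraction of sentences containing at least one such hallucinated object. Object extraction and synonym matching are assumed to happen upstream; the names here are illustrative rather than the official CHAIR tooling.

```python
from typing import Dict, List, Set

def chair_scores(sentence_objects: List[Set[str]], gt_objects: Set[str]) -> Dict[str, float]:
    """Illustrative CHAIR-style scores for one image.

    sentence_objects: objects mentioned in each generated sentence (extracted upstream).
    gt_objects:       ground-truth objects annotated for the image.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    sentences_with_hallucination = 0

    for objs in sentence_objects:
        halluc = objs - gt_objects          # mentioned but not annotated in the image
        total_mentions += len(objs)
        hallucinated_mentions += len(halluc)
        if halluc:
            sentences_with_hallucination += 1

    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = sentences_with_hallucination / len(sentence_objects) if sentence_objects else 0.0
    return {"CHAIR_i": chair_i, "CHAIR_s": chair_s}

# Example: two sentences, one of which mentions a "dog" that is not in the image.
print(chair_scores([{"person", "bench"}, {"dog"}], {"person", "bench", "tree"}))
# -> CHAIR_i = 1/3 (one of three mentioned objects is hallucinated), CHAIR_s = 0.5
```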
Strategies to Mitigate Hallucinations
In terms of strategies to mitigate hallucination, the paper discusses several families of approaches. Data-related methods include introducing negative data, introducing counterfactual data, and reducing noise and errors in existing datasets. Model- and training-related methods include scaling up input resolution, using versatile vision encoders, adding dedicated modules, applying auxiliary supervision, and employing reinforcement-learning techniques such as automatic metric-based optimization and reinforcement learning from AI feedback; a hedged reward-shaping sketch is given below.
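As one way to picture the "automatic metric-based optimization" mentioned above, a hallucination metric can be turned into a scalar reward for reinforcement-learning fine-tuning. The sketch below is an illustrative assumption rather than the exact formulation of any surveyed method: it rewards captions whose mentioned objects are grounded in the image annotations (a 1 − CHAIR-style term) and adds a small coverage bonus so the policy is not rewarded for describing nothing.

```python
from typing import Set

def hallucination_reward(mentioned: Set[str], gt_objects: Set[str],
                         coverage_weight: float = 0.5) -> float:
    """Illustrative reward: penalize hallucinated objects, lightly reward coverage.

    mentioned:  objects extracted from the generated caption (extraction assumed upstream).
    gt_objects: objects annotated for the image.
    """
    if not mentioned:
        return 0.0
    hallucinated = mentioned - gt_objects
    # Grounded fraction, i.e. a 1 - CHAIR_i style term.
    grounded_frac = 1.0 - len(hallucinated) / len(mentioned)
    # Small coverage bonus so short, empty captions are not trivially optimal.
    coverage = len(mentioned & gt_objects) / len(gt_objects) if gt_objects else 0.0
    return grounded_frac + coverage_weight * coverage

# Example: a caption mentioning {"person", "dog"} for an image annotated {"person", "bench"}.
print(hallucination_reward({"person", "dog"}, {"person", "bench"}))  # 0.5 + 0.5 * 0.5 = 0.75
```

In practice such a reward would be combined with a policy-optimization algorithm (e.g., a PPO-style objective) and additional regularization toward the original model; those components are omitted here for brevity.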
Survey Aim and Contribution
Overall, the survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field. By analyzing causes, evaluation benchmarks, metrics, mitigation strategies, and future research directions, it provides valuable insights and resources for researchers and practitioners working to enhance the robustness and reliability of MLLMs.
Reference: https://arxiv.org/abs/2404.189...