Key Points

- The study investigates the spatial reasoning abilities of large language models (LLMs), focusing on their capacity to form mental images during spatial reasoning, an aspect that has received little prior attention.

- Visualization-of-Thought (VoT) prompting is proposed to elicit the spatial reasoning of LLMs by having them visualize their reasoning traces, which then guide subsequent reasoning steps. VoT significantly enhances the spatial reasoning abilities of LLMs and outperforms existing multimodal large language models (MLLMs) on tasks such as natural language navigation, visual navigation, and visual tiling in 2D grid worlds.

- The paper also sheds light on LLMs' mental imagery for spatial reasoning from a cognitive perspective, conducts quantitative and qualitative analyses of the mind's eye of LLMs and its limitations, and explores cues suggesting that this capability may originate in code pre-training.

- The effectiveness of VoT prompting is evaluated on three tasks: natural language navigation, visual navigation, and visual tiling. Experimental results demonstrate significant performance improvements with VoT prompting over other prompting methods and existing MLLMs.

- The study also describes the settings used for the GPT-4 and GPT-4 Vision models, examines the impact of visualization on final answers, and discusses possible sources from which LLMs' mental-image capability might derive.

- The research also evaluates spatial visualization and spatial understanding in LLMs through the visual navigation and visual tiling tasks, and outlines future directions and potential areas of exploration to strengthen the mind's eye of LLMs.

Summary

The paper explores Visualization-of-Thought (VoT) prompting as a way to enhance spatial reasoning in large language models (LLMs). Noting that spatial reasoning in LLMs remains underexplored, the study proposes VoT prompting, which elicits spatial reasoning by having models visualize their reasoning traces, with each visualization guiding the subsequent reasoning step. The paper evaluates the effectiveness of VoT on multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds; a minimal sketch of such a grid world follows.
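
To make the grid-world setting concrete, here is a minimal, hypothetical sketch of a visual-navigation-style instance: a player moves across a small grid toward a goal, and the grid can be rendered as text after every move, which is the kind of intermediate "drawing" VoT asks a model to produce. The grid size, symbols, and function names are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the paper's code): a tiny 2D grid world for a
# visual-navigation-style task, rendered as ASCII so an LLM can "draw"
# its intermediate state after each move.
GRID_SIZE = 4
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def render(pos, goal, size=GRID_SIZE):
    """Render the grid as text: P = player, G = goal, . = empty cell."""
    rows = []
    for r in range(size):
        row = []
        for c in range(size):
            if (r, c) == pos:
                row.append("P")
            elif (r, c) == goal:
                row.append("G")
            else:
                row.append(".")
        rows.append(" ".join(row))
    return "\n".join(rows)

def step(pos, move):
    """Apply one move, clamping to the grid boundary."""
    dr, dc = MOVES[move]
    r = min(max(pos[0] + dr, 0), GRID_SIZE - 1)
    c = min(max(pos[1] + dc, 0), GRID_SIZE - 1)
    return (r, c)

# Track state visually after each step, as VoT asks the model to do.
pos, goal = (0, 0), (2, 3)
for move in ["down", "right", "right", "down", "right"]:
    pos = step(pos, move)
    print(f"After '{move}':\n{render(pos, goal)}\n")
```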

The experimental results demonstrate that VoT significantly enhances the spatial reasoning abilities of LLMs and outperforms existing multimodal large language models (MLLMs) on these tasks. The study sheds light on LLMs' mental imagery for spatial reasoning from a cognitive perspective and develops the "visual navigation" and "visual tiling" tasks along with corresponding synthetic datasets spanning varying levels of complexity, offering a well-designed testbed for research on spatial reasoning; a toy version of the tiling task is sketched below. Additionally, the paper discusses the potential viability of VoT in MLLMs and presents experiments on its effectiveness in spatial reasoning tasks for multimodal models.
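
For intuition, the following is a toy, hypothetical version of a visual-tiling question: given a small rectangle with a set of masked (empty) cells, decide whether a given polyomino can be placed so that it covers only masked cells. The specific shapes, grid, and the `fits` helper are illustrative assumptions rather than the paper's dataset code.

```python
# Illustrative sketch (assumptions, not the paper's dataset code): a toy
# visual-tiling question. Polyomino shapes are sets of (row, col) offsets.
L_TROMINO = {(0, 0), (1, 0), (1, 1)}  # an L-shaped tromino
I_TROMINO = {(0, 0), (0, 1), (0, 2)}  # a straight tromino

def fits(mask, piece, rows, cols):
    """Return True if `piece` can be placed to cover only masked cells."""
    for r in range(rows):
        for c in range(cols):
            cells = {(r + dr, c + dc) for dr, dc in piece}
            if cells <= mask:
                return True
    return False

# A 2x3 rectangle with three masked (empty) cells to fill.
masked = {(0, 0), (1, 0), (1, 1)}
print(fits(masked, L_TROMINO, rows=2, cols=3))  # True: the L fits exactly
print(fits(masked, I_TROMINO, rows=2, cols=3))  # False: the bar does not
```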

Further, the paper details how VoT prompting generates reasoning traces and visualizations in an interleaved manner to elicit the spatial awareness of LLMs. The study discusses the implementation of VoT across the different tasks, experimental findings comparing different settings, and an analysis of visual state tracking behaviors across prompting methods; a sketch of what such a prompt might look like appears below. The findings suggest that VoT significantly enhances LLMs' spatial reasoning capabilities and could contribute to the advancement of their broader cognitive and reasoning abilities. However, the paper also notes limitations and challenges, indicating the need for further research and exploration in this area.
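
As a rough illustration of the interleaved format, the sketch below assembles a VoT-style prompt that asks the model to draw the state after each reasoning step. The instruction wording and the `build_vot_prompt` helper are paraphrased assumptions for illustration, not the paper's exact prompt.

```python
# Hypothetical sketch of a VoT-style prompt builder (the paper's exact
# prompt wording may differ). The key idea: after every reasoning step,
# the model is asked to draw the current state as text before deciding
# on the next step.

def build_vot_prompt(task_description: str, initial_state: str) -> str:
    """Assemble a prompt that interleaves reasoning with visualization."""
    return (
        f"{task_description}\n\n"
        f"Initial state:\n{initial_state}\n\n"
        "Solve the task step by step. After each reasoning step, "
        "visualize the current state as a text grid, then use that "
        "drawing to decide the next step."
    )

# Example usage with the navigation instance sketched earlier.
print(build_vot_prompt(
    "Navigate the player P to the goal G on the grid.",
    ". . . .\nP . . .\n. . G .\n. . . .",
))
```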

Overall, the study concludes by highlighting the potential of VoT to enhance the "mind’s eye" in MLLMs and outlines future research directions, including investigating the application of VoT in MLLMs and exploring effective methods for learning generalized internal representations of mental images to further improve the spatial reasoning abilities of LLMs.

Reference: https://arxiv.org/abs/2404.036...