Key Points

1. Humans use drawing as an aid during thinking and reasoning, but current multimodal language models lack this drawing capability.

2. The Visual Sketchpad framework lets a multimodal language model draw sketches such as lines, boxes, and markers while it reasons.

3. The Visual Sketchpad framework enables the language model to perform subsequent planning and reasoning based on the drawn sketches.

4. On mathematical tasks (geometry, functions, graph algorithms, chess) and visual reasoning tasks, Visual Sketchpad significantly improves the performance of language models and sets a new state of the art.

5. The Visual Sketchpad framework lets language models flexibly call specialist vision models during reasoning, such as object detection and semantic segmentation models.

6. Compared with text-only chain-of-thought reasoning, Visual Sketchpad more closely mirrors the human thinking process.

7. The Visual Sketchpad framework unifies earlier methods that rely on visual prompts and tools into a single, more general framework.

8. Visual Sketchpad treats the language model as an intelligent agent that can plan and execute actions, not just a model that outputs text.

9. Visual Sketchpad opens new research opportunities to improve multimodal intelligence by combining linguistic and visual reasoning.
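
Points 2, 3, and 8 describe a cycle of planning, drawing, and reasoning over the result. The following is a minimal sketch of such a loop, not the authors' implementation. The character-grid canvas, the `draw_line` tool, and `toy_policy` (a stand-in for the multimodal language model) are all hypothetical names chosen for illustration.

```python
def new_canvas(width, height):
    """Toy 'image': a grid of characters the agent can draw on (assumption)."""
    return [["." for _ in range(width)] for _ in range(height)]


def render(canvas):
    """Turn the grid into text so it can be fed back as an observation."""
    return "\n".join("".join(row) for row in canvas)


def draw_line(canvas, row):
    """Hypothetical sketch tool: draw a horizontal auxiliary line."""
    canvas[row] = ["-"] * len(canvas[row])
    return canvas


def toy_policy(history):
    """Stand-in for the language model: pick the next action from the history."""
    if not any(action == "draw_line" for action, _ in history):
        return "draw_line", {"row": 1}
    return "answer", {"text": "line drawn at row 1"}


def run_agent(canvas, max_steps=5):
    """Plan, act, observe; stop when the policy decides to answer."""
    history = []
    for _ in range(max_steps):
        action, args = toy_policy(history)
        if action == "answer":
            return args["text"], history
        canvas = draw_line(canvas, **args)
        # The rendered sketch becomes the observation for the next step.
        history.append((action, render(canvas)))
    return None, history


answer, history = run_agent(new_canvas(4, 3))
print(answer)
print(history[0][1])
```

In a real system the policy would be a call to a multimodal model and the observation would be an actual image; the loop structure is the only part this example is meant to convey.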

Summary

Experimental Results of Sketchpad

Sketchpad produces visual sketches by letting the language model call specialist vision models (for example, detection and segmentation models) and then uses these sketches for further reasoning. The authors ran extensive experiments on mathematical tasks (geometry, functions, graph theory, and chess) and on complex visual reasoning tasks (such as depth estimation, spatial reasoning, and puzzles).
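
To illustrate how the output of a vision model can become a sketch, the example below overlays detection boxes and numbered markers on a toy image. It is only a sketch under stated assumptions: `stub_detector` is a placeholder that returns fixed boxes rather than a real detector, and the image is a character grid rather than pixels.

```python
def stub_detector(image):
    """Placeholder for a real object detector (assumption): returns fixed boxes.

    Boxes are (left, top, right, bottom) with inclusive coordinates.
    """
    return [{"label": "cat", "box": (1, 1, 3, 2)}]


def draw_boxes(image, detections):
    """Return a copy of the image with box outlines and numbered markers."""
    sketch = [row[:] for row in image]  # leave the input image untouched
    for idx, det in enumerate(detections, start=1):
        left, top, right, bottom = det["box"]
        for y in range(top, bottom + 1):
            for x in range(left, right + 1):
                if y in (top, bottom) or x in (left, right):
                    sketch[y][x] = "#"
        # Numbered marker in the top-left corner so the model can refer to the box.
        sketch[top][left] = str(idx)
    return sketch


image = [["." for _ in range(5)] for _ in range(4)]
sketch = draw_boxes(image, stub_detector(image))
print("\n".join("".join(row) for row in sketch))
```

The annotated copy, rather than the raw image, is what would be handed back to the language model for the next reasoning step.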

The results show that Sketchpad significantly improves language models' performance, setting a new state of the art on all tasks. Specifically, Sketchpad yields an average gain of 12.7% on math tasks and 8.6% on visual tasks. GPT-4o with Sketchpad sets new records on V*Bench (80.3%), BLINK spatial reasoning (83.9%), and BLINK visual correspondence (80.8%).

Applications and Prospects of Sketchpad

The paper also analyzes the agreement between the sketching plans generated by Sketchpad and plans produced by humans, and finds that they are well aligned and exhibit similar reasoning patterns. Overall, Sketchpad opens a new research direction in which language models strengthen their reasoning by generating visual sketches, a step toward more human-like multimodal intelligence. The framework could be applied in robotics, computer vision, and other fields. The authors have released the code and data for further research.

Reference: https://arxiv.org/abs/2406.09403