Key Points

1. MobileVLM V2 models, built upon MobileVLM, significantly improve vision language model performance through novel architectural design, improved training schemes, and high-quality dataset curation.

2. MobileVLM V2 1.7B achieves better or on-par performance compared to larger VLMs at the 3B scale and outperforms a variety of VLMs at the 7B+ scale.

3. The study addresses the challenges of deploying vision language models in resource-constrained scenarios such as mobile devices, self-driving cars, and embodied AI systems.

4. The paper proposes three main improvements: exploiting training data that most benefits small vision language models, exploring effective training strategies, and redesigning the projector as a high-performance lightweight module.

5. MobileVLM V2 uses 1.2 million high-quality image-text pairs to align vision-language features and incorporates more academic tasks to increase data diversity and instruction-following capacity.

6. The paper introduces a new projector, LDPv2, which aligns vision-language features with fewer parameters, reduces the number of visual tokens, and enhances positional information with minimal performance degradation.

7. The study shows that MobileVLM V2 achieves a new state-of-the-art tradeoff between performance and inference speed across several vision language benchmarks, outperforming previous SOTA models by clear margins.

8. Latency measurements on mobile devices demonstrate that MobileVLM V2 exhibits lower inference latency than counterparts at the same parameter scale.

9. Experimental results show that MobileVLM V2 outperforms many large models with substantial inference advantages, paving the way for AI deployment in resource-limited scenarios.

Summary

The paper introduces MobileVLM V2 as an improved vision language model and demonstrates its enhanced performance compared to its predecessor, MobileVLM. The authors attribute the significant performance improvements to novel architectural designs, an improved training scheme tailored for mobile VLMs, and high-quality dataset curation. They specifically highlight that the MobileVLM V2 1.7B model achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale and outperforms a large variety of VLMs at the 7B+ scale. The paper also emphasizes the faster inference speed of MobileVLM V2 compared to state-of-the-art VLMs.

Architectural Improvements
The architectural improvements primarily focus on the projector design, a critical component for aligning visual and language features. The authors introduce a lightweight downsample projector (LDPv2) that significantly reduces the number of visual tokens while enhancing positional information, leading to improved performance. The paper also highlights the use of high-quality image-text pairs for effective vision-language feature alignment and the incorporation of more academic tasks to increase data diversity and instruction-following capacity.
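
To make the projector idea concrete, below is a minimal PyTorch sketch of a lightweight downsample projector in the spirit of LDPv2: point-wise convolutions map vision features to the language model width, average pooling reduces the visual token count, and a depthwise convolution with a skip connection adds positional information. The specific layer choices, the dimensions (1024-dim vision features, 2048-dim LLM embeddings, 576 tokens reduced to 144), and the class name are illustrative assumptions, not the paper's exact LDPv2 specification.

```python
# A minimal sketch of a lightweight downsample projector (LDPv2-style).
# The exact layer choices (two point-wise convolutions, 2x2 average pooling,
# and a depthwise-convolution positional encoding with a skip connection)
# are assumptions for illustration, not the authors' verified definition.
import torch
import torch.nn as nn


class LightweightDownsampleProjector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int, downsample: int = 2):
        super().__init__()
        # Point-wise (1x1) convolutions map vision features to the LLM width.
        self.pointwise = nn.Sequential(
            nn.Conv2d(vision_dim, llm_dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(llm_dim, llm_dim, kernel_size=1),
        )
        # Average pooling shrinks the token grid, e.g. 24x24 -> 12x12
        # (576 -> 144 visual tokens for a 2x downsample).
        self.pool = nn.AvgPool2d(kernel_size=downsample, stride=downsample)
        # Depthwise convolution injects positional information cheaply;
        # the residual connection keeps the pooled features intact.
        self.peg = nn.Conv2d(llm_dim, llm_dim, kernel_size=3, padding=1, groups=llm_dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_tokens, vision_dim) from the vision encoder.
        b, n, c = vision_tokens.shape
        h = w = int(n ** 0.5)  # assume a square token grid, e.g. 24x24
        x = vision_tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.pointwise(x)
        x = self.pool(x)
        x = x + self.peg(x)  # positional enhancement with skip connection
        return x.flatten(2).transpose(1, 2)  # (batch, reduced_tokens, llm_dim)


# Example: 576 vision tokens of width 1024 projected to 144 tokens of width 2048.
tokens = torch.randn(1, 576, 1024)
projector = LightweightDownsampleProjector(vision_dim=1024, llm_dim=2048)
print(projector(tokens).shape)  # torch.Size([1, 144, 2048])
```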

Novel Training Scheme
In terms of training, the authors describe a novel training scheme that fully exploits the potential of high-quality multimodal data and contributes to a new state-of-the-art tradeoff between performance and inference speed across several vision language benchmarks.
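
The summary does not detail the scheme itself; purely as an illustration, the sketch below shows a generic two-stage recipe consistent with the data described above: vision-language alignment on image-text pairs, followed by multi-task instruction tuning. The module names (`projector`, `llm`), the choice of which parts are trainable in each stage, and the assumption that the model returns a language-modeling loss are hypothetical, not taken from the paper.

```python
# A minimal two-stage training sketch, assuming a common VLM recipe:
# (1) vision-language alignment on image-text pairs, then (2) multi-task
# instruction tuning. The trainable-module split per stage is an assumption.
import torch
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable


def train_stage(model, loader, trainable_parts, lr, steps):
    # Freeze everything, then unfreeze only the parts selected for this stage.
    set_trainable(model, False)
    for name in trainable_parts:
        set_trainable(getattr(model, name), True)
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr
    )
    for _, batch in zip(range(steps), loader):
        loss = model(**batch)  # assumed to return the language-modeling loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()


# Stage 1: align vision and language features on high-quality image-text pairs.
# train_stage(model, alignment_loader, ["projector", "llm"], lr=2e-5, steps=...)
# Stage 2: multi-task / instruction tuning on a more diverse data mixture.
# train_stage(model, multitask_loader, ["projector", "llm"], lr=2e-5, steps=...)
```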

The paper concludes by presenting comparisons with state-of-the-art methods and emphasizing the inference speed advantage and superior performance of MobileVLM V2. Additionally, the authors explore several approaches to improving MobileVLM V2, including data scaling schemes, improved training strategies, and efficient modality alignment design, which together yield new state-of-the-art results with significant inference advantages over larger models.

Reference: https://arxiv.org/abs/2402.03766