Key Points
- Recent advancements in MultiModal Large Language Models (MM-LLMs) augment off-the-shelf Large Language Models (LLMs) to support MultiModal (MM) inputs or outputs.
- MM-LLMs leverage LLMs to empower a wide range of MM tasks, inheriting their robust language generation and zero-shot transfer capabilities.
- MM-LLMs refine alignment between modalities and with human intent through an MM Pre-Training (PT) + MM Instruction-Tuning (IT) pipeline.
- Most existing MM-LLMs focus on MM content comprehension and text generation, while some extend to generation in specific modalities such as images and speech.
- The general MM-LLM architecture comprises five components: Modality Encoder, Input Projector, LLM Backbone, Output Projector, and Modality Generator (see the sketch after this list).
- Promising directions for MM-LLMs include building more powerful models, constructing more challenging benchmarks, deploying MM-LLMs on resource-constrained platforms, and integrating embodied intelligence and continual IT into practical applications.
- Effective training recipes include raising input image resolution, using high-quality Supervised Fine-Tuning (SFT) data, and exploring retrieval-based approaches to strengthen MM generation.
- Continual IT challenges MM-LLMs to adapt to new tasks while retaining performance on earlier ones, which requires mitigating catastrophic forgetting and negative forward transfer (an illustrative rehearsal sketch follows the Summary).
- The survey offers insights for researchers working on MM-LLMs and supports the field's continuous development through real-time progress tracking on a dedicated website.
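To make the five-component architecture concrete, here is a minimal, illustrative PyTorch sketch. The class name, layer choices, and dimensions are hypothetical stand-ins (e.g., plain linear layers in place of a CLIP-style encoder or a diffusion generator), not an implementation taken from the survey.

```python
import torch
import torch.nn as nn

class MMLLMSketch(nn.Module):
    """Hypothetical five-component MM-LLM pipeline (illustrative only)."""

    def __init__(self, enc_dim=512, llm_dim=1024, gen_dim=512):
        super().__init__()
        # 1. Modality Encoder: maps raw MM input features (e.g., image
        #    patches) into an encoder space; stand-in for a CLIP/ViT encoder.
        self.modality_encoder = nn.Linear(768, enc_dim)
        # 2. Input Projector: aligns encoder features with the LLM token space.
        self.input_projector = nn.Linear(enc_dim, llm_dim)
        # 3. LLM Backbone: a pretrained LLM in practice; a single
        #    Transformer layer stands in here.
        self.llm_backbone = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=8, batch_first=True
        )
        # 4. Output Projector: maps LLM hidden states into the space the
        #    generator conditions on.
        self.output_projector = nn.Linear(llm_dim, gen_dim)
        # 5. Modality Generator: produces non-text output (a diffusion
        #    model in practice); a linear decoder stands in here.
        self.modality_generator = nn.Linear(gen_dim, 768)

    def forward(self, mm_features, text_embeddings):
        # Encode and project the MM input into LLM-compatible tokens.
        mm_tokens = self.input_projector(self.modality_encoder(mm_features))
        # Feed projected MM tokens alongside text token embeddings.
        hidden = self.llm_backbone(torch.cat([mm_tokens, text_embeddings], dim=1))
        # Project LLM outputs into conditioning signals for the generator.
        return self.modality_generator(self.output_projector(hidden))

# Usage: batch of 2, 16 image tokens (dim 768), 8 text tokens (dim 1024).
model = MMLLMSketch()
out = model(torch.randn(2, 16, 768), torch.randn(2, 8, 1024))
print(out.shape)  # torch.Size([2, 24, 768])
```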
Summary
The paper provides a comprehensive survey of MultiModal Large Language Models (MM-LLMs), focusing on recent advancements in the field. It outlines the general design formulations for model architecture and the training pipeline, introduces 26 existing MM-LLMs, reviews their performance on mainstream benchmarks, and distills key training recipes. The survey traces how leveraging LLMs mitigates the computational expense of MM pre-training and enhances its efficacy, giving rise to MM-LLMs, and it examines the core challenge of effectively connecting LLMs with models in other modalities: refining alignment between modalities and with human intent through an MM Pre-Training (PT) + MM Instruction-Tuning (IT) pipeline (sketched below). For future development, the paper emphasizes more powerful models, more challenging benchmarks, mobile/lightweight deployment, embodied intelligence, and continual IT. It aims to provide insights for researchers and to promote continuous development in the field, supported by a dedicated website that tracks the latest advancements in real time.
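As a hedged sketch of how the two-stage PT + IT pipeline is often realized, the snippet below continues the MMLLMSketch example above. Freezing the encoder and LLM during PT and training only the projectors reflects a recipe commonly reported in this literature; the optimizer choices and learning rates are illustrative assumptions.

```python
import torch

def freeze(module):
    """Exclude a component's parameters from gradient updates."""
    for p in module.parameters():
        p.requires_grad = False

model = MMLLMSketch()  # defined in the architecture sketch above

# Stage 1: MM Pre-Training (PT). Align modalities on X-text pairs; a common
# recipe trains only the two projectors, keeping the modality encoder and
# LLM backbone frozen so the trainable parameter count (and cost) stays small.
freeze(model.modality_encoder)
freeze(model.llm_backbone)
freeze(model.modality_generator)
pt_optimizer = torch.optim.AdamW(
    list(model.input_projector.parameters())
    + list(model.output_projector.parameters()),
    lr=1e-3,  # illustrative value
)

# Stage 2: MM Instruction-Tuning (IT). Fine-tune on instruction-formatted MM
# data to align with human intent; some recipes also adapt (parts of) the
# LLM, e.g., with LoRA adapters.
it_optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=2e-5,  # illustrative value
)
```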
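On continual IT specifically, rehearsing a small buffer of past-task examples is one common mitigation for catastrophic forgetting. The survey frames this as an open challenge rather than prescribing a fix, so the reservoir-sampling buffer below is purely an illustrative assumption.

```python
import random

class ReplayBuffer:
    """Fixed-size reservoir of past-task examples for rehearsal during
    continual IT (illustrative; not a method proposed by the survey)."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.data = []
        self.seen = 0  # total examples offered so far

    def add(self, example) -> None:
        # Reservoir sampling: every example ever offered keeps an equal
        # (capacity / seen) chance of residing in the buffer.
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k: int):
        # Mix rehearsed examples into each new task's IT batches.
        return random.sample(self.data, min(k, len(self.data)))

buffer = ReplayBuffer(capacity=4)
for ex in ["vqa-1", "caption-2", "ocr-3", "vqa-4", "asr-5"]:
    buffer.add(ex)
print(buffer.sample(2))  # e.g., ['ocr-3', 'asr-5']
```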
Reference: https://arxiv.org/abs/2401.13601