Key Points
1. Research on multimodal models for image and text tasks has expanded rapidly, producing both new models such as DeepSeek-VL and dedicated benchmarks such as MMBench and IconQA for evaluating their performance.
2. Photorealistic, text-guided image generation has attracted significant interest, as evidenced by text-guided diffusion models for image generation and editing, the CLIP-latents approach behind DALL-E 2, and related generative and representation-learning work such as Glow and BEiT v2.
3. The construction of large datasets, such as synthetic-dataset-1m-dalle3-high-quality-captions and YFCC15M, has facilitated the training and evaluation of multimodal models, particularly for tasks such as knowledge-aware visual question answering (KVQA) and dense image captioning.
4. Large-scale pretraining has also been a focus of research, spanning open foundation and chat language models such as LLaMA and natively multimodal models such as Gemini and Emu3.
5. Efforts to unify multimodal understanding and generation within a single transformer have produced models such as Show-o, building on progress in open language models such as LLaMA 2 and vision-language assistants such as MiniGPT-4.
6. New training recipes have emerged as well: EVA-CLIP improves the training of CLIP-style vision encoders at scale, and LlamaGen ("Autoregressive Model Beats Diffusion") adapts the Llama architecture for scalable autoregressive image generation.
7. The exploration of multimodal capabilities has broadened to tasks such as dense image captioning, text-to-image generation, and other vision-centric tasks, leading to models like VisionLLM and VILA-U.
8. Research has also targeted more integrated and efficient multimodal models, from early-fusion mixed-modal models such as Chameleon to compact assistants such as LLaVA-Phi, with integrated capabilities evaluated on benchmarks such as MM-Vet.
9. Finally, benchmarks such as MMMU and models such as Transfusion and RAPHAEL aim to provide comprehensive evaluation of, and solutions for, multimodal understanding, reasoning, and generation across disciplines and modalities.
Summary
This paper introduces Janus, an autoregressive framework that unifies multimodal understanding and generation tasks. Previous unified models often rely on a single visual encoder for both tasks, which can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, Janus decouples visual encoding into separate pathways for understanding and generation, while still utilizing a unified transformer architecture for processing.
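To make the decoupling concrete, below is a minimal PyTorch-style sketch of the idea: two independent visual pathways (a continuous feature encoder for understanding, a discrete-code embedding for generation) feed one shared autoregressive transformer with separate prediction heads. All module names, dimensions, and layer counts here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class JanusStyleModel(nn.Module):
    """Sketch of a Janus-style unified model: decoupled visual encoding,
    shared autoregressive transformer, task-specific output heads."""

    def __init__(self, hidden_dim=2048, text_vocab=102400,
                 image_codebook=16384, und_feature_dim=1024):
        super().__init__()
        # Understanding pathway: a semantic vision encoder (placeholder here)
        # plus an adaptor projecting its features into the LLM embedding space.
        self.und_encoder = nn.Identity()                    # e.g., a pretrained ViT
        self.und_adaptor = nn.Linear(und_feature_dim, hidden_dim)

        # Generation pathway: images are pre-tokenized into discrete codes by a
        # VQ tokenizer (outside this module); we only embed the code indices.
        self.gen_embed = nn.Embedding(image_codebook, hidden_dim)

        # Shared autoregressive backbone (stand-in for the unified LLM).
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=16,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=24)

        # Separate heads: next text token vs. next image code.
        self.text_head = nn.Linear(hidden_dim, text_vocab)
        self.image_head = nn.Linear(hidden_dim, image_codebook)

    def _causal_mask(self, seq):
        # Additive causal mask so each position attends only to earlier positions.
        sz = seq.size(1)
        return torch.triu(torch.full((sz, sz), float("-inf"), device=seq.device),
                          diagonal=1)

    def forward_understanding(self, image_features, text_embeds):
        # image_features: (B, N, und_feature_dim) from the understanding encoder.
        vis = self.und_adaptor(self.und_encoder(image_features))
        seq = torch.cat([vis, text_embeds], dim=1)
        h = self.backbone(seq, mask=self._causal_mask(seq))
        return self.text_head(h)          # logits over text tokens

    def forward_generation(self, text_embeds, image_code_ids):
        # image_code_ids: (B, M) discrete codes from the generation tokenizer.
        img = self.gen_embed(image_code_ids)
        seq = torch.cat([text_embeds, img], dim=1)
        h = self.backbone(seq, mask=self._causal_mask(seq))
        return self.image_head(h)         # logits over image codes
```

Only the input pathways and output heads differ between the two tasks; the transformer weights in the middle are shared, which is the unified part of the design.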
Benefits of Decoupling
The decoupling of visual encoding offers two key benefits. First, it alleviates the conflict between the visual encoder's roles in understanding and generation, eliminating the need to make tradeoffs when selecting the visual encoder. Second, the framework becomes more flexible, as both the multimodal understanding and generation components can independently select the most suitable encoding methods.
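As a rough illustration of the second benefit, the sketch below treats each pathway as an independently configurable component. The encoder names are placeholders chosen for illustration, not a statement of Janus's actual components.

```python
from dataclasses import dataclass

@dataclass
class VisualPathwayConfig:
    encoder_name: str   # which pretrained encoder or tokenizer backs this pathway
    output_type: str    # "continuous_features" (understanding) or "discrete_codes" (generation)

@dataclass
class UnifiedModelConfig:
    understanding: VisualPathwayConfig
    generation: VisualPathwayConfig

# Understanding favors a semantically rich feature encoder, generation a discrete
# tokenizer with good reconstruction; both names below are illustrative examples.
config = UnifiedModelConfig(
    understanding=VisualPathwayConfig("siglip-large-patch16-384", "continuous_features"),
    generation=VisualPathwayConfig("vq-tokenizer-16384-codes", "discrete_codes"),
)

# Swapping the generation tokenizer touches only its own pathway; the shared
# transformer and the understanding pathway are unchanged.
config.generation = VisualPathwayConfig("movq-style-tokenizer", "discrete_codes")
```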
Experimental Results
Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models on both multimodal understanding and generation benchmarks. On multimodal understanding benchmarks like MMBench, SEED-Bench, and POPE, Janus (1.3B) outperforms larger models like LLaVA-v1.5 (7B) and Qwen-VL-Chat (7B). On visual generation benchmarks MSCOCO-30K and GenEval, Janus achieves state-of-the-art results, outperforming text-to-image generative models like DALL-E 2 and SDXL.
Future Prospects
The authors highlight that the simplicity, high flexibility, and overall effectiveness of Janus make it a strong candidate for the development of next-generation unified multimodal models. The decoupled visual encoding approach allows Janus to be easily extended to incorporate additional input modalities, such as point clouds, EEG signals, or audio data, further expanding its versatility as a generalist multimodal model.
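As a hypothetical illustration of that extensibility (not something implemented in the paper), supporting a new input modality would amount to registering one more encoding pathway that maps its inputs into the shared embedding space, leaving the existing pathways and the backbone untouched.

```python
import torch
import torch.nn as nn

class ModalityPathway(nn.Module):
    """One input pathway: modality-specific features -> shared LLM embedding space."""
    def __init__(self, encoder: nn.Module, feature_dim: int, hidden_dim: int):
        super().__init__()
        self.encoder = encoder                       # modality-specific encoder
        self.adaptor = nn.Linear(feature_dim, hidden_dim)

    def forward(self, x):
        return self.adaptor(self.encoder(x))

hidden_dim = 2048
pathways = nn.ModuleDict({
    # Existing visual understanding pathway (the generation pathway would instead
    # embed discrete codes, as in the earlier sketch).
    "image_understanding": ModalityPathway(nn.Identity(), 1024, hidden_dim),
    # Hypothetical new pathway for audio (or EEG signals, point clouds, ...).
    "audio": ModalityPathway(nn.Identity(), 512, hidden_dim),
})

# Every pathway yields (batch, seq_len, hidden_dim) tokens for the shared
# autoregressive transformer, so adding a modality is purely additive.
audio_features = torch.randn(2, 50, 512)             # stand-in for encoded audio
audio_tokens = pathways["audio"](audio_features)     # -> (2, 50, 2048)
```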
Reference: https://arxiv.org/abs/2410.13848