Summary

The article introduces CM3Leon, a model that can generate both text and images. It is trained on a large dataset of licensed image and text data, first in a retrieval-augmented pretraining stage and then in a multi-task supervised fine-tuning stage. The model produces high-quality images, follows instructions for tasks such as language-guided image editing, and outperforms competing text-to-image systems while using roughly 5x less training compute. The article situates CM3Leon against other approaches to text-to-image generation, notably diffusion models and earlier autoregressive token models, and argues that the results make autoregressive models worth exploring further for combined text and image tasks.
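
To make the "token-based, decoder-only" idea concrete, here is a minimal sketch of the autoregressive loop such a model runs: images are represented as sequences of discrete codes from a tokenizer, interleaved with text tokens in one shared vocabulary, and the decoder emits one token at a time. Everything here (`toy_decoder`, the vocabulary sizes, the token counts) is a hypothetical stand-in for exposition, not CM3Leon's actual architecture.

```python
import numpy as np

# Assumed vocabulary layout: text tokens first, then discrete image codes.
TEXT_VOCAB = 4096        # hypothetical text sub-vocabulary size
IMAGE_VOCAB = 8192       # hypothetical image-token codebook size
VOCAB = TEXT_VOCAB + IMAGE_VOCAB

rng = np.random.default_rng(0)

def toy_decoder(tokens: list[int]) -> np.ndarray:
    """Stand-in for a decoder-only transformer: returns next-token logits.
    A real model would attend over the whole interleaved sequence."""
    return rng.normal(size=VOCAB)

def sample_next(logits: np.ndarray, temperature: float = 1.0) -> int:
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(VOCAB, p=probs))

# Text-to-image generation: condition on a tokenized caption, then
# autoregressively emit image codes that a detokenizer (not shown)
# would map back to pixels.
prompt = [17, 93, 402]                   # pretend-tokenized caption
sequence = list(prompt)
IMAGE_TOKENS_PER_SAMPLE = 1024           # e.g. a 32x32 grid of codes

for _ in range(IMAGE_TOKENS_PER_SAMPLE):
    logits = toy_decoder(sequence)
    logits[:TEXT_VOCAB] = -np.inf        # constrain sampling to image codes
    sequence.append(sample_next(logits))

print(f"generated {len(sequence) - len(prompt)} image tokens")
```

Because text and image tokens share one stream, the same loop also runs in the other direction (image tokens in the prompt, text tokens sampled out), which is what lets a single decoder handle captioning, editing, and infilling.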

Key points


- The article introduces CM3Leon, a retrieval-augmented, token-based, decoder-only multi-modal language model for generating and infilling text and images.
- CM3Leon is trained with a large-scale retrieval-augmented pretraining stage and a second multi-task supervised fine-tuning stage.
- The model achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
- CM3Leon demonstrates unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.
- The model is trained on a diverse dataset from Shutterstock that includes licensed image and text data.
- Retrieval augmentation is essential for efficient training, and a new contrastive decoding method improves generation quality (sketched generically after this list).
- In text-to-image generation, CM3Leon outperforms competing approaches, including diffusion models and non-autoregressive token models.
- The model is also capable of non-trivial image-to-text generation, despite being trained on a smaller text dataset.
- Supervised fine-tuning enhances performance in various vision-language tasks, including image captioning and visual question answering.
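
The contrastive decoding mentioned above can be sketched in its generic form: score each candidate token by how much more a strong, conditioned model likes it than a weak reference model, subject to a plausibility cutoff. This is a minimal illustration under that assumption; the paper's own decoding variant differs in its details, and the names `expert_logits`, `amateur_logits`, and `alpha` are illustrative, not taken from the paper.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def contrastive_decode_step(expert_logits: np.ndarray,
                            amateur_logits: np.ndarray,
                            alpha: float = 0.1) -> int:
    """One step of generic contrastive decoding: prefer tokens the strong
    (e.g. text-conditioned) model rates far higher than a weak reference
    (e.g. unconditional) model, restricted to plausible expert tokens."""
    expert_logp = log_softmax(expert_logits)
    amateur_logp = log_softmax(amateur_logits)
    # Plausibility constraint: keep tokens within a factor alpha of the
    # expert's most likely token; everything else is masked out.
    cutoff = np.log(alpha) + expert_logp.max()
    score = np.where(expert_logp >= cutoff,
                     expert_logp - amateur_logp,
                     -np.inf)
    return int(np.argmax(score))

# Toy usage: random logits stand in for two model forward passes.
rng = np.random.default_rng(1)
token = contrastive_decode_step(rng.normal(size=100), rng.normal(size=100))
print("chosen token id:", token)
```

The intuition is similar to classifier-free guidance: the contrast amplifies whatever the conditioning (here, the text prompt) contributes, which is why such methods tend to improve prompt fidelity.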

Reference: https://ai.meta.com/research/p...