Key Points

1. The paper proposes Transfusion, a method for training a multi-modal model over discrete and continuous data by combining the language modeling (next-token prediction) loss for text with a diffusion loss for images. This allows a single transformer to be trained to predict discrete text tokens and to diffuse continuous images.

2. Transfusion is compared to Chameleon's discretization approach, and the results show that Transfusion scales better in every combination of modalities. In text-to-image generation in particular, Transfusion surpasses the Chameleon approach using less than a third of the compute, achieving approximately 2× lower FID scores.

3. The study shows that Transfusion integrates both modalities without the information loss incurred by quantizing images into discrete tokens, by training a single model with separate losses for the two modalities over shared data and parameters.

4. By introducing modality-specific encoding and decoding layers, Transfusion models can improve performance and even compress each image to just 16 patches.

5. The paper also ablates the necessity of intra-image bidirectional attention, the effect of different patch sizes for the latent image representation, and the utility of U-Net down and up blocks for encoding and decoding images, surfacing configurations that further improve Transfusion.

6. Transfusion is demonstrated to generate images at similar quality to other diffusion models. It outperforms models such as DALL-E 2 and SDXL on image generation benchmarks, while also generating text on par with Llama 1 on text benchmarks.

7. The controlled experiments show that Transfusion exhibits better scaling laws than Chameleon, with significant compute efficiency, particularly in image generation where Transfusion achieves parity with Chameleon using substantially fewer FLOPs.

8. The study also investigates the model's performance in multi-modal tasks, such as fine-tuning a 7B Transfusion model on image-to-image generation, and the results suggest that Transfusion models can generalize across new modality combinations.

9. The research suggests that Transfusion is a simple, end-to-end solution for multi-modal learning that understands and generates high-quality multi-modal data, bridging the gap between discrete sequence modeling and continuous media generation.

Summary

The paper introduces "Transfusion," a method for training a single multi-modal transformer model to generate both discrete text and continuous image data. The key innovation is combining a language modeling loss function for text tokens with a diffusion objective for image patches, allowing the model to learn from a mixture of text and image data using a shared set of parameters.
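Concretely, the training objective described above can be written as a weighted sum of the two per-modality losses. The sketch below restates it in standard DDPM notation; the variable names are adapted for illustration, with λ the coefficient that balances the diffusion term against the language modeling term.

```latex
% Combined Transfusion objective: next-token prediction on text tokens
% plus a weighted noise-prediction (DDPM) loss on image patches.
\mathcal{L}_{\text{Transfusion}}
  = \underbrace{\mathbb{E}_{y}\!\left[-\log p_\theta(y_i \mid y_{<i})\right]}_{\mathcal{L}_{\text{LM}}\ \text{(text tokens)}}
  \;+\; \lambda \cdot
    \underbrace{\mathbb{E}_{x_0,\,t,\,\epsilon}\!\left[\lVert \epsilon - \epsilon_\theta(x_t, t)\rVert^{2}\right]}_{\mathcal{L}_{\text{DDPM}}\ \text{(image patches)}}
```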

The Transfusion model takes a sequence that can contain both text tokens and image patch vectors, with each image delimited by special beginning-of-image (BOI) and end-of-image (EOI) tokens. Text tokens are processed with standard causal (autoregressive) attention, while image patches use bidirectional attention, allowing each patch to condition on every other patch in the same image. Lightweight modality-specific encoding and decoding layers convert text tokens and image patches to and from the transformer's shared representation space.
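The resulting attention pattern, causal attention overall with bidirectional attention inside each image span, can be sketched in a few lines. The function below is a minimal illustration (not the authors' code); the input encoding, a per-position modality flag plus an image id, is an assumption made for clarity.

```python
import numpy as np

def transfusion_attention_mask(modality, image_ids):
    """Build a (T, T) boolean attention mask for a mixed text/image sequence.

    modality  : length-T list, 0 for a text token, 1 for an image patch
    image_ids : length-T list, id of the image a patch belongs to (-1 for text)
    mask[i, j] = True means position i may attend to position j.
    """
    T = len(modality)
    # Causal mask: every position attends to itself and to earlier positions.
    mask = np.tril(np.ones((T, T), dtype=bool))
    # Relax causality within each image: patches of the same image attend to
    # all other patches of that image, including "future" ones.
    for i in range(T):
        for j in range(T):
            if modality[i] == 1 and modality[j] == 1 and image_ids[i] == image_ids[j]:
                mask[i, j] = True
    return mask

# Example: 3 text tokens, a 4-patch image, then 2 more text tokens.
modality  = [0, 0, 0, 1, 1, 1, 1, 0, 0]
image_ids = [-1, -1, -1, 0, 0, 0, 0, -1, -1]
print(transfusion_attention_mask(modality, image_ids).astype(int))
```

Note that text tokens appearing after an image can still attend back to all of its patches, while the patches themselves never see tokens that follow the image in the sequence.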

The paper presents controlled experiments comparing Transfusion to a Chameleon-style baseline that discretizes images into tokens and trains a standard language model over them. They find that Transfusion scales significantly better than Chameleon, reaching the same performance on text-to-image and image-captioning tasks with less than a third of the compute. Transfusion also performs better than Chameleon on text-only tasks, even though both use the same approach to modeling text.

Ablation studies confirm the importance of intra-image bidirectional attention, and show that adding U-Net encoding and decoding layers lets Transfusion compress images into far fewer patches with little loss in performance, shortening sequences and making inference on image-heavy data cheaper.
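As a rough illustration of what that compression buys, the number of transformer positions an image occupies shrinks quadratically with the patch size. The concrete numbers below (256×256 images, an 8× downsampling VAE) are assumptions chosen for illustration rather than a restatement of the paper's exact configuration.

```python
def patches_per_image(image_px: int = 256, vae_downsample: int = 8, patch_size: int = 2) -> int:
    """Transformer positions occupied by one image after VAE encoding and patchification."""
    latent_hw = image_px // vae_downsample   # e.g. 256 // 8 = 32 latent "pixels" per side
    return (latent_hw // patch_size) ** 2    # e.g. (32 // 2) ** 2 = 256 patches

print(patches_per_image(patch_size=2))  # 256 positions per image
print(patches_per_image(patch_size=8))  # 16 positions per image (the compressed setting)
```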

Finally, the paper demonstrates that scaling Transfusion to 7 billion parameters and training on 2 trillion total tokens, including 692 million image-caption pairs, produces a model that can generate high-quality images and text that is competitive with specialized image and language models of similar scale. This highlights the promise of Transfusion as a unified approach to multi-modal generation.

In summary, Transfusion is a simple and effective method for training a single transformer-based model to understand and generate both text and images, outperforming approaches that treat the modalities separately. The authors show that this unified architecture can scale to generate high-quality multi-modal content.

Reference: https://www.arxiv.org/abs/2408...