Overview

Emu is a multimodal foundation model that can generate both images and text in a multimodal context. It takes in different types of data, such as images, text, and video, and learns to generate appropriate responses based on the input. It is trained with a unified objective of predicting the next element in the sequence, whether that element is a text token or a visual embedding. Emu can be used for tasks such as image captioning, visual question answering, and text-to-image generation, where it outperforms comparable models, and its diverse pretraining data, which includes video, lets it serve as a generalist multimodal interface.
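Concretely, the core idea is a single autoregressive loss over a mixed sequence: text positions are scored with a classification loss, while visual positions are scored by regressing the next visual embedding (which in the paper comes from a vision encoder). Below is a minimal PyTorch-style sketch of that idea; the tensor shapes, the 0.5 loss weighting, and the function name are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def unified_loss(text_logits, text_targets, visual_preds, visual_targets, is_text):
    """Sketch of a unified next-token objective over an interleaved sequence.

    text_logits:    (B, T, V) predicted next-token logits at each position
    text_targets:   (B, T)    ground-truth next text tokens
    visual_preds:   (B, T, D) predicted next visual embeddings
    visual_targets: (B, T, D) target visual embeddings (e.g. from a vision encoder)
    is_text:        (B, T)    boolean mask, True where the next element is text
    All names, shapes, and the 0.5 weighting are placeholders for illustration.
    """
    # Classification loss on positions whose next element is a text token.
    ce = F.cross_entropy(text_logits[is_text], text_targets[is_text])
    # Regression loss on positions whose next element is a visual embedding.
    reg = F.mse_loss(visual_preds[~is_text], visual_targets[~is_text])
    return ce + 0.5 * reg

# Toy usage with random tensors.
B, T, V, D = 2, 8, 100, 16
is_text = torch.rand(B, T) > 0.3
loss = unified_loss(
    torch.randn(B, T, V),
    torch.randint(0, V, (B, T)),
    torch.randn(B, T, D),
    torch.randn(B, T, D),
    is_text,
)
print(loss.item())
```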

Key Points

1. Emu is a multimodal foundation model that can generate both images and text in a multimodal context.

2. It can take in any single-modality or multimodal input indiscriminately, such as interleaved images, text, and video.

3. Emu is trained with a unified objective of predicting the next text token or regressing the next visual embedding in the multimodal sequence.

4. It supports diverse pretraining data sources at scale, including videos with interleaved frames and text, webpages with interleaved images and text, and web-scale image-text and video-text pairs (see the data-layout sketch after this list).

5. Emu can serve as a generalist multimodal interface for image-to-text and text-to-image tasks, as well as for in-context image and text generation.

6. It demonstrates strong performance across zero-shot and few-shot tasks, including image captioning, visual question answering, video question answering, and text-to-image generation.

7. Emu can be instruction-tuned to serve as a multimodal assistant that aligns well with human instructions and performs tasks accordingly.

8. It shows impressive in-context learning, generating context-related images and following context-related instructions.

9. Emu outperforms other state-of-the-art large multimodal models in both benchmark performance and scalability.
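To make the interleaved-data idea in point 4 concrete, here is a minimal sketch of how a document mixing text spans and image (or video-frame) references might be flattened into one training sequence. The ImageRef type, the placeholder id, and the whitespace tokenizer are hypothetical illustrations, not the actual format used by Emu.

```python
from dataclasses import dataclass
from typing import List, Union

# Placeholder id reserving slots later filled by visual embeddings (illustrative).
IMAGE_PLACEHOLDER = -1

@dataclass
class ImageRef:
    path: str         # where to load the image or video frame from
    num_tokens: int   # how many visual-embedding slots it occupies

def build_sequence(items: List[Union[str, ImageRef]], tokenize) -> List[int]:
    """Flatten an interleaved document into one mixed token sequence."""
    seq: List[int] = []
    for item in items:
        if isinstance(item, ImageRef):
            # Reserve slots for the vision encoder's embeddings.
            seq.extend([IMAGE_PLACEHOLDER] * item.num_tokens)
        else:
            seq.extend(tokenize(item))
    return seq

# Toy usage with a fake whitespace "tokenizer".
vocab = {}
def tokenize(text: str) -> List[int]:
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]

doc = [
    "A cat sitting on a mat.",
    ImageRef("frame_000.jpg", num_tokens=4),
    "The cat then jumps off.",
    ImageRef("frame_001.jpg", num_tokens=4),
]
print(build_sequence(doc, tokenize))
```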

Reference: https://arxiv.org/abs/2307.052...