Key Points
1. MINT-1T is the largest and most diverse open-source multimodal interleaved dataset to date, containing 1 trillion text tokens and 3.4 billion images.
2. MINT-1T includes data from previously untapped sources such as PDFs and ArXiv papers, in addition to HTML documents.
3. Scaling multimodal interleaved datasets is an engineering challenge that requires substantial effort, so sharing the data curation process and releasing the dataset greatly benefits the research community.
4. Experiments show that large multimodal models trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS, while offering a 10x increase in scale.
5. MINT-1T's HTML subset outperforms OBELICS on visual question answering tasks but performs worse on captioning benchmarks.
6. Adding PDF and ArXiv documents to the dataset mixture further improves the model's in-context learning performance compared to OBELICS.
7. The model trained on the full MINT-1T data mixture outperforms OBELICS and the HTML-only MINT-1T model on most in-context learning benchmarks.
8. The full MINT-1T model also outperforms OBELICS on the MMMU multimodal reasoning benchmark, but performs worse on Mantis-Eval, which requires more complex reasoning over multiple images.
9. The diversity of data sources in MINT-1T, including PDFs and ArXiv papers, leads to improved performance on science and technology domains compared to the HTML-focused OBELICS dataset.
Summary
This paper introduces MINT-1T, the largest and most diverse open-source multimodal interleaved dataset to date. Multimodal interleaved datasets, which contain free-form sequences of images interleaved with text, are crucial for training frontier large multimodal models (LMMs), yet such datasets have remained scarce in the open-source research community. MINT-1T addresses this gap with one trillion text tokens and 3.4 billion images, a 10x scale-up from the previous largest open-source dataset, OBELICS. Beyond scale, MINT-1T also diversifies its data sources, going beyond HTML documents to include previously untapped sources such as PDFs and ArXiv papers.
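To make the notion of an interleaved document concrete, the following is a minimal sketch, not MINT-1T's actual release schema, of how such a document might be represented: an ordered list of blocks, each holding either a text span or an image reference, so the original image/text ordering is preserved for training.

```python
# Hypothetical schema for an interleaved multimodal document
# (illustrative only; not the schema used by the MINT-1T release).
from dataclasses import dataclass, field
from typing import Literal, Optional

@dataclass
class Block:
    kind: Literal["text", "image"]   # which modality this block holds
    text: Optional[str] = None       # populated when kind == "text"
    image_url: Optional[str] = None  # populated when kind == "image"

@dataclass
class InterleavedDocument:
    source: str                      # e.g. "html", "pdf", or "arxiv"
    blocks: list[Block] = field(default_factory=list)

doc = InterleavedDocument(
    source="html",
    blocks=[
        Block(kind="text", text="Figure 1 shows the overall pipeline."),
        Block(kind="image", image_url="https://example.com/fig1.png"),
        Block(kind="text", text="The pipeline filters low-quality pages."),
    ],
)
```

Keeping images and text in a single ordered sequence is what distinguishes interleaved data from paired image-caption datasets, and it is the property the curation pipeline must preserve at scale.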
Benefits and Performance of MINT-1T Dataset
The authors emphasize that scaling multimodal interleaved datasets requires substantial engineering effort, and by sharing the data curation process and releasing the dataset, they aim to greatly benefit the research community. Experiments show that LMMs trained on MINT-1T rival the performance of models trained on OBELICS, while offering a tenfold increase in scale.
Data Engineering Challenges and Diversity of MINT-1T Dataset
The paper outlines the data engineering challenges in scaling multimodal interleaved datasets, including handling large document sizes and preserving the original ordering of images and text. It also discusses the diversity of the dataset, noting that the inclusion of PDFs and ArXiv papers helps improve domain coverage compared to datasets like OBELICS, which are more concentrated in the Humanities and Social Sciences domains.
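The ordering challenge can be illustrated with a short, hedged sketch. This is not the authors' released pipeline; it assumes BeautifulSoup (bs4) and a hypothetical html_to_interleaved helper, and simply walks the HTML DOM in document order so that text and image blocks are emitted in the sequence they appear on the page, rather than collecting all text first and appending images at the end.

```python
# Order-preserving extraction sketch (assumption: bs4 is available;
# html_to_interleaved is a hypothetical helper, not from the paper).
from bs4 import BeautifulSoup, NavigableString, Tag

def html_to_interleaved(html: str) -> list[dict]:
    """Walk the DOM in document order, emitting text and image blocks."""
    soup = BeautifulSoup(html, "html.parser")
    blocks: list[dict] = []
    for node in soup.descendants:
        if isinstance(node, Tag) and node.name == "img" and node.get("src"):
            blocks.append({"kind": "image", "src": node["src"]})
        elif isinstance(node, NavigableString):
            text = node.strip()
            if text:
                blocks.append({"kind": "text", "text": text})
    return blocks

blocks = html_to_interleaved(
    "<p>Results are shown below.</p><img src='plot.png'><p>Accuracy improves.</p>"
)
# -> text block, image block, text block, in the original page order
```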
The Significance of MINT-1T Dataset
Overall, MINT-1T represents a significant advancement in open-source multimodal interleaved datasets, providing a valuable resource for training and evaluating large multimodal models that can process interleaved sequences of images and text.
Reference: https://arxiv.org/abs/2406.11271