Key Points

1. The paper introduces FoMo-in-Flux, a benchmark for controlled continual multimodal pretraining that extends beyond monolithic pretraining datasets to specialized subdomains with fine-grained control over data streams and adaptation over long task horizons.

2. FoMo-in-Flux consists of 63 classification and retrieval datasets, comprising over 2.53M samples grouped into 23,045 concepts spanning diverse visual domains. It provides high-quality, class-specific captions generated with a scalable two-stage captioning mechanism.

3. The paper outlines the FoMo-in-Flux training and evaluation pipeline, including compute budgeting with Memory-Adjusted FLOPs (MAFs), which account for both FLOP counts and peak memory usage (a rough sketch of this idea follows the key points).

4. The paper categorizes continual pretraining into major, patch, and minor updates, drawing parallels to semantic versioning in software development. It focuses on studying minor updates, which are adaptations to whole subdomains and general concepts.

5. Using FoMo-in-Flux, the paper explores the complex landscape of practical continual pretraining through method-centric investigations, studies of (meta) learning rate schedules across the task sequence, and model/compute scaling experiments.

6. The paper provides a data-centric perspective on continual pretraining, studying the impact of different data stream orderings, data mixture ratios, and pretraining data pool choices.

7. The paper's key findings suggest that model merging techniques exhibit promising continual pretraining dynamics, retaining strong zero-shot performance while achieving substantial gains in knowledge accumulation.

8. The paper highlights the importance of learning rate schedules, model scaling, and compute scaling (when combined with model merging) for effective continual pretraining under practical constraints.

9. The paper provides a practitioner's guide to continual multimodal pretraining, covering insights on method choices, learning rate scheduling, and data-centric considerations for real-world deployment.
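
The paper's exact MAF formula is not reproduced in this summary, so the snippet below is only a rough sketch of the underlying idea: charging a method not just for its raw training FLOPs but also for its peak memory footprint relative to a reference budget. The function `memory_adjusted_flops`, the 80 GB reference, and the scaling rule are illustrative assumptions rather than the paper's definition.

```python
def memory_adjusted_flops(flops: float, peak_memory_gb: float,
                          reference_memory_gb: float = 80.0) -> float:
    """Illustrative (hypothetical) compute metric: scale raw training FLOPs
    by the fraction of a reference GPU memory budget a method consumes, so
    memory-hungry methods are charged a proportionally larger budget.

    NOTE: this is NOT the paper's exact MAF definition, only a sketch of the
    idea of budgeting FLOPs and peak memory jointly.
    """
    memory_factor = max(peak_memory_gb / reference_memory_gb, 1e-6)
    return flops * memory_factor


# Example: two updates with identical FLOPs but different peak memory
full_finetune = memory_adjusted_flops(flops=1.2e18, peak_memory_gb=72.0)
lora_update = memory_adjusted_flops(flops=1.2e18, peak_memory_gb=24.0)
print(full_finetune / lora_update)  # the full finetune is charged ~3x the budget
```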

Summary
The paper introduces FoMo-in-Flux, a benchmark for controlled continual multimodal pretraining. The goal is to go beyond studying continual pretraining on monolithic datasets like TiC-RedCaps/TiC-DataComp, and instead focus on specialized subdomains with fine-grained control over data streams and adaptation over long task horizons.

FoMo-in-Flux Datasets and Pretraining Process
FoMo-in-Flux consists of 63 classification and retrieval datasets - either publicly available or introduced as part of this work - totaling over 2.53M samples grouped into 23,045 concepts spanning diverse visual domains. This allows the authors to experiment with precise and controlled ordering of the data encountered at each continual pretraining stage. The paper also describes the process of generating high-quality captions for each image using a scalable two-stage captioning mechanism.

Investigation of Continual Pretraining
Using FoMo-in-Flux, the authors investigate continual pretraining from multiple perspectives. They study the impact of different data mixtures and stream orderings that emulate real-world deployment scenarios. They also evaluate a range of continual learning methods, from simple finetuning to parameter-efficient updates and model merging. Additionally, they analyze the influence of learning rate schedules, model/compute scaling, and other experimental design choices.
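
As a rough illustration of how such a protocol can be organized, the sketch below composes each training batch from streaming, buffer, and pretraining data at a fixed mixture ratio and rewinds a cosine learning rate schedule at every update. The function names, the 50/25/25 mixture, the step counts, and the generator-style interface are assumptions made for this sketch, not the paper's exact configuration.

```python
import math
import random


def cosine_lr(step, total_steps, lr_max=1e-5, lr_min=1e-6):
    """Cosine learning rate decay, rewound at the start of every update."""
    t = step / max(total_steps - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))


def sample_batch(stream_task, buffer, pretrain_pool, batch_size=256, mix=(0.5, 0.25, 0.25)):
    """Compose one batch from new-task (streaming) data, a replay buffer of
    previously seen tasks, and the original pretraining data pool."""
    n_stream = int(mix[0] * batch_size)
    n_buffer = min(int(mix[1] * batch_size), len(buffer))  # buffer is empty on the first task
    n_pretrain = batch_size - n_stream - n_buffer
    return (random.sample(stream_task, n_stream)
            + random.sample(buffer, n_buffer)
            + random.sample(pretrain_pool, n_pretrain))


def continual_schedule(tasks, pretrain_pool, steps_per_task=500):
    """Yield (learning_rate, batch) pairs for an ordered stream of subdomain
    updates; the caller plugs each pair into its own optimizer step."""
    buffer = []
    for task in tasks:
        for step in range(steps_per_task):
            yield cosine_lr(step, steps_per_task), sample_batch(task, buffer, pretrain_pool)
        buffer.extend(task)  # the finished task joins the replay buffer
```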

The key findings from these investigations are:

1) Model merging techniques exhibit promising continual pretraining dynamics, showing improved base generalization performance and better retention across the full sequence compared to other methods (see the merging sketch after this list).

2) Parameter-efficient tuning techniques face significant plasticity issues, sacrificing adaptation capacity to improve knowledge retention.

3) Continual learning regularization strategies under compute constraints exhibit strong plasticity issues when regularization is high, but have minimal effect when it is low.

4) Larger models and increased compute budgets (especially with model merging) help mitigate the trade-off between knowledge accumulation and retention.

5) Carefully designing the sequence of task updates, as well as the ratio of streaming, buffer, and pretraining data, is crucial for effective continual pretraining.
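
To make the model-merging finding in point 1 concrete (as referenced above), here is a minimal sketch of the common weight-interpolation recipe: after finetuning on a new subdomain, the updated weights are averaged with the previous checkpoint. The PyTorch-style state-dict interface and the interpolation coefficient `alpha` are assumptions for illustration; the paper evaluates merging strategies more broadly rather than prescribing this single recipe.

```python
import copy

import torch


@torch.no_grad()
def merge_checkpoints(prev_state, finetuned_state, alpha=0.5):
    """Linearly interpolate between the previous checkpoint and the newly
    finetuned weights (simple weight averaging, one common merging recipe).

    alpha = 0.0 keeps the old model, alpha = 1.0 keeps only the new weights."""
    merged = copy.deepcopy(prev_state)
    for name, prev_param in prev_state.items():
        if prev_param.is_floating_point():  # skip integer buffers such as batch counters
            merged[name] = (1.0 - alpha) * prev_param + alpha * finetuned_state[name]
    return merged


# Hypothetical usage after each continual update:
#   prev_state = copy.deepcopy(model.state_dict())
#   ... finetune `model` on the new subdomain ...
#   model.load_state_dict(merge_checkpoints(prev_state, model.state_dict()))
```

Keeping part of the previous checkpoint in the merge is what preserves zero-shot performance on the original pretraining distribution, while the finetuned component contributes the newly acquired knowledge.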

Overall, the paper provides comprehensive guidance for continual multimodal pretraining in real-world deployment scenarios, going beyond previous work focused on large-scale, infrequent updates or frequent, sample-level updates.

Reference: https://arxiv.org/abs/2408.144...