Key Points

1. The paper presents Autoregressive Image Models (AIM), a set of vision models pre-trained with an autoregressive objective, inspired by Large Language Models (LLMs).

2. AIM exhibits strong scaling behavior with both model capacity and the quantity of pre-training data, and the value of the pre-training objective correlates with downstream performance.

3. The pre-training of AIM is similar to that of LLMs and does not require any stability-inducing techniques.

4. AIM models demonstrate consistent improvement in downstream performance as more images are used for training, with no sign of saturation.

5. AIM achieves performance competitive with state-of-the-art joint embedding and generative methods across a set of image recognition benchmarks.

6. The paper investigates various design choices of AIM, including the impact of an MLP head, the autoregressive objective, and different pre-training strategies.

7. The study also explores the impact of scaling the approach in terms of parameters and training data, highlighting the potential for further improvement with larger models trained for longer schedules.

8. The study compares AIM to other state-of-the-art methods and demonstrates its potential as a scalable vision model that effectively leverages uncurated datasets without bias towards object-centric images or strong dependence on captions.

9. Lastly, the paper acknowledges the limitations of AIM and provides insights into the challenges and trade-offs associated with alternative methods, such as generative and joint embedding approaches.

Summary

The paper explores autoregressive pre-training of large-scale visual features. It introduces Autoregressive Image Models (AIM) and investigates whether the favorable scaling behavior observed for language models carries over to vision. The study presents two architectural modifications that adapt autoregressive pre-training to visual features and examines models ranging from 600M to 7B parameters, pre-trained on 2B uncurated images. The findings demonstrate strong scaling behavior in the AIM models and consistent improvement in downstream performance as they are trained on more images.
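
To make the pre-training recipe concrete, here is a minimal PyTorch sketch of an autoregressive patch-prediction step, assuming (as the paper describes) that an image is split into a raster-ordered sequence of patches, a causally masked transformer predicts each next patch from the preceding ones, and the loss is an L2 regression against normalized pixel values. The module names (`embed`, `causal_vit`, `head`) and all sizes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# Small dimensions for illustration only.
patch_size, dim, num_patches = 16, 256, 196            # e.g. 224x224 image, 16x16 patches
patch_dim = 3 * patch_size * patch_size                # flattened RGB patch

embed = nn.Linear(patch_dim, dim)                      # patch embedding
causal_vit = nn.TransformerEncoder(                    # stand-in for the causal ViT trunk
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
    num_layers=2,
)
head = nn.Sequential(                                  # MLP prediction head on top of the trunk
    nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, patch_dim)
)

def training_step(patches):
    """patches: (B, num_patches, patch_dim), pixel values normalized per patch."""
    x = embed(patches)
    # Causal mask: position k attends only to patches 1..k.
    k = patches.size(1)
    mask = torch.full((k, k), float("-inf")).triu(diagonal=1)
    h = causal_vit(x, mask=mask)
    pred = head(h[:, :-1])                             # predictions for patches 2..K
    target = patches[:, 1:]                            # next-patch pixel targets
    return ((pred - target) ** 2).mean()               # L2 regression loss

loss = training_step(torch.randn(4, num_patches, patch_dim))
loss.backward()
```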

The paper concludes by aligning these observations with previous studies on scaling large language models. The study shows that AIM represents a new frontier for training large-scale vision models and requires no image-specific strategies to stabilize training at scale. The paper reveals that the autoregressive pre-training objective is suitable for training visual features, and that downstream performance improves with larger model capacity and more training data. The results indicate that AIM outperforms prior generative methods, is competitive with joint embedding approaches, and is compatible with low-rank adaptation (LoRA) fine-tuning for further performance improvements.
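
Since the summary above mentions compatibility with low-rank adaptation, here is a generic LoRA sketch in PyTorch for context: a frozen pre-trained linear layer is augmented with a trainable low-rank update, so fine-tuning touches only a small number of parameters per adapted layer. The class name `LoRALinear` and the rank/scaling values are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (generic LoRA)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)               # low-rank update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: adapt a 768-dim projection; only the LoRA factors receive gradients.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)                                         # ['lora_a.weight', 'lora_b.weight']
```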

Additionally, ablation studies on the model design choices and training objectives provide valuable insights into the effectiveness of the proposed approach, demonstrating the potential for further performance improvements with longer pre-training schedules and larger models.

Reference: https://arxiv.org/abs/2401.08541