Key Points

1. The paper explores the landscape of pre-trained neural network backbones for computer vision tasks, aiming to provide practitioners with insights about which backbone to choose for specific tasks or for general use.

2. The study compares various pre-trained models, including vision-language models and those trained through self-supervised learning, neural architecture search, and other techniques, across a diverse set of computer vision tasks such as classification, object detection, segmentation, OOD generalization, and image retrieval.

3. The findings suggest that backbones pretrained in a supervised fashion on large datasets still outperform other models on most tasks, with supervised SwinV2-Base and ConvNeXt-Base pretrained on ImageNet-21k, together with CLIP ViT-Base, ranking high across multiple tasks.

4. Vision transformers (ViTs) and self-supervised learning (SSL) backbones are competitive, particularly when SSL is performed with advanced architectures and larger pre-training datasets. However, supervised learning backbones demonstrate a decisive edge for detection and segmentation tasks.

5. The study finds that performance across tasks is strongly correlated, suggesting a trend toward universal backbones that work well across a range of tasks.

6. The researchers observed that throughput and performance are inversely related: larger models tend to perform better, but at the expense of speed (a throughput-measurement sketch follows this list).

7. The study highlights the potential of monocular depth estimation as a general-purpose pretraining strategy and also examines the adversarial robustness, calibration, and test likelihood of the different backbones.

8. The paper acknowledges limitations: the landscape of tasks, backbones, and settings keeps evolving, and the analysis focuses primarily on performance-related aspects. It emphasizes that these insights will need to be updated as new backbones are introduced and further tasks and settings are considered.

9. In conclusion, the researchers released all of their experimental results along with code for putting new backbones to the test, hoping the work will serve as a useful guide for both practitioners and researchers in computer vision.
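
Point 6 concerns the speed/accuracy trade-off. Below is a minimal sketch of how one might estimate a backbone's forward-pass throughput with PyTorch and torchvision; the specific backbones, batch size, and iteration count are illustrative assumptions, not the paper's benchmarking code.

```python
import time
import torch
from torchvision import models

def images_per_second(model, batch_size=32, img_size=224, n_iters=20):
    """Rough forward-pass throughput estimate for a backbone (illustrative only)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, img_size, img_size, device=device)
    with torch.no_grad():
        for _ in range(3):                      # warm-up passes
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return batch_size * n_iters / (time.time() - start)

# Compare a smaller and a larger backbone (randomly initialized here;
# weights affect accuracy, not speed).
for name, ctor in [("ResNet-50", models.resnet50), ("ConvNeXt-Base", models.convnext_base)]:
    print(f"{name}: {images_per_second(ctor(weights=None)):.1f} images/sec")
```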

Summary

The research paper titled "Battle of the Backbones" investigates the dominant paradigm for building machine vision systems, focusing on the use of pretrained backbones in transfer learning. The paper compares various pretrained models, including those trained via self-supervised learning, vision-language models, and the Stable Diffusion backbone, across a diverse set of computer vision tasks. The study sheds light on promising research directions for advancing computer vision by providing a comprehensive analysis of more than 1500 training runs.

Advantages of Transfer Learning Using Pretrained Backbones
The paper emphasizes the advantages of transfer learning using pretrained backbones, which has led to improved performance on a wide range of applications and has reduced data requirements and training time. The study acknowledges the proliferation of choices for pretrained models and the difficulty in making informed decisions about choosing the appropriate backbone for a given task. To address this issue, the paper benchmarks a variety of pretrained models on tasks such as classification, object detection, out-of-distribution generalization, and image retrieval.
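
The transfer-learning workflow described above can be made concrete with a short sketch: load a supervised ImageNet-pretrained backbone, swap in a task-specific head, and train only that head. The choice of torchvision's ConvNeXt-Base, the class count, and the dummy batch are assumptions for illustration and do not come from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical downstream task size

# Load a backbone pretrained with supervised learning on ImageNet.
backbone = models.convnext_base(weights=models.ConvNeXt_Base_Weights.IMAGENET1K_V1)

# Freeze the pretrained weights; only the new head will be trained.
for p in backbone.parameters():
    p.requires_grad = False

# Replace the final linear layer with a task-specific head.
in_features = backbone.classifier[2].in_features
backbone.classifier[2] = nn.Linear(in_features, num_classes)

optimizer = torch.optim.AdamW(backbone.classifier[2].parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch (replace with a real DataLoader).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()
```

Freezing the backbone and training only the head is the cheapest variant of this workflow; full fine-tuning simply leaves all parameters trainable at a lower learning rate.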

Key Findings: Performance of Pretrained Models
Key findings include that supervised learning backbones, such as ConvNeXt-Base and SwinV2-Base trained on large datasets, perform best across tasks, while self-supervised learning (SSL) backbones are highly competitive in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets. The study emphasizes that the performance of models is strongly correlated across tasks and settings. Furthermore, it highlights the potential of supervised learning, SSL, and vision-language pretraining methods and provides insights into the performance of various pretrained models, including both ViTs and CNNs.
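
One simple way to quantify the cross-task correlation described above is a rank correlation between backbones' scores on two tasks. The sketch below uses scipy's spearmanr on made-up scores purely for illustration; the paper's actual numbers are available in its released results.

```python
from scipy.stats import spearmanr

# Hypothetical per-backbone scores on two tasks (not the paper's numbers).
backbones = ["ConvNeXt-B (IN-21k)", "SwinV2-B (IN-21k)", "CLIP ViT-B", "ResNet-50", "DINO ViT-B"]
classification_acc = [86.3, 86.0, 85.1, 80.4, 82.8]
detection_map      = [52.1, 51.8, 49.5, 46.2, 48.9]

# Spearman's rho compares the *rankings* of backbones on the two tasks:
# a value near 1 means the same backbones tend to win on both.
rho, pval = spearmanr(classification_acc, detection_map)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```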

Reference: https://arxiv.org/abs/2310.19909