Key Points

1. The paper identifies artifacts in the feature maps of both supervised and self-supervised Vision Transformer (ViT) networks, appearing primarily in low-informative background areas of images, and proposes a solution based on appending extra tokens to the input sequence to address this issue.

2. Numerous models, including DINOv2, exhibit high-norm outlier tokens that appear as artifacts in attention maps and can degrade performance on dense prediction tasks and unsupervised object discovery methods.

3. The study reveals that outlier tokens appear in patches with redundant information and hold less local information, suggesting that the model discards local information from these patches during inference.

4. The proposed interpretation suggests that the model learns to recognize patches with little useful information and repurpose the corresponding tokens to aggregate global image information while discarding spatial information.

5. Introducing additional register tokens into the sequence is a simple fix that removes the outlier tokens entirely, leading to improved performance on dense prediction tasks and in object discovery methods.

6. The paper provides a comprehensive evaluation of the proposed solution by training vision transformers with additional register tokens and includes quantitative analysis, ablation studies, and qualitative assessment to confirm the effectiveness of the approach.

7. The addition of register tokens did not result in a degradation of model performance and was shown to improve downstream performance in tasks such as image classification, segmentation, and monocular depth estimation.

8. The paper also compares the attention maps and features of models trained with and without register tokens, highlighting the impact of the proposed solution on the model's behavior and output quality.

9. The study concludes that the proposed solution addresses artifacts in the feature maps of various popular Vision Transformer models, leading to smoother feature maps, cleaner attention maps, and improved performance on downstream tasks such as unsupervised object discovery.

Summary

I. Investigating Outlier Values in Attention Maps of Vision Transformers
The research paper presents a detailed investigation of outlier values in attention maps of vision transformers, with a focus on DINO and DINOv2 models. The study identifies artifacts in the feature maps of both supervised and self-supervised ViT networks, which correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images. The paper proposes a simple and effective solution based on providing additional tokens to the input sequence of the Vision Transformer to mitigate this issue.
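Below is a minimal PyTorch sketch of this kind of fix. The class name `ViTWithRegisters`, the hyperparameters, and the use of `nn.TransformerEncoder` as a stand-in backbone are illustrative assumptions rather than the paper's actual implementation; the point is only to show the mechanism of learnable register tokens appended to the patch-token sequence after embedding and discarded at the output.

```python
# Sketch of the register-token fix, assuming a generic PyTorch ViT backbone.
import torch
import torch.nn as nn


class ViTWithRegisters(nn.Module):
    def __init__(self, embed_dim=768, num_patches=196, num_registers=4,
                 depth=12, num_heads=12):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, embed_dim)  # stand-in for a real patch embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable register tokens: extra slots the model can use to store
        # global information instead of hijacking background patch tokens.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.num_registers = num_registers

    def forward(self, patches):
        # patches: (B, num_patches, 16*16*3) flattened image patches
        x = self.patch_embed(patches)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        # Append register tokens after positional encoding; they carry no
        # positional information of their own.
        reg = self.registers.expand(x.shape[0], -1, -1)
        x = torch.cat([x, reg], dim=1)
        x = self.blocks(x)
        # Discard register outputs: only [CLS] and patch tokens go downstream.
        return x[:, : -self.num_registers]


# Example forward pass with random patches standing in for a real image.
feats = ViTWithRegisters()(torch.randn(2, 196, 16 * 16 * 3))
print(feats.shape)  # torch.Size([2, 197, 768]): [CLS] + patch tokens, registers dropped
```

Because the register outputs are dropped before any downstream head, the fix changes the token budget during the forward pass but leaves the model's output interface untouched.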

II. Resolving Outlier Values and Setting New State of the Art
The findings indicate that the inclusion of register tokens not only resolves the problem entirely for both supervised and self-supervised models but also sets a new state of the art for self-supervised visual models on dense visual prediction tasks. Furthermore, the study demonstrates that the proposed solution enables object discovery methods with larger models and leads to smoother feature maps and attention maps for downstream visual processing.
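One way to inspect the effect on attention maps is to look at how much attention the [CLS] token pays to each image patch. The sketch below is an illustration of that inspection, assuming access to the per-head query/key tensors of one transformer block; the `cls_attention_map` helper and all tensor shapes are assumptions, not the paper's visualization code. If register tokens are present in the sequence, their columns would simply be dropped before reshaping.

```python
# Sketch: reshape the [CLS] token's attention over patches into a 2D map.
import torch


def cls_attention_map(q: torch.Tensor, k: torch.Tensor, grid: int = 14) -> torch.Tensor:
    """q, k: (num_heads, 1 + grid*grid, head_dim) per-head queries/keys for one image.

    Returns a (num_heads, grid, grid) map of attention from [CLS] to each patch.
    """
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale   # (heads, tokens, tokens)
    attn = attn.softmax(dim=-1)
    cls_to_patches = attn[:, 0, 1:]            # [CLS] row, patch columns only
    return cls_to_patches.reshape(-1, grid, grid)


# Example with random projections standing in for real model activations.
heads, tokens, head_dim = 12, 1 + 14 * 14, 64
q = torch.randn(heads, tokens, head_dim)
k = torch.randn(heads, tokens, head_dim)
maps = cls_attention_map(q, k)
print(maps.shape)  # torch.Size([12, 14, 14])
```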

III. Understanding Outlier Tokens and Their Role in the Vision Transformer
The investigation reveals that outlier tokens with significantly higher norm values are a small fraction of the total sequence (around 2%) and appear around the middle layers of the vision transformer after sufficiently long training. These outlier tokens are observed to contain less local information about their original position in the image or the original pixels in their patch, suggesting that the model discards the local information contained in these patches during inference. However, they hold global information about the image, leading to the proposed interpretation that the model learns to recognize patches containing little useful information and recycle the corresponding tokens to aggregate global image information while discarding spatial information.
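A minimal sketch of how such high-norm tokens could be flagged is given below, assuming `patch_tokens` holds the per-patch output features of one image at some intermediate layer; the `find_outlier_tokens` helper and the norm threshold are illustrative assumptions rather than the paper's exact analysis code.

```python
# Sketch: flag high-norm "artifact" tokens among the patch features.
import torch


def find_outlier_tokens(patch_tokens: torch.Tensor, threshold: float = 150.0):
    """patch_tokens: (num_patches, dim) features from one image.

    Returns a boolean mask flagging tokens whose L2 norm exceeds the threshold;
    in the paper's observations such tokens make up roughly 2% of the sequence.
    """
    norms = patch_tokens.norm(dim=-1)   # per-token L2 norm
    mask = norms > threshold            # high-norm outlier tokens
    return mask, norms


# Example: random features stand in for real ViT outputs.
tokens = torch.randn(196, 768) * 5
mask, norms = find_outlier_tokens(tokens)
print(f"{mask.sum().item()} / {tokens.shape[0]} tokens flagged as outliers")
```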

IV. Practical Solutions and Positive Impacts
Overall, the study provides a deeper understanding of outlier values in attention maps of vision transformers and offers a practical solution to mitigate this phenomenon. The proposed modification to the token sequence and the inclusion of register tokens have been shown to have a significant positive impact on model performance, feature maps, attention maps, and downstream visual tasks. The paper also includes a comprehensive analysis of the implications and benefits of the proposed solution for various supervised and self-supervised vision transformer models.

Reference: https://arxiv.org/abs/2309.16588