Introduction and Objectives
In this paper, the authors investigate how small transformers can be trained to learn arithmetic operations and elementary mathematical functions. They examine how the format and composition of the training data affect accuracy and sample complexity.


Optimizing Training Data for Learning Arithmetic
The authors find that conventional training data (plain "A+B=C" examples) is not optimal for learning arithmetic. They propose using detailed, instructive data that spells out intermediate steps, or reversing the order of the output digits so the result is written least-significant digit first. These modifications improve the sample complexity for addition, subtraction, multiplication, sine, and square root tasks.
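To make the data-format ideas concrete, here is a minimal sketch in Python (not the authors' code) of three ways the same addition example could be serialized: the plain format, the reversed-output format, and a detailed format with digit-by-digit intermediate steps. The exact templates are illustrative assumptions.

def plain_format(a: int, b: int) -> str:
    # Conventional format: the result is written most-significant digit first.
    return f"{a}+{b}={a + b}"

def reversed_format(a: int, b: int) -> str:
    # Reversed output: result digits are written least-significant digit first,
    # matching the order in which carries are actually produced.
    return f"{a}+{b}={str(a + b)[::-1]}"

def detailed_format(a: int, b: int) -> str:
    # "Detailed/instructive" data: spell out per-digit partial sums and carries
    # before the final answer.
    da, db = str(a)[::-1], str(b)[::-1]
    carry, steps = 0, []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        carry_in = carry
        digit, carry = (x + y + carry_in) % 10, (x + y + carry_in) // 10
        steps.append(f"{x}+{y}+{carry_in}={digit} (carry {carry})")
    return f"{a}+{b}: " + ", ".join(steps) + f", answer {a + b}"

print(plain_format(57, 68))     # 57+68=125
print(reversed_format(57, 68))  # 57+68=521
print(detailed_format(57, 68))  # 57+68: 7+8+0=5 (carry 1), 5+6+1=2 (carry 1), answer 125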
The authors also examine the impact of few-shot prompting, pretraining, and model scale. Few-shot prompting improves performance, particularly on the multiplication and square root tasks; starting from a pretrained model yields higher accuracy than training from scratch; and larger models outperform smaller ones.
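As an illustration of few-shot prompting, the sketch below (with a hypothetical prompt template, not necessarily the one used in the paper) prepends a handful of solved examples to the query so the model can infer the task and output format in context.

def build_few_shot_prompt(examples, query):
    # examples: list of (a, b) operand pairs; query: the pair to be solved.
    lines = [f"{a}+{b}={a + b}" for a, b in examples]
    qa, qb = query
    lines.append(f"{qa}+{qb}=")  # the model is expected to complete the answer
    return "\n".join(lines)

print(build_few_shot_prompt([(12, 34), (7, 89), (55, 45)], (68, 27)))
# 12+34=46
# 7+89=96
# 55+45=100
# 68+27=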


Generalization Challenges and Future Approaches

The study also probes the generalization capabilities of the trained models. They generalize poorly to digit lengths not seen during training and fail to extrapolate to operands longer than those in the training set. The authors highlight length generalization as an open challenge that may require more sophisticated approaches.
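To make the length-generalization setup concrete, the following sketch (with hypothetical digit budgets) builds a training set restricted to operands of at most three digits and a held-out test set of four-digit operands; the failure mode reported in the paper is poor accuracy on the latter, unseen-length split.

import random

def sample_addition(num_digits: int) -> str:
    # Draw two operands with exactly num_digits digits and serialize the example.
    lo, hi = 10 ** (num_digits - 1), 10 ** num_digits - 1
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    return f"{a}+{b}={a + b}"

train_set = [sample_addition(random.randint(1, 3)) for _ in range(10_000)]  # seen lengths
test_set = [sample_addition(4) for _ in range(1_000)]  # unseen, longer length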


Conclusion and Contributions

Overall, the study emphasizes the importance of high-quality, instructive data for teaching transformers arithmetic operations. The findings contribute to a better understanding of the mechanisms by which transformers acquire arithmetic capabilities and highlight areas for future research.

Reference: Lee et al., "Teaching Arithmetic to Small Transformers," arXiv:2307.03381. https://arxiv.org/abs/2307.03381