Key Points
1. Validation loss was tracked during continual pre-training at two model scales (405M and 10B parameters); a combination of LR re-warming, LR re-decaying, and 5% replay struck a balance between forgetting previous data and adapting to the new data (a minimal schedule sketch follows this list).
2. Models continually pre-trained with LR re-warming, LR re-decaying, and replay exceeded the average performance of baselines trained from random initialization on individual datasets.
3. Using 5% replay had a minimal effect on downstream performance compared to 0% replay, suggesting that the forgetting-reduction provided by replay comes at little cost to adaptation and is not diminished at larger model scale.
4. English-only models continually pre-trained with learning rate re-warming and 5% replay approached or surpassed the performance of models trained on the union of both datasets.
5. Learning rate re-warming was identified as causing unwanted forgetting, while infinite learning rate schedules were introduced as a promising way to circumvent this issue.
6. In single-dataset pre-training settings, infinite learning rate schedules performed comparably to a cosine decay schedule, reaching similar final validation loss while allowing smooth transitions between pre-training phases without re-warming.
7. In a continual learning setup, the infinite learning rate schedules outperformed repeated cosine decays, with the advantages that annealing can begin at any point during the constant learning rate phase and that forgetting across dataset boundaries is negligible.
8. The study acknowledged limitations, including the small number of model scales studied, the lack of deduplication between training and validation datasets, and the need for experiments with additional distribution shifts and larger model and dataset scales.
9. Continued pre-training was identified as an efficient and promising alternative to re-training when updating large language models on new data, with the potential to significantly reduce the compute and energy required to maintain foundation models.
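To make the re-warming and re-decaying recipe concrete, the sketch below restarts a standard linear-warmup plus cosine-decay schedule at the start of each new dataset. It is a minimal sketch: the learning rates, warmup length, and phase lengths are placeholder values, not the paper's settings.

```python
import math

def cosine_lr(step, total_steps, max_lr, min_lr, warmup_steps):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Continual pre-training: when a new dataset arrives, the schedule is restarted,
# i.e. the LR is re-warmed to max_lr and then re-decayed to min_lr over the new phase.
phase_steps = [100_000, 100_000]  # steps on dataset 0, then dataset 1 (hypothetical)
schedule = []
for steps in phase_steps:
    schedule += [cosine_lr(s, steps, max_lr=3e-4, min_lr=3e-5, warmup_steps=1_000)
                 for s in range(steps)]
```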
Summary
The research paper explores the efficiency of continually pre-training large language models (LLMs) as opposed to re-training from scratch when new data becomes available. The paper demonstrates that a combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data can match the performance of re-training from scratch on all available data. The findings are demonstrated for a weak distribution shift between two commonly used LLM pre-training datasets (English→English) and a stronger distribution shift (English→German) at the 405M parameter model scale with large dataset sizes, as well as for a 10B parameter LLM.
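As an illustration of how a small replay fraction can be mixed in, the sketch below draws each training example from the previous dataset with 5% probability and from the new dataset otherwise. The function and the example-level sampling are assumptions made for clarity, not the paper's data pipeline.

```python
import random

def sample_example(new_dataset, old_dataset, replay_fraction=0.05):
    """Draw from the old dataset with probability replay_fraction (5% replay),
    otherwise from the new dataset."""
    source = old_dataset if random.random() < replay_fraction else new_dataset
    return random.choice(source)

# Hypothetical usage with toy document lists:
old_data = ["old doc 1", "old doc 2"]
new_data = ["new doc 1", "new doc 2"]
batch = [sample_example(new_data, old_data) for _ in range(8)]
```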
Importance of Learning Rate Re-warming, Learning Rate Re-decaying, and Replay
The paper highlights the importance of LR re-warming, LR re-decaying, and replay of previous data to maximize adaptation to the new dataset. Additionally, the study shows that adding a small percentage of replay can significantly reduce forgetting without having a substantial impact on downstream performance. The authors also propose infinite learning rate schedules as promising alternatives to the cosine learning rate schedule to circumvent optimization difficulties associated with learning rate re-warming.
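To sketch how such a schedule avoids re-warming, the code below implements an illustrative four-phase "infinite" schedule: linear warmup, cooldown to a constant learning rate, an arbitrarily long constant phase, and a final annealing phase. The specific functional forms and phase lengths are assumptions rather than the paper's exact schedules.

```python
import math

def infinite_lr(step, warmup_steps, cooldown_steps, anneal_start, anneal_steps,
                max_lr, const_lr, min_lr):
    """Illustrative 'infinite' LR schedule with four phases."""
    if step < warmup_steps:                              # 1) linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + cooldown_steps:             # 2) cosine cooldown to const_lr
        t = (step - warmup_steps) / cooldown_steps
        return const_lr + 0.5 * (max_lr - const_lr) * (1 + math.cos(math.pi * t))
    if step < anneal_start:                              # 3) constant phase (any length)
        return const_lr
    t = min(1.0, (step - anneal_start) / anneal_steps)   # 4) linear anneal to min_lr
    return const_lr + (min_lr - const_lr) * t

# Hypothetical phase lengths: new data can be introduced anywhere in the constant
# phase without re-warming; annealing is applied only when a checkpoint is needed.
lrs = [infinite_lr(s, 1_000, 9_000, 80_000, 20_000, 3e-4, 1e-4, 3e-5)
       for s in range(100_000)]
```

Because the learning rate never decays to zero during the constant phase, training can continue on a new dataset from the pre-annealing checkpoint without the optimization difficulties that re-warming introduces.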
Empirical Evaluations of Continual Learning Techniques
The research provides detailed empirical evaluations of continual learning techniques for LLM pre-training, highlighting their effectiveness in reducing training costs. Comparing continually pre-trained LLMs against models trained from random initialization on the union of all available data, the study shows that a simple and scalable combination of LR re-warming, LR re-decaying, and compute-equivalent replay attains similar performance on average while using significantly less compute.
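The term "compute-equivalent replay" can be illustrated with a small accounting example: replayed tokens are taken out of a fixed token budget rather than added on top of it, so the continually pre-trained model consumes the same total compute as one trained only on the new data. The budget and fraction below are hypothetical.

```python
def compute_equivalent_split(token_budget, replay_fraction):
    """Split a fixed token budget between new-data tokens and replayed tokens,
    keeping total compute unchanged."""
    replay_tokens = int(token_budget * replay_fraction)
    return token_budget - replay_tokens, replay_tokens

# e.g. a 100B-token continual phase with 5% replay:
new_tokens, replay_tokens = compute_equivalent_split(100_000_000_000, 0.05)
# -> 95B tokens from the new dataset, 5B tokens replayed from the previous one
```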
Reference: https://arxiv.org/abs/2403.08763