Key Points

1. Pre-training large language models is resource-intensive and often inefficient, necessitating more efficient self-supervision strategies.

2. The SpacTor pre-training procedure combines span corruption (SC) and replaced token detection (RTD) to improve efficiency and generalization of T5 models.

3. The paper proposes augmenting the span corruption pre-training task with the RTD objective, paired with a two-stage pre-training schedule that optimizes the hybrid objective for an initial τ iterations and then transitions to the standard SC loss (see the sketch after this list).

4. The gains from the hybrid objective are found to hinge on the two-stage pre-training schedule; with it, significant improvements in downstream benchmark performance are observed.

5. Both objectives are applied to the same input text, with RTD learning a text representation while SC learns token generation.

6. SpacTor achieves a 50% reduction in training iterations and a 40% reduction in FLOPs while maintaining task performance, outperforming baseline models given the same compute budget.

7. The SpacTor procedure is shown to scale well as model size increases, offering around 40% savings in total pre-training compute.

8. The qualitative benefits of the two-stage pre-training approach are demonstrated through empirical evaluation on a variety of NLP tasks.

9. SpacTor yields substantial gains across a range of downstream tasks, matching the performance of baseline models with significantly less compute and surpassing them under an equal compute budget.
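
A minimal sketch of the two-stage schedule from points 3 and 4, in Python. The function name, the RTD weight `lam`, and the additive combination of the two losses are illustrative assumptions, not the paper's exact formulation:

```python
def training_loss(sc_loss: float, rtd_loss: float,
                  step: int, tau: int, lam: float = 1.0) -> float:
    """Two-stage hybrid objective (illustrative sketch).

    Stage 1 (step < tau): jointly optimize span corruption (SC) and
    replaced token detection (RTD); `lam` is an assumed RTD weight.
    Stage 2 (step >= tau): fall back to the standard SC loss alone.
    """
    if step < tau:
        return sc_loss + lam * rtd_loss
    return sc_loss


# Example: the RTD term contributes only during the first tau iterations.
tau = 120_000
print(training_loss(2.3, 0.7, step=50_000, tau=tau))   # stage 1: 3.0
print(training_loss(2.3, 0.7, step=200_000, tau=tau))  # stage 2: 2.3
```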

Summary

The research paper introduces a new pre-training procedure called SpacTor, which combines span corruption (SC) and replaced token detection (RTD) into a hybrid objective for pre-training large language models (LLMs). The study shows empirically that the hybrid objective, when optimized over a two-stage pre-training schedule, achieves the same downstream performance as standard SC pre-training while reducing pre-training iterations and total floating point operations (FLOPs).
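
To make the hybrid corruption concrete, the sketch below shows how a single input text can feed both objectives: some tokens are swapped for replacements to create per-token RTD labels (random sampling here stands in for the small auxiliary generator an ELECTRA-style RTD setup trains jointly), and a contiguous span is then masked for the decoder to regenerate, as in T5's span corruption. The names, rates, and single-span simplification are hypothetical, not the paper's exact pipeline:

```python
import random

MASK = "<extra_id_0>"  # stand-in for T5's span-sentinel tokens

def corrupt_for_hybrid(tokens, vocab, replace_rate=0.1, span_len=3, seed=0):
    """Build (encoder_input, rtd_labels, decoder_targets) from one text.

    Illustrative only: real replacements come from a small trained
    generator, and T5 masks multiple spans with distinct sentinels.
    """
    rng = random.Random(seed)

    # RTD side: swap a fraction of tokens for sampled replacements;
    # label 1 marks a replaced token, 0 an original one.
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_rate:
            corrupted.append(rng.choice(vocab))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)

    # SC side: mask one contiguous span for the decoder to regenerate.
    start = rng.randrange(0, max(1, len(corrupted) - span_len))
    encoder_input = corrupted[:start] + [MASK] + corrupted[start + span_len:]
    rtd_labels = labels[:start] + [0] + labels[start + span_len:]
    decoder_targets = [MASK] + corrupted[start:start + span_len]

    return encoder_input, rtd_labels, decoder_targets

toks = "the quick brown fox jumps over the lazy dog".split()
enc, rtd, dec = corrupt_for_hybrid(toks, vocab=toks)
print(enc)   # encoder sees replaced tokens plus a masked span
print(rtd)   # per-token RTD labels aligned with the encoder input
print(dec)   # decoder regenerates the masked span (SC objective)
```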

The paper addresses the challenges associated with pre-training LLMs, such as the massive computational cost and the need for large datasets. The effectiveness of the hybrid objective is supported by extensive analysis, and the study highlights the benefits of the two-stage pre-training approach. The paper also examines the interaction between the two pre-training objectives and their impact on downstream tasks, and documents the experimental setup, including hyperparameter choices for pre-training and fine-tuning.

The results are presented with detailed breakdowns of performance for various tasks, including GLUE, SuperGLUE, SQuAD, Rainbow, BBH, and MMLU, using both T5-Base and T5-Large models. Overall, the paper presents a novel and effective approach to pre-training large language models, addressing the need for efficient self-supervision strategies.

Reference: https://arxiv.org/abs/2401.13160