Key Points
1. Large language models (LLMs) are pretrained on large unsupervised corpora and then transferred to specific downstream tasks such as machine translation. The study focuses on the scaling behavior of LLMs in this transfer learning setting, investigating how the size of the pretraining data and its alignment with the downstream task affect downstream performance metrics such as the BLEU score and downstream cross-entropy.

2. The alignment between the pretraining and downstream data significantly influences the scaling behavior of LLMs, with well-aligned distributions leading to improved downstream performance metrics. The study shows that with sufficient alignment, BLEU score and downstream cross-entropy improve monotonically with more pretraining, and the BLEU score can be predicted accurately using a log-law.

3. However, misalignment between the pretraining and downstream data can lead to non-monotonic behavior in the BLEU score, even as downstream cross-entropy continues to improve. This suggests that using cross-entropy as a sole indicator for downstream performance may lead to misleading conclusions about the relevance of the pretraining data.

4. Pretraining may bring little to no improvement to the BLEU score when the finetuning dataset is already large enough, indicating that the value of pretraining data should be carefully evaluated based on downstream task-related metrics.

5. The study proposes scaling laws for downstream cross-entropy and the BLEU score, observing that in its experiments downstream cross-entropy decreases monotonically with more pretraining data, while the BLEU score can follow a non-monotonic trend when the pretraining data is not sufficiently aligned with the task.

6. The findings support studying downstream performance metrics directly rather than making decisions solely on the basis of cross-entropy. The study highlights that smooth metrics such as cross-entropy and task metrics such as the BLEU score can behave differently as LLMs are scaled, so downstream task-related metrics should be consulted when evaluating the value of pretraining data.

7. The study emphasizes the significance of alignment between pretraining and downstream data, providing valuable insights for guiding the design of large language models, resource allocation, and selection of appropriate training data for transfer learning settings.

8. Experimentally, the study probes the scaling behavior of LLMs through systematic experiments on different pretraining and finetuning datasets with varying degrees of alignment to the downstream task, and distills the results into concrete scaling laws for downstream LLM performance.

9. The findings offer a practical guideline for assessing the value and relevance of pretraining data for a given target downstream task, based on the proposed scaling laws and the observed empirical scaling behavior.


Summary

Research Objectives and Findings
The research paper studies scaling behavior in a transfer learning setting for large language models (LLMs) finetuned for machine translation. Specifically, it examines how the size of the pretraining data and its distributional alignment with the downstream task affect downstream performance, measured by downstream cross-entropy and BLEU score. The experiments indicate that both the size of the finetuning dataset and the alignment between pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, downstream cross-entropy and BLEU score both improve with more pretraining data, and the BLEU score can be predicted accurately with a log-law. With moderate misalignment, however, the BLEU score can fluctuate or even worsen with more pretraining, even though downstream cross-entropy keeps improving. Using cross-entropy as a proxy for task-related metrics like the BLEU score can therefore lead to critical misjudgments.
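For concreteness, the functional forms behind these laws can be sketched as below. The notation is an illustrative assumption (D_p for pretraining dataset size; A, alpha, beta, E, B, gamma for fitted coefficients); the exact parameterization and fitted values are those reported in the paper.

```latex
% Assumed form of the log-law for translation quality (illustrative notation):
\mathrm{BLEU}(D_p) \approx \left( \log\!\left( A \cdot D_p^{\alpha} \right) \right)^{\beta}

% Assumed form of the power-law for downstream cross-entropy,
% with an irreducible error term E:
L(D_p) \approx E + \frac{B}{D_p^{\gamma}}
```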

Practical Insights and Implications
The study also provides practical insights for choosing appropriate pretraining data based on the observed scaling behavior. It demonstrates that pretraining may bring little to no improvement on the BLEU score when the finetuning dataset is already large enough. The research emphasizes that assessing the value of pretraining data should be based on downstream task-related metrics, rather than solely on cross-entropy. The findings suggest that using scaling laws for downstream performance metrics can be valuable for making decisions about model development and resource allocation.

Experimentation and Results
The paper presents systematic experiments on 770-million and 3-billion parameter encoder-decoder T5 models, studying how downstream performance scales with pretraining dataset size, and proposes a log-law for the BLEU score and a power-law for downstream cross-entropy. It offers practical guidance for assessing the value of a pretraining dataset for a specific downstream task, emphasizing that decisions about model training and resource allocation should rest on downstream task-related metrics rather than on cross-entropy alone. Overall, the research provides important insights into the scaling behavior of LLMs in a transfer learning setting and guidance for choosing appropriate pretraining data based on downstream task-related metrics.
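As a minimal sketch of how such a law could be fit and then used for prediction, the snippet below fits the assumed log-law form to (pretraining size, BLEU) measurements with SciPy. The functional forms, coefficient names, and the synthetic data are illustrative assumptions, not the paper's fitted results.

```python
# Minimal sketch: fit an assumed BLEU log-law to (pretraining size, BLEU) pairs.
# The functional forms and all numbers below are illustrative placeholders,
# not results or coefficients from the paper.
import numpy as np
from scipy.optimize import curve_fit

def bleu_log_law(d_p, A, alpha, beta):
    """Assumed log-law: BLEU(D_p) = (log(A * D_p**alpha))**beta."""
    inner = np.clip(np.log(A * np.power(d_p, alpha)), 1e-6, None)  # guard against log <= 0
    return np.power(inner, beta)

def ce_power_law(d_p, E, B, gamma):
    """Assumed power-law (for completeness): L(D_p) = E + B / D_p**gamma."""
    return E + B / np.power(d_p, gamma)

# Synthetic measurements generated from the assumed law (placeholders only);
# in practice these would be BLEU scores of models pretrained on d_p tokens.
rng = np.random.default_rng(0)
d_p = np.logspace(7, 10, num=8)                       # pretraining dataset sizes (tokens)
bleu = bleu_log_law(d_p, A=2e-6, alpha=1.0, beta=0.9) + rng.normal(0.0, 0.1, d_p.size)

# Fit the coefficients (positivity bounds keep the log well-defined),
# then extrapolate to a larger pretraining budget.
popt, _ = curve_fit(bleu_log_law, d_p, bleu, p0=[1e-6, 1.0, 1.0],
                    bounds=([1e-12, 0.0, 0.0], [np.inf, np.inf, np.inf]))
print("fitted A, alpha, beta:", popt)
print("predicted BLEU at 1e11 tokens:", bleu_log_law(1e11, *popt))
```

A fit like this is only trustworthy in the well-aligned regime described above; under misalignment the BLEU trend can be non-monotonic and a monotone law of this form will not extrapolate reliably.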

Reference: https://arxiv.org/abs/2402.041...