Key Points

- The paper investigates the optimal model size and number of tokens for training a transformer language model under a given compute budget.

- It finds that current large language models are significantly undertrained, a consequence of the recent focus on scaling model size while keeping the amount of training data constant.

- The study involves training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens.

- The findings suggest that for compute-optimal training, model size and the number of training tokens should be scaled in equal proportion: for every doubling of model size, the number of training tokens should also be doubled (see the sketch following this list).

- It introduces a predicted compute-optimal model, Chinchilla (70 billion parameters), trained with the same compute budget as Gopher but on roughly four times more data, which outperforms much larger models such as Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a wide range of downstream evaluation tasks.

- The paper contrasts its results with prior scaling-law work, most notably Kaplan et al. (2020), which recommended growing model size faster than training data, and explains the methodological differences (such as adapting the learning-rate schedule to the number of training tokens) behind the revised recommendation.

- It addresses challenges faced by large language models, including their computational requirements and the growing need for large amounts of high-quality training data.

- The findings predict that, for the same compute budget, a smaller model trained on more data performs better, suggesting that future scaling efforts should focus as much on dataset size and quality as on model size.

- The study emphasizes the importance of considering ethical and privacy concerns related to training on larger datasets and the potential impacts of dataset quality and fairness on model performance.
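
As a rough illustration of the equal-scaling rule in the key points above, the sketch below estimates a compute-optimal parameter count and token count from a training FLOPs budget. It is a minimal sketch, not the paper's fitting code: it assumes the common C ≈ 6·N·D approximation for training FLOPs, rounds both scaling exponents to 0.5, and uses Chinchilla's published configuration (roughly 70B parameters and 1.4T tokens) as a calibration point; the function name `compute_optimal` is illustrative.

```python
# Minimal sketch, not the paper's fitting code: estimate a compute-optimal
# (parameters, tokens) pair from a training FLOPs budget, assuming
#   C ≈ 6 * N * D   and   N_opt ∝ C^0.5,  D_opt ∝ C^0.5  (equal scaling).
# Chinchilla's reported setup (~70B parameters, ~1.4T tokens) serves as the
# calibration point; exponents and constants are rounded illustrations.

def compute_optimal(flops_budget: float) -> tuple[float, float]:
    """Return an approximate (n_params, n_tokens) for the given budget."""
    n_ref, d_ref = 70e9, 1.4e12   # Chinchilla-scale reference point
    c_ref = 6 * n_ref * d_ref     # ~5.9e23 FLOPs under the 6·N·D rule
    scale = (flops_budget / c_ref) ** 0.5
    return n_ref * scale, d_ref * scale

if __name__ == "__main__":
    for c in (1e21, 1e23, 1e25):
        n, d = compute_optimal(c)
        print(f"C = {c:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")
```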

Summary

The research paper explores the optimal model size and number of training tokens for a transformer language model under a given compute budget, focusing on the trade-off between model size and training data. The study trains and analyzes over 400 language models with varying parameter counts and token counts to predict the optimal allocation of a computational budget. The findings suggest that for compute-optimal training, model size and the number of training tokens should be scaled in equal proportion; for the compute budget used to train Gopher, this predicts an optimal model roughly four times smaller and trained on roughly four times more tokens.
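
To make the "four times smaller, trained on four times more tokens" prediction concrete, the short check below uses the same C ≈ 6·N·D approximation to confirm that quartering the parameter count while quadrupling the token count leaves the compute budget unchanged; the Gopher-scale figures (280B parameters, 300B tokens) are the paper's reported training setup, and the helper name is illustrative.

```python
import math

# Illustrative check (not from the paper's code): under the C ≈ 6·N·D FLOPs
# approximation, a model 4x smaller trained on 4x more tokens costs the same.
def approx_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

n, d = 280e9, 300e9  # Gopher-scale reference: 280B parameters, 300B tokens
assert math.isclose(approx_flops(n, d), approx_flops(n / 4, 4 * d))
print(f"Budget: ~{approx_flops(n, d):.1e} FLOPs either way")
```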

The study introduces a new, smaller model called Chinchilla, demonstrating improved performance and substantially reduced inference cost compared to larger models. Additionally, the paper emphasizes the importance of estimating hyperparameters for large models and provides experimental heuristics for choosing them. The authors highlight the potential risks and ethical considerations associated with training and deploying large language models, and emphasize the need for an increased focus on dataset scaling and data quality. Lastly, the study proposes predictive approaches for setting model size and training duration and discusses the potential application of similar trade-offs in other modalities.
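
One of the paper's predictive approaches fits a parametric loss of the form L(N, D) = E + A/N^α + B/D^β and minimizes it under a fixed compute constraint. The sketch below implements that closed-form minimization under the assumption C ≈ 6·N·D, using the widely cited fitted constants (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); treat the exact numbers as rounded approximations, and note that the function names are illustrative rather than taken from the paper.

```python
# Sketch of the parametric-loss approach (the paper's "Approach 3"), assuming
# the reported fitted constants; the values below are rounded approximations.
#   L(N, D) = E + A / N**alpha + B / D**beta,  with compute C ≈ 6 * N * D.
# Minimizing L subject to 6*N*D = C gives the closed form
#   N_opt = G * (C/6)**(beta/(alpha+beta)),  D_opt = (C/6) / N_opt,
# where G = (alpha*A / (beta*B))**(1/(alpha+beta)).

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Fitted training loss as a function of model size and token count."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def optimal_allocation(flops_budget: float) -> tuple[float, float]:
    """Closed-form minimizer of the fitted loss under 6*N*D = flops_budget."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (flops_budget / 6) ** (BETA / (ALPHA + BETA))  # exponent ~0.45
    d_opt = (flops_budget / 6) / n_opt                         # exponent ~0.55
    return n_opt, d_opt

if __name__ == "__main__":
    n, d = optimal_allocation(1e24)  # example budget of 1e24 training FLOPs
    print(f"~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens, "
          f"predicted loss ~{predicted_loss(n, d):.2f}")
```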

The findings are supported by detailed evaluations and comparisons of Chinchilla with existing large language models on a broad set of language tasks and benchmarks, including a state-of-the-art average accuracy of 67.5% on MMLU. The paper emphasizes the importance of responsibly collecting and using large datasets, and the need for continued evaluation of large language models for bias and toxicity. The authors also note the potential broader applicability of the proposed methods for training large models in other settings.

Reference: Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022), https://arxiv.org/abs/2203.15556