Key Points

1. Adding 20 million Chinese multiple-choice questions improved model performance on multiple-choice benchmarks, but the improvement did not extend to generative evaluation benchmarks.

2. Multiple-choice data was excluded from both the pre-training and fine-tuning stages, because overfitting to that question format inflates benchmark scores without reflecting the model's true capability.

3. Incorporating 5 million instruction-data examples during the pre-training phase produced nearly identical results to adding the same data during the subsequent fine-tuning stage.

4. Introducing a system prompt affected models differently: the larger model showed clearly improved results, while the smaller model degraded slightly (see the sketch after these key points).

5. DeepSeek introduced LLMs trained on a large dataset of 2 trillion tokens in English and Chinese, and detailed its hyperparameter selection, scaling-law study, and fine-tuning attempts.

6. The authors acknowledge limitations of DeepSeek LLM, including the lack of ongoing knowledge updates after pre-training, the possibility of generating non-factual information, and weaker performance on certain Chinese-specific topics.

7. DeepSeek LLM is presented as a project dedicated to advancing open-source language models with a long-term perspective, with both the 7B and 67B base and chat models released to the community.

8. Future work includes releasing technical reports on code intelligence and Mixture-of-Experts (MoE), constructing a larger and improved dataset, and studying reinforcement learning to boost the model's reasoning capability.

9. The model's proficiency in languages other than Chinese and English remains limited and should be treated with caution, since the training data is drawn primarily from Chinese and English sources.

These points summarize the key findings, conclusions, and future work from the scientific article.
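To make the system-prompt finding (point 4) concrete, here is a minimal, hypothetical sketch of how a system prompt can be prepended when formatting a chat conversation for evaluation with and without it. The template, role names, and placeholder prompt text are illustrative assumptions, not DeepSeek LLM's actual chat format.

```python
# Illustrative only: a generic chat-formatting helper, NOT DeepSeek LLM's real template.
from typing import Dict, List, Optional

SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder; the paper's actual system prompt differs


def build_prompt(messages: List[Dict[str, str]], system_prompt: Optional[str] = None) -> str:
    """Concatenate an optional system prompt and the chat turns into a single prompt string."""
    parts = []
    if system_prompt:                      # ablation knob: evaluate with vs. without a system prompt
        parts.append(f"System: {system_prompt}")
    for msg in messages:
        parts.append(f"{msg['role'].capitalize()}: {msg['content']}")
    parts.append("Assistant:")             # generation would start after this marker
    return "\n".join(parts)


if __name__ == "__main__":
    chat = [{"role": "user", "content": "Explain scaling laws in one sentence."}]
    print(build_prompt(chat))                    # without system prompt
    print(build_prompt(chat, SYSTEM_PROMPT))     # with system prompt
```

In the paper's ablation, the same evaluation is run with and without such a prefix; the larger chat model benefits from it while the smaller one degrades slightly.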

Summary

The paper "Scaling Open-Source Language Models with Longtermism" explores the behavior of scaling laws of language models. It presents distinctive findings that facilitate the scaling of large-scale models in two prevalent open-source configurations, namely 7B and 67B, guided by the scaling laws. The study delves into the scaling laws of batch size, learning rate, data, and model scale, and identifies the impact of dataset choice on the scaling behavior. The paper also introduces DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. The DeepSeek LLM outperforms existing models, particularly in the domains of code, mathematics, and reasoning, as demonstrated in open-ended evaluations.

Additionally, a safety evaluation indicates that DeepSeek 67B Chat provides harmless responses in practice. The paper details the development and evaluation of DeepSeek LLM, covering scaling laws, hyperparameters, model and data scaling, data ablation studies, and model architecture.

Overall, the work offers practical insights into dataset choice, model scaling, and performance evaluation. Beyond the scaling-law findings, the paper reports ablations on adding multiple-choice question data and on including instruction data during pre-training, as illustrated in the sketch below.
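The multiple-choice ablation can be made concrete with a small, hypothetical evaluation sketch: multiple-choice benchmarks are typically scored by comparing per-option likelihoods, whereas generative benchmarks require the model to produce the answer itself, which is why gains on the former need not transfer to the latter. The `loglikelihood` scorer below is a dummy stand-in for a real language model, and all helper names are assumptions for illustration.

```python
# Hypothetical evaluation sketch; `loglikelihood` is a stand-in for a real LM scorer.
from typing import Callable, List


def loglikelihood(prompt: str, continuation: str) -> float:
    """Placeholder scorer: a real implementation would sum the LM's token log-probs
    of `continuation` given `prompt`. Here we return a dummy length-based score."""
    return -len(continuation)


def score_multiple_choice(question: str, options: List[str], answer_idx: int) -> bool:
    """Multiple-choice scoring: count it correct if the model ranks the gold option highest."""
    scores = [loglikelihood(question, opt) for opt in options]
    return scores.index(max(scores)) == answer_idx


def score_generative(question: str, generate: Callable[[str], str], gold: str) -> bool:
    """Generative scoring: the model must produce the answer itself (exact match here)."""
    return generate(question).strip().lower() == gold.strip().lower()


if __name__ == "__main__":
    q = "What is the capital of France?"
    opts = ["Paris", "Lyon", "Marseille", "Nice"]
    print(score_multiple_choice(q, opts, answer_idx=0))
    print(score_generative(q, generate=lambda _: "Paris", gold="Paris"))
```

A model tuned on millions of multiple-choice items can get better at ranking provided options without getting better at producing answers on its own, which matches the paper's observation that the multiple-choice gains did not extend to generative benchmarks.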

Furthermore, the system prompt is evaluated, showing improved results with larger models and a slight degradation with smaller models. The paper concludes with limitations and future work, including plans to improve Chinese knowledge and code capabilities in the upcoming version of DeepSeek LLM.

Reference: https://arxiv.org/abs/2401.02954