Key Points

1. The research introduces DeepSeekMath 7B, a domain-specific language model trained on 120B math-related tokens sourced from Common Crawl, achieving strong scores on mathematical benchmarks without relying on external toolkits or voting techniques.

2. The DeepSeekMath Corpus is constructed with a meticulously designed, iteratively refined data-selection pipeline, resulting in a high-quality dataset that outperforms existing mathematical corpora such as MathPile and OpenWebMath.

3. The research investigates the impact of code training on mathematical reasoning, finding that code training improves models’ ability to solve mathematical problems both with and without tool use.

4. ArXiv papers are found to be ineffective in improving mathematical reasoning when used as pre-training data.

5. The research presents Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that enhances the performance of instruction-tuned models while significantly reducing training resources: it forgoes PPO's critic (value) model and instead estimates the baseline from group scores.

6. The study analyzes different training methods, such as Supervised Fine-tuning (SFT), Rejection Sampling Fine-tuning (RFT), Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO), and explores their impact on model performance.

7. The research highlights the importance of online data sampling and of reward signals that differentiate among responses (rather than uniformly rewarding all correct ones), noting the superior performance of Online RFT over offline RFT and of GRPO with process supervision (GRPO+PS) over outcome supervision.

8. The study provides insights into the impact of math training on different tasks, including mathematical reasoning, theorem proving, natural language understanding, reasoning, and coding capabilities.

9. The research suggests future directions for improving reinforcement learning and mathematical reasoning in language models.

Summary

The paper titled "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" introduces DeepSeekMath 7B, a language model designed to tackle the challenge of mathematical reasoning. DeepSeekMath 7B achieves 51.7% on the competition-level MATH benchmark, approaching the performance of Gemini-Ultra and GPT-4, and its self-consistency over 64 samples (maj@64) reaches 60.9% on MATH. Two factors underpin this capability: harnessing publicly available web data through a carefully engineered data-selection pipeline, and Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning while reducing memory usage.
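To make the maj@64 figure concrete, here is a minimal sketch of self-consistency voting. The `generate` callable and the "Answer:" parsing convention are assumptions for illustration, not the paper's evaluation harness (which grades answers such as \boxed{...} expressions):

```python
from collections import Counter

def extract_final_answer(solution: str) -> str:
    # Toy parser: take the text after the last "Answer:" marker.
    # This marker is an assumption; real MATH grading parses \boxed{...}.
    return solution.rsplit("Answer:", 1)[-1].strip()

def majority_vote(generate, problem: str, k: int = 64) -> str:
    # Sample k chain-of-thought solutions and return the most frequent
    # final answer (maj@k). `generate` is an assumed callable that
    # returns one sampled solution string per call.
    answers = Counter(extract_final_answer(generate(problem)) for _ in range(k))
    return answers.most_common(1)[0][0]
```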

In addition, the paper describes the construction of the DeepSeekMath Corpus, a large-scale, high-quality pre-training corpus comprising 120B math tokens. The dataset is extracted from Common Crawl using an iteratively retrained fastText-based classifier; it covers multilingual mathematical content and is several times larger than existing mathematical corpora.
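As a rough sketch of how such a classifier might be trained and applied, the snippet below uses the fastText hyperparameters reported in the paper (dimension 256, learning rate 0.1, word n-grams up to 3, minimum word count 3, 3 epochs); the training-file path and label names are placeholder assumptions:

```python
import fasttext  # pip install fasttext

# Binary math/non-math page classifier, loosely mirroring the paper's
# pipeline (positives seeded from OpenWebMath, negatives sampled from
# general Common Crawl pages). Training file format: one page per line,
# prefixed with __label__math or __label__other.
model = fasttext.train_supervised(
    input="math_seed_train.txt",  # placeholder path, not from the paper
    dim=256,
    lr=0.1,
    wordNgrams=3,
    minCount=3,
    epoch=3,
)

def math_score(page_text: str) -> float:
    # Score a Common Crawl page; the paper ranks pages by classifier
    # score and keeps the top-scoring ones, then iteratively retrains
    # the classifier with newly annotated positives.
    labels, probs = model.predict(page_text.replace("\n", " "))
    return float(probs[0]) if labels[0] == "__label__math" else 1.0 - float(probs[0])
```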

Furthermore, the paper highlights the benefits of code training, demonstrating that it enhances models' ability to solve mathematical problems both with and without tool use. It also presents insights into the ineffectiveness of arXiv papers in improving mathematical reasoning.
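For concreteness, the "with tool use" setting (e.g. MATH+Python) can be sketched as below; `generate` is an assumed model call, and the unsandboxed `exec` is for illustration only:

```python
import contextlib
import io

def solve_with_python_tool(generate, problem: str) -> str:
    # The model emits a Python program whose printed output is taken as
    # the final answer. Real evaluation harnesses sandbox execution,
    # which this toy version does not.
    program = generate(
        "Write a Python program that prints the final answer to:\n" + problem
    )
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, {})  # UNSAFE outside a sandbox; illustration only
    return buffer.getvalue().strip()
```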

The study introduces Group Relative Policy Optimization (GRPO) as an efficient and effective reinforcement learning (RL) algorithm and demonstrates that RL methods significantly enhance the models' mathematical reasoning capabilities. The paper concludes that RL improves performance by rendering the output distribution more robust, boosting the likelihood of correct responses already among the model's Top-K candidates (Maj@K improves while Pass@K does not), rather than by fundamentally enhancing the model's capabilities.
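The core of GRPO's memory saving is its baseline: instead of a learned value model, advantages are computed relative to a group of responses sampled for the same question. A minimal sketch of the outcome-supervision variant, assuming scalar per-response rewards:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Outcome-supervision GRPO: for a group of G responses sampled for
    # the same question, each response's advantage is its reward
    # normalized by the group mean and standard deviation, replacing
    # PPO's learned value-function baseline.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy usage: a group of 4 sampled answers, rewarded 1.0 if correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # positive for correct, negative for incorrect
```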

In summary, the paper presents DeepSeekMath 7B as a state-of-the-art language model that significantly outperforms existing open-source models in mathematical reasoning and demonstrates the effectiveness of reinforcement learning in enhancing the model's mathematical reasoning abilities.

Reference: https://arxiv.org/abs/2402.03300