Key Points
1. Proof assistants such as Lean have transformed mathematical proof verification, ensuring high accuracy and reliability. Large language models (LLMs) show promise in mathematical reasoning but are held back by a shortage of formal training data. This paper introduces an approach that generates extensive Lean 4 proof data from high-school and undergraduate-level mathematical competition problems: natural language problems are translated into formal statements, low-quality statements are filtered out, and proofs are generated to produce a synthetic dataset.
2. The method is an iterative framework that improves proof quality and tames the large search space of possible proofs. The resulting model achieved a whole-proof generation accuracy of 46.3% with 64 samples on the Lean 4 miniF2F test, surpassing the GPT-4 baseline and a tree-search reinforcement learning method. It also proved problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, on which GPT-4 proved none.
3. The growing complexity of proofs in modern mathematics has driven the development of formal mathematical languages such as Lean, Isabelle, and Coq, which enable computer-verifiable proofs. Crafting formal proofs, however, demands significant effort and specialized expertise, and automated theorem proving has emerged to reduce that burden.
4. Existing approaches to easing formal proof writing focus primarily on search algorithms that explore candidate proof steps for a proposed theorem. Recent advances in LLMs suggest a complementary strategy: using pre-trained models to guide the search. In practice, however, the lack of a parallel corpus pairing informal problems with formal proofs has limited their applicability.
5. The paper proposes a method for generating extensive Lean 4 proof data from informal mathematical problems. High-school and undergraduate-level competition problems are translated into formal statements, and proofs are generated automatically by an LLM and then verified by the Lean 4 environment.
6. A multi-step pipeline raises the quality of the generated data: model scoring filters out overly simple statements, and a hypothesis-rejection step eliminates statements whose hypotheses are inconsistent and therefore vacuously provable (a Lean illustration follows this list). To accelerate proof generation, the method proves each statement and its negation in parallel, terminating as soon as either is proved.
7. The paper introduces DeepSeek-Prover, a model trained on the synthetic dataset. It achieves state-of-the-art performance, with whole-proof generation accuracies surpassing the GPT-4 baseline and reinforcement learning methods, and it proves problems in the miniF2F and FIMO benchmarks, demonstrating the potential of large-scale synthetic data to enhance theorem-proving capabilities in LLMs.
8. Experiments and benchmarks demonstrate the scalability of the approach, the quality of the generated data, and the gains from iterative enhancement, underscoring that systematic data construction is both effective and necessary for progress in automated theorem proving.
9. By leveraging large-scale synthetic proof data, the work stands to advance automated theorem proving, contribute to more reliable mathematical proof verification, and provide valuable educational resources. The authors open-source the dataset and model to support further research on formal mathematical reasoning with LLMs.
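To make the formalization and hypothesis-rejection steps concrete, here is a minimal Lean 4 sketch (using Mathlib). Both statements are invented for illustration and are not items from the paper's dataset.

```lean
import Mathlib

-- Illustrative autoformalized statement (invented, not from the paper's
-- dataset): if a real number x satisfies x + 3 = 5, then x = 2.
theorem sample_statement (x : ℝ) (h : x + 3 = 5) : x = 2 := by
  linarith

-- Hypothesis rejection: a mistranslated statement with contradictory
-- hypotheses proves `False`, so any conclusion holds vacuously; such
-- statements are detected and filtered out of the dataset.
example (x : ℝ) (h₁ : x > 1) (h₂ : x < 0) : False := by
  linarith
```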
Summary
The paper describes the generation of extensive synthetic proof data for Lean 4 from high-school and undergraduate-level mathematical competition problems. Natural language problems are translated into formal statements, low-quality statements are filtered out, and proofs are generated to form the synthetic dataset. The DeepSeekMath 7B model was fine-tuned on this dataset of 8 million formal statements with proofs. It achieved whole-proof generation accuracies of 46.3% with 64 samples and 52% cumulatively on the Lean 4 miniF2F test, surpassing baseline methods, and it proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, on which GPT-4 proved none.
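The "with 64 samples" figure corresponds to sampling many whole proofs per statement and counting a statement as solved if any sample verifies. Below is a minimal Python sketch of that evaluation loop; `sample_proof` and `lean_verifies` are hypothetical placeholders for the prover model's sampling API and a Lean 4 checker, not the paper's implementation.

```python
# Sketch of a pass@k whole-proof evaluation loop (hypothetical helpers).

def sample_proof(statement: str) -> str:
    """Placeholder: one sampled whole-proof attempt from the prover model."""
    raise NotImplementedError

def lean_verifies(statement: str, proof: str) -> bool:
    """Placeholder: compile statement + proof with Lean 4, report success."""
    raise NotImplementedError

def pass_at_k(statements: list[str], k: int = 64) -> float:
    """Fraction of statements solved by at least one of k sampled proofs."""
    solved = sum(
        1 for stmt in statements
        if any(lean_verifies(stmt, sample_proof(stmt)) for _ in range(k))
    )
    return solved / len(statements)
```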
Significance of Formal Proof Languages and Automated Theorem Proving
The paper situates formal proof languages such as Lean within mathematical proof verification, explains how the difficulty of crafting formal proofs motivates automated theorem proving, and reviews the limitations of existing approaches on complex theorems as well as the potential of large language models (LLMs) in this setting. It then introduces the method for generating extensive Lean 4 proof data from informal mathematical problems, emphasizing both the quality and the scale of the synthetic data.
The method enhances the quality of generated statements through model scoring and hypothesis rejection, improves proof quality through iterative training, and accelerates proof generation by attempting negated statements in parallel. Its effectiveness is demonstrated by the DeepSeekMath 7B model's results on the miniF2F and FIMO benchmarks, where it surpasses other state-of-the-art methods, highlighting how systematically constructed synthetic proof data boosts proficiency in automated theorem proving.
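The parallel-negation idea can be sketched as a simple race: attempt the statement and its negation concurrently and keep whichever proof verifies first. The Python below is a minimal sketch under that reading; the `prove` and `negate` helpers are hypothetical placeholders, and a real pipeline would add timeouts and process-level cancellation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def prove(statement: str) -> str | None:
    """Placeholder: run whole-proof generation plus Lean verification for
    `statement`; return a verified proof, or None if the budget runs out."""
    raise NotImplementedError

def negate(statement: str) -> str:
    """Placeholder: wrap the Lean statement in a logical negation."""
    raise NotImplementedError

def prove_with_negation_race(statement: str) -> tuple[str, str] | None:
    """Try the statement and its negation concurrently; return the first
    (statement, proof) pair whose proof verifies, or None if both fail."""
    candidates = [statement, negate(statement)]
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(prove, s): s for s in candidates}
        for fut in as_completed(futures):
            proof = fut.result()
            if proof is not None:
                # One side succeeded: best-effort cancel of the other
                # attempt (an already-running thread cannot be cancelled).
                for other in futures:
                    other.cancel()
                return futures[fut], proof
    return None  # neither the statement nor its negation was proved
```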
Correlation Between Dataset Size and Model Efficacy
Furthermore, the paper examines the correlation between dataset size and model efficacy, demonstrating the value of large-scale datasets for boosting model proficiency, and presents case studies of successful proofs and of inconsistencies caught during the hypothesis-rejection stage. Finally, it emphasizes the potential of large-scale synthetic proof data to enhance theorem-proving capabilities in LLMs and to contribute to more reliable mathematical proof verification. The researchers have open-sourced the dataset, model, and code to facilitate further research and development in automated theorem proving.
Reference: https://arxiv.org/abs/2405.14333