Key Points

1. The paper presents a paradigm for efficiently harvesting 10 million naturally existing instruction-response pairs from the pre-training web corpus to enhance the reasoning abilities of large language models (LLMs). The approach involves recalling relevant documents, extracting instruction-response pairs, and refining the extracted pairs using open-source LLMs, leading to the development of the MAmmoTH2 models.

2. The MAmmoTH2 models significantly improve the performance of LLMs on reasoning benchmarks, with the performance of MAmmoTH2-7B (Mistral) increasing from 11% to 34% on MATH and from 36% to 67% on GSM8K without training on any in-domain data.

3. The paper demonstrates the effectiveness of the approach in scaling up instruction data from the web without costly human annotation or GPT-4 distillation, offering a new perspective for future instruction tuning studies.

4. The study proposes to discover naturally existing instruction data from the web, arguing that the pre-training corpus already contains a vast amount of high-quality instruction data for LLM reasoning, covering various domains like mathematics, science, engineering, and humanities.

5. The authors also outline the process of constructing the WebInstruct dataset by recalling relevant documents, extracting Q-A pairs, and refining them using LLMs. Through this pipeline, they harvested a total of 10 million instruction-response pairs, mined purely from the web without human crowdsourcing or GPT-4 distillation.

6. The effectiveness of WebInstruct is validated by training MAmmoTH2 on various base models, including Mistral-7B, Llama3-8B, Mixtral-8×7B, and Yi-34B, and achieving significant performance improvements on reasoning benchmarks such as TheoremQA, GSM8K, MATH, ARC-C, MMLU-STEM, GPQA, and BBH.

7. The paper also addresses the limitations of existing instruction tuning datasets and compares WebInstruct with other datasets, emphasizing its uniqueness in terms of scalability and quality.

8. The study further enhances MAmmoTH2's performance on code generation, math reasoning, and instruction-following tasks by fine-tuning it on open-source instruction datasets, demonstrating its versatility and applicability in real-world scenarios.

9. Additionally, the paper presents a case study examining the quality of the extracted and refined QA pairs from the dataset, showcasing the accuracy and low hallucination rate of the harvested instruction tuning dataset.

Summary

The research paper "MAmmoTH2: Scaling Instructions from the Web" discusses an approach to efficiently improve reasoning abilities in large language models (LLMs) through instruction tuning. The paper proposes a method to harvest 10 million naturally existing instruction-response pairs from the web corpus to enhance LLM reasoning. This involves recalling relevant documents, extracting instruction-response pairs, and refining the pairs using open-source LLMs. The resulting MAmmoTH2 models, created by fine-tuning base LLMs on this dataset, significantly enhance performance on reasoning benchmarks. Notably, the performance of MAmmoTH2-7B (Mistral) increases from 11% to 34% on MATH and from 36% to 67% on GSM8K without training on any in-domain data. Further training on public instruction tuning datasets yields MAmmoTH2-Plus, which achieves state-of-the-art performance on several reasoning and chatbot benchmarks. The research demonstrates a new paradigm for building better instruction tuning data without costly human annotation or GPT-4 distillation.

The paper outlines the process of constructing the WebInstruct dataset, starting with the recall of relevant documents from the web corpus and followed by the extraction and refinement of Q-A pairs. The dataset is shown to be highly diverse and of high quality, obtained without any human crowdsourcing or GPT-4 distillation. The authors validate the effectiveness of WebInstruct by training MAmmoTH2 on various base models, resulting in significant performance enhancements across several reasoning benchmarks.
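The recall-extract-refine pipeline described above can be sketched as follows. This is a toy illustration, not the paper's actual implementation: the function names and the keyword/heuristic logic here are hypothetical stand-ins for the paper's document recall and LLM-based extraction and refinement stages.

```python
# Toy sketch of a WebInstruct-style three-stage pipeline:
#   1) recall candidate documents, 2) extract Q-A pairs, 3) refine them.
# All names and heuristics below are illustrative assumptions; the paper
# uses learned recall models and open-source LLMs for stages 2 and 3.

def recall_documents(corpus, seed_keywords):
    """Stage 1: keep documents that look instruction-rich (toy keyword filter)."""
    return [doc for doc in corpus
            if any(kw in doc.lower() for kw in seed_keywords)]

def extract_qa_pairs(document):
    """Stage 2: pull question/answer spans from a document (toy heuristic)."""
    pairs = []
    lines = document.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.rstrip().endswith("?"):
            pairs.append({"question": line.strip(),
                          "answer": lines[i + 1].strip()})
    return pairs

def refine_pair(pair):
    """Stage 3: placeholder for LLM-based cleanup (formatting, filtering)."""
    return {key: value.strip() for key, value in pair.items()}

corpus = ["What is 2 + 2?\nThe answer is 4.",
          "An unrelated page about cooking."]
docs = recall_documents(corpus, seed_keywords=["what is"])
dataset = [refine_pair(p) for d in docs for p in extract_qa_pairs(d)]
print(dataset)  # one extracted and refined Q-A pair
```

At web scale the same shape holds, but each stage is replaced by a learned component: a recall model over the pre-training corpus, and LLMs that extract and rewrite the Q-A pairs.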

Impact of Model Scaling and Loss Functions

Additionally, the research explores the impact of model scaling and loss functions on the performance of language models across different tasks. The authors compare two training objectives, LM Loss and SFT Loss, showing that increasing model size and using SFT Loss with synthetic data consistently improves accuracy across all tasks.
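The distinction between the two objectives can be illustrated with a toy calculation (this is not the paper's code): LM Loss averages cross-entropy over every token in the sequence, whereas SFT Loss masks out the prompt tokens and averages only over the response tokens.

```python
# Toy illustration of LM Loss vs SFT Loss on a 5-token sequence
# (2 prompt tokens + 3 response tokens). The probabilities are made up.
import math

def avg_nll(token_probs, loss_mask):
    """Average negative log-likelihood over tokens where loss_mask is 1."""
    losses = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(losses) / len(losses)

# Model-assigned probability of each correct next token.
probs = [0.2, 0.3, 0.9, 0.8, 0.95]

lm_loss  = avg_nll(probs, [1, 1, 1, 1, 1])  # LM Loss: all tokens contribute
sft_loss = avg_nll(probs, [0, 0, 1, 1, 1])  # SFT Loss: response tokens only

print(f"LM loss: {lm_loss:.3f}, SFT loss: {sft_loss:.3f}")
```

Here the hard-to-predict prompt tokens inflate the LM Loss, while the SFT Loss reflects only how well the model predicts the response, which is why the two objectives can behave differently during fine-tuning.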

Significance of Research in Instruction Tuning

Furthermore, the paper discusses the significance of the research in the context of instruction tuning and mathematical reasoning, surveying existing approaches to enhancing LLMs' reasoning abilities in the mathematics and science domains. It emphasizes the importance of improving general scientific reasoning and the value of developing high-quality training data for a broader range of subjects.

Conclusion: Harnessing Instruction Data for LLMs

In conclusion, the paper demonstrates the potential of harnessing vast amounts of instruction data from the web corpus to democratize the development of LLMs with enhanced reasoning capabilities, thus providing a new paradigm for building high-quality instruction tuning data without relying on costly human annotation or GPT-4 distillation.

Reference: https://arxiv.org/abs/2405.035...