Key Points

1. Effective data management plays a crucial role in both the pretraining and supervised fine-tuning stages of Large Language Models (LLMs), enhancing model performance and training efficiency.

2. Constructing high-quality pretraining datasets, with attention to data quantity, data quality, domain composition, and data-efficient learning, is essential for efficient training of LLMs.

3. Research on scaling laws explores the relationship between model size and training dataset size, guiding how large a training dataset should be for efficient pretraining of LLMs.
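The model-size-to-data relationship can be sketched numerically. As one hedged illustration drawn from the broader scaling-law literature rather than from this survey, the Chinchilla compute-optimal rule of thumb suggests roughly 20 training tokens per model parameter; the ratio below is an assumption, not a universal constant:

```python
def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Estimate a compute-optimal pretraining dataset size in tokens.

    Follows the rough Chinchilla-style heuristic that training tokens
    should scale linearly with parameter count; ~20 tokens per
    parameter is an assumed ratio, not a law.
    """
    return n_params * tokens_per_param

# Under this heuristic, a 7B-parameter model would call for
# roughly 140B training tokens.
print(f"{compute_optimal_tokens(7e9):.2e}")
```

Real dataset-sizing decisions also depend on data quality and epoch repetition, which this one-line heuristic ignores.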

4. Data quality control techniques, such as deduplication, quality filtering, toxicity filtering, and addressing social biases, are crucial for constructing high-quality pretraining datasets.
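A minimal sketch of two of these steps: exact deduplication by hashing normalized text, plus a simple length-based quality filter. The threshold and normalization are illustrative assumptions standing in for the richer techniques surveyed (e.g., MinHash near-deduplication or classifier-based quality scoring):

```python
import hashlib

def clean_corpus(docs, min_words=5):
    """Deduplicate and quality-filter a list of raw text documents.

    Exact deduplication: hash the lowercased, whitespace-normalized
    text and keep only the first occurrence. Quality filter: drop
    documents shorter than `min_words` words (a crude proxy for the
    survey's more sophisticated filters).
    """
    seen = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen or len(normalized.split()) < min_words:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The  quick brown fox jumps over the lazy dog.",  # duplicate up to whitespace
    "Too short.",
]
print(len(clean_corpus(corpus)))  # 1: the duplicate and the short doc are dropped
```

Toxicity filtering and bias mitigation would typically sit in the same pipeline as additional per-document predicates or scorers.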

5. Studies examine the impact of domain mixtures on model performance and explore methods for finding proper domain composition weights to enhance model abilities.
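Once mixture weights are chosen, one simple way to realize the composition is weighted sampling of each training example's source domain. A hedged sketch, where the weights are placeholders rather than values from the survey (real weights would come from methods such as learned domain reweighting):

```python
import random

def sample_domains(weights, n, seed=0):
    """Draw n domain labels in proportion to mixture weights.

    `weights` maps domain name -> mixing proportion; the values need
    not sum to 1 because random.choices normalizes them internally.
    """
    rng = random.Random(seed)
    domains = list(weights)
    return rng.choices(domains, weights=[weights[d] for d in domains], k=n)

# Illustrative mixture only, not a recommendation from the survey.
mixture = {"web": 0.6, "code": 0.25, "books": 0.15}
draws = sample_domains(mixture, 10_000)
print({d: draws.count(d) / len(draws) for d in mixture})
```

The empirical fractions converge to the target weights as the sample grows, which is the property a mixture scheduler relies on.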

6. Several strategies, such as data pruning, contrastive post-training, and data-efficient learning, are proposed to fine-tune LLMs more efficiently.
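Data pruning, for instance, typically scores each example and keeps only a high-value subset. A minimal sketch with a hypothetical stand-in score (a real pipeline might score by perplexity under a reference model or a learned quality estimate; `score_fn=len` below is purely illustrative):

```python
def prune_dataset(examples, score_fn, keep_fraction=0.5):
    """Keep the top-scoring fraction of a fine-tuning dataset.

    `score_fn` should assign higher values to more useful examples;
    any callable works, so richer scorers can be dropped in.
    """
    ranked = sorted(examples, key=score_fn, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

# Hypothetical stand-in score: prefer longer instructions.
data = [
    "short",
    "a medium length instruction",
    "a much longer and more detailed instruction",
]
print(prune_dataset(data, score_fn=len, keep_fraction=0.34))
```

The interface deliberately separates selection from scoring, since the survey's point is that the scoring strategy, not the top-k mechanics, determines fine-tuning efficiency.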

7. Managing instruction data in supervised fine-tuning is crucial and spans data quantity, data quality, task composition, and efficient learning from data.

8. Future directions in LLM data management include exploring multimodal data management, mitigating hallucinations, addressing social biases and improving fairness, and investigating data curriculum strategies.

9. This survey serves as a comprehensive resource for practitioners to construct powerful LLMs through effective and efficient data management practices.

Summary

The paper discusses the importance of data management in the training of Large Language Models (LLMs), focusing on both the pretraining and Supervised Fine-Tuning (SFT) stages. It emphasizes the significance of well-suited training datasets in improving model performance and training efficiency. The paper provides a comprehensive overview of current research in LLM data management, covering various noteworthy aspects of data management strategy design, such as data quantity, data quality, domain/task composition, and data-efficient learning in the SFT stage. It also outlines the challenges and future directions in LLM data management. The paper serves as a guiding resource for practitioners aspiring to construct powerful LLMs through effective data management practices.

The authors review existing studies on LLM training data management and provide in-depth insights. They discuss challenges and future directions, including the need for a comprehensive understanding of the impacts of data management, the development of a general data management framework suitable for a broad range of applications, the exploration of multimodal data management, the mitigation of hallucinations, and the management of social biases and fairness in LLM training. The paper also highlights efforts in exploring data curricula and in separating conflicting data.

Overall, the paper presents a thorough overview of the importance of data management in the training of LLMs and highlights the need for further research and development in this area to improve the construction of LLMs through effective data management practices.

Reference: https://arxiv.org/abs/2312.01700