Key Points

1. The emergence of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of real-world data with synthetic data generation.

2. Current investigations into this field lack a unified framework; this paper organizes relevant studies around a generic workflow of synthetic data generation.

3. The two key requirements for high-quality synthetic data are faithfulness (logical and grammatical coherence) and diversity (variation in text length, topic, writing style).

4. Prompt engineering, including task specification, conditional prompting, and in-context learning, is a crucial aspect of synthetic data generation using LLMs.

5. Multi-step generation, through sample-wise and dataset-wise decomposition, can address the challenges of long-text processing and logical reasoning when generating complex data.

6. Data curation approaches, including sample filtering and label enhancement, are necessary to address the issues of noisy and low-quality samples generated by LLMs.

7. Evaluation of synthetic data quality involves both direct methods (assessing faithfulness and diversity) and indirect methods (evaluating downstream task performance).

8. Key research directions include developing autonomous synthetic data generation agents, incorporating human-friendly interactive systems, and leveraging domain-specific knowledge bases for enhanced generation.

9. Collaboration between large and small models can be further exploited for more effective data curation and quality control.
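Point 7 above mentions direct evaluation of synthetic data diversity. One common heuristic for this (an illustrative sketch, not a method attributed to the paper) is the distinct-n score, the ratio of unique n-grams to total n-grams across a corpus:

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a corpus.

    A simple proxy for the 'diversity' requirement: higher values
    mean less repetitive synthetic data. Illustrative only.
    """
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

samples = [
    "The movie was great and I loved it",
    "The movie was great and I loved it",  # exact duplicate lowers the score
    "A slow start but a rewarding finish",
]
print(round(distinct_n(samples, n=2), 3))  # 0.65
```

In practice such lexical metrics are often complemented by embedding-based distances, since surface n-gram variety does not guarantee topical or stylistic variation.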

Summary

The scientific article provides a comprehensive overview of current research on using Large Language Models (LLMs) to generate synthetic data, addressing the limitations of real-world data and proposing potential future directions. The authors highlight the potential of LLMs for synthetic data generation in light of the long-standing dilemma between data quantity and quality in deep learning, and emphasize the need for a unified framework to systematically organize and advance the field of LLM-driven synthetic data generation. The paper identifies gaps in current research, outlines prospects for future study, and discusses the potentially game-changing impact of LLMs in addressing the limitations of real-world data, such as the high cost, scarcity, and inherent biases of human-generated data.

The article delves into the capabilities of LLMs for generating synthetic data that mimics the characteristics and patterns of real-world data. It highlights the advantages of LLMs, including knowledge acquired through pretraining, exceptional linguistic comprehension, and instruction-following capabilities. These properties enable the creation of tailored datasets for specific applications with more flexible process designs.

The paper systematically summarizes common strategies for synthetic data generation with LLMs, including conditional prompting to diversify generation, in-context learning to guide instruction following, and factorizing complex tasks into simpler sub-tasks for multi-step generation. Additionally, it discusses the challenges in generating high-quality synthetic data and offers solutions such as sample-wise and dataset-wise decomposition and conditional prompting with finer-grained attributes.
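The strategies above can be illustrated with a minimal prompt-assembly sketch. It combines a task specification, conditional attributes (to steer diversity), and in-context demonstrations; all names and the prompt layout are assumptions for illustration, not a template from the paper:

```python
def build_generation_prompt(task, attributes, demonstrations):
    """Assemble a synthetic-data generation prompt from three parts:
    a task specification, conditional attributes that steer diversity
    (e.g. topic, length, style), and in-context demonstrations.
    Layout and field names are illustrative assumptions.
    """
    lines = [f"Task: {task}", "", "Constraints:"]
    for key, value in attributes.items():
        lines.append(f"- {key}: {value}")
    lines.append("")
    lines.append("Examples:")
    for demo in demonstrations:
        lines.append(f"- {demo}")
    lines.append("")
    lines.append("Now write one new example:")
    return "\n".join(lines)

prompt = build_generation_prompt(
    task="Generate a movie review with a sentiment label.",
    attributes={"topic": "science fiction", "length": "two sentences", "style": "casual"},
    demonstrations=['"Loved every minute of it." -> positive'],
)
print(prompt)
```

Varying the attribute values across calls is what conditional prompting refers to: each combination conditions the LLM toward a different region of the output space, countering the repetitiveness of unconstrained generation.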

The article also addresses data curation, focusing on sample filtering and label enhancement through heuristic metrics, auxiliary models, and sample re-weighting. It outlines the current mainstream evaluation methods, both direct and indirect, for assessing the quality and application effectiveness of the generated data. The paper highlights the need for future studies to focus on activating the reasoning and planning capabilities of LLMs for autonomous synthetic data generation, human-friendly interactive systems for data generation, knowledge-driven data generation, and diverse collaborative modes between large and small models for data curation.
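The curation step can be sketched as confidence-based filtering plus re-weighting. The scorer below is a hypothetical stand-in for an auxiliary model (e.g. a small classifier checking label agreement); the threshold, the toy keyword heuristic, and all names are assumptions for illustration:

```python
def curate(samples, score_fn, min_confidence=0.6):
    """Heuristic sample filtering and re-weighting.

    score_fn stands in for an auxiliary model returning a confidence
    in [0, 1] that a (text, label) pair is correct. Low-confidence
    samples are dropped; survivors keep their confidence as a training
    weight so noisier samples contribute less.
    """
    curated = []
    for text, label in samples:
        conf = score_fn(text, label)
        if conf >= min_confidence:
            curated.append((text, label, conf))
    return curated

def toy_score(text, label):
    # Toy auxiliary scorer: agreement between a keyword heuristic
    # and the synthetic label. Purely illustrative.
    looks_positive = any(w in text.lower() for w in ("great", "love", "fun"))
    predicted = "positive" if looks_positive else "negative"
    return 0.9 if predicted == label else 0.2

raw = [
    ("This was great fun", "positive"),
    ("This was great fun", "negative"),   # noisy label, filtered out
    ("Dull and forgettable", "negative"),
]
kept = curate(raw, toy_score)
print(len(kept))  # 2
```

This also hints at the large/small model collaboration named in key point 9: a cheap small model can score and filter the output of an expensive large generator.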

The paper concludes by summarizing the potential impact and ethical concerns of synthetic data generation and the need for a unified framework to organize and advance LLM-driven synthetic data generation. The research is supported by the Pioneer R&D Program of Zhejiang, NSFC Grants, and the Fundamental Research Funds for the Central Universities.

Reference: https://arxiv.org/abs/2406.151...