Key Points

1. Large language models (LLMs) rely on strong evaluators at every stage of the development lifecycle: as reward models during training and as a substitute for human evaluation at inference time. Improvements in evaluation capabilities therefore benefit the entire workflow.

2. Building strong evaluator models typically requires large amounts of high-quality human preference data, which can be costly and time-consuming to collect. As models improve, the existing annotations can also become outdated.

3. The paper proposes an iterative self-training approach that uses no human-annotated preferences in the training loop, relying purely on synthetically generated data.

4. The method first uses prompting to generate contrasting synthetic preference pairs for a given input, such that one response is designed to be inferior to the other. It then uses the model itself as an LLM-as-a-Judge to generate reasoning traces and judgments for these pairs.

5. Judgments whose verdicts agree with the intended preference are retained, and this labeled synthetic data is used to fine-tune the LLM-as-a-Judge model; the improved judge then repeats the whole process in the next iteration for further self-improvement (a sketch of the pair-construction step follows this list).

6. Starting from Llama-3-70B-Instruct, the proposed self-taught evaluator improves accuracy on RewardBench from 75.4 to 88.7 (with majority vote), matching or outperforming reward models trained with human annotations.

7. Synthetic data has been beneficial for efficiently acquiring training examples in settings where real-world data can be hard to access or annotate, such as for coding tasks.

8. The paper explores using synthetic data to construct preference pairs, generate reasoning traces and judgments, and use the model's own verified judgments as training labels, iteratively improving the LLM-as-a-Judge evaluator.

9. Experiments show that the self-taught evaluator trained on synthetic data outperforms the seed model and matches the performance of top-performing reward models trained with human-annotated data.
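
The pair-construction step in points 4 and 5 can be sketched as follows, assuming a hypothetical llm(prompt) -> str helper that wraps the seed model; the prompt wording here is illustrative rather than the paper's exact prompt:

    from typing import Callable, Tuple

    def build_preference_pair(
        instruction: str,
        llm: Callable[[str], str],  # hypothetical helper wrapping the seed model
    ) -> Tuple[str, str]:
        """Return a (chosen, rejected) response pair for one instruction.

        The chosen response answers the original instruction; the rejected
        response answers a deliberately modified instruction, so it is likely
        to be an inferior answer to the original one.
        """
        chosen = llm("Respond to the following user request:\n" + instruction)

        # Ask the model for a related but noticeably different instruction,
        # then answer that instead; the result serves as the "rejected" side.
        modified_instruction = llm(
            "Rewrite the following instruction so that it asks for something "
            "related but noticeably different:\n" + instruction
        )
        rejected = llm("Respond to the following user request:\n" + modified_instruction)

        return chosen, rejected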

Summary

Traditional Evaluation Methodologies and Proposed Iterative Self-Training Approach
The research paper discusses the challenges and limitations of traditional model-based evaluation, which primarily relies on collecting human preference judgments over model responses. These methods are time-consuming, expensive, and difficult to scale to new tasks or evaluation criteria. To address these limitations, the paper presents an iterative self-training approach that improves evaluators without human annotations, using only synthetic training data. The approach uses the large language model (LLM) itself as a judge: prompting first produces contrasting synthetic preference pairs for a given input, with one response constructed to be inferior to the other, and the model then generates reasoning traces and judgments for these pairs, which are labeled as correct or not against the known construction.
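
To make the loop concrete, here is a hedged sketch of one self-training round, assuming hypothetical build_pair, judge, and finetune callables; the sampling count and the one-trace-per-pair filter are illustrative choices rather than the paper's exact recipe:

    from typing import Callable, Iterable, List, Tuple

    def collect_judge_training_data(
        instructions: Iterable[str],
        build_pair: Callable[[str], Tuple[str, str]],       # e.g. build_preference_pair above
        judge: Callable[[str, str, str], Tuple[str, str]],  # -> (reasoning trace, verdict "A"/"B")
        samples_per_pair: int = 8,
    ) -> List[dict]:
        """One round of synthetic data construction for judge fine-tuning.

        For each instruction, build a (chosen, rejected) pair, sample judgments
        from the current judge, and keep a reasoning trace only when its verdict
        agrees with the known construction (response A is the intended winner).
        """
        examples = []
        for instruction in instructions:
            chosen, rejected = build_pair(instruction)
            for _ in range(samples_per_pair):
                reasoning, verdict = judge(instruction, chosen, rejected)
                if verdict == "A":  # judgment is correct w.r.t. the synthetic label
                    examples.append({
                        "instruction": instruction,
                        "response_a": chosen,
                        "response_b": rejected,
                        "judgment": reasoning,
                    })
                    break  # keep one verified trace per pair in this sketch
        return examples

    # The collected examples would then be used to fine-tune the judge, and the
    # loop repeats with the updated model, e.g.:
    #     for _ in range(num_iterations):
    #         data = collect_judge_training_data(prompts, build_pair, judge)
    #         judge = finetune(judge, data)   # hypothetical fine-tuning step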

Performance Improvement and Comparison with Existing Evaluators
In the experiments, the authors start from a seed model and show that their method improves accuracy on the RewardBench benchmark from 75.4 to 88.7 with majority vote (88.3 without). This matches or outperforms reward models derived from the same seed model using human annotations. The approach thus substantially improves evaluators without any human annotations, outperforming commonly used LLM judges and matching the performance of top-performing reward models trained with labeled examples. The gains are particularly notable on challenging categories such as Chat Hard, Safety, and Reasoning.
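
The 88.7 figure is obtained by majority voting over multiple sampled judgments; a generic version of that aggregation might look as follows, reusing the hypothetical judge callable from the earlier sketch (the number of votes is illustrative):

    from collections import Counter
    from typing import Callable, Tuple

    def majority_vote_verdict(
        instruction: str,
        response_a: str,
        response_b: str,
        judge: Callable[[str, str, str], Tuple[str, str]],  # same hypothetical judge as above
        num_votes: int = 16,
    ) -> str:
        """Sample several judgments and return the most frequent verdict ("A" or "B")."""
        verdicts = [judge(instruction, response_a, response_b)[1] for _ in range(num_votes)]
        return Counter(verdicts).most_common(1)[0][0]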

Role of LLM-Based Evaluators in Model Development Lifecycle
The paper also highlights the important role of LLM-based evaluators in the model development lifecycle, both at training time as reward models and at inference time as an alternative to human evaluation. It discusses the benefits of synthetic data for aligning models, improving their capabilities, and teaching them new skills. The authors compare the proposed self-taught evaluator against other existing evaluators and demonstrate its advantages through extensive experiments.

Potential Implications and Data Sources for Model Evaluation
Furthermore, the paper explores the broader implications of the approach, including its potential to support the scientific research process by enabling better evaluation techniques overall. It also discusses sources of synthetic data that have been used to evaluate tasks such as factuality, safety, coding, and general instruction following, and that show strong correlation with real human judgments. The authors additionally analyze the effect of combining synthetic and human-labeled preference data, offering insight into how different data sources affect model performance. Overall, the paper presents an effective and efficient approach for improving evaluators without human annotations, addressing the limitations of traditional evaluation methodologies and opening up new possibilities for model development and evaluation.

Reference: https://arxiv.org/abs/2408.02666