Key Points
- The paper introduces Prometheus 2, an open-source language model specialized in evaluating other language models. As an open model it addresses concerns about transparency, controllability, and affordability; it supports both direct assessment and pairwise ranking evaluation formats, incorporates user-defined evaluation criteria, and closely mirrors human and proprietary LM judgments.
- The paper highlights the challenges in evaluating the quality of outputs produced by language models due to the diverse distribution of text and complex tasks. Language model-based evaluation has emerged as a scalable and cost-effective paradigm, with proprietary LMs demonstrating high correlation with human evaluations and increased speed and cost-effectiveness. However, reliance on proprietary LMs for evaluation poses challenges such as lack of transparency, controllability, and affordability.
- Existing open evaluator LMs exhibit critical shortcomings, issuing scores significantly divergent from those assigned by humans and lacking the flexibility to perform both direct assessment and pairwise ranking. Additionally, they lack the ability to evaluate based on custom evaluation criteria and instead focus on general attributes like helpfulness and harmlessness.
- The paper proposes a unified evaluator LM by merging the weights of two evaluator LMs trained separately on direct assessment and pairwise ranking formats. It introduces the PREFERENCE COLLECTION, a new fine-grained pairwise ranking feedback dataset, and demonstrates the performance of Prometheus 2 models on direct assessment and pairwise ranking benchmarks, achieving high correlations and agreement with humans and proprietary LM judges.
- The paper discusses using language models to mimic human evaluation of other models' generation capabilities. It addresses the need for open evaluator LMs that can operate under different evaluation schemes and shows how Prometheus 2 bridges this gap by excelling at both direct assessment and pairwise ranking.
- The study presents a new recipe for training a unified evaluator LM based on merging the weights of models trained for direct assessment and pairwise ranking. It provides detailed explanations of direct assessment, pairwise ranking, joint training, weight merging, prompting, and single-format training in the context of evaluator LM training.
- The paper evaluates the Prometheus 2 models on various benchmarks, showing high correlation with human and proprietary LM judgments, strong consistency across evaluation formats, and the effectiveness of weight merging in enhancing evaluation capabilities. It compares weight merging against joint training and single-format training, highlighting the superiority of weight merging in achieving positive task transfer.
- The study poses research questions and reports experimental findings on the effectiveness of weight merging compared to joint training, the impact of model ensembling, and whether training on direct assessment feedback data improves pairwise ranking accuracy and vice versa. It concludes with the study's contributions and proposed directions for future research.
Summary
The research paper introduces Prometheus 2, an open-source language model specialized in evaluating other language models. The paper discusses the limitations of existing open evaluator LMs, such as scores that diverge from human assessments and an inability to accommodate custom evaluation criteria. It highlights the features of Prometheus 2, including its ability to closely mirror human and GPT-4 judgments, to process both direct assessment and pairwise ranking formats, and to exhibit high correlation and agreement with humans and proprietary LM judges. The paper emphasizes the public availability of the models, code, and data.
The paper introduces the concept of language model-based evaluation, which has emerged as a scalable and cost-effective paradigm for assessing LM-generated text. It discusses the challenges of relying on proprietary LMs for evaluation, including transparency, controllability, and affordability issues. To address these challenges, the paper focuses on developing open-access, transparent, and controllable evaluator LMs.
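To make the two evaluation formats concrete, the sketch below shows hypothetical prompt templates for direct assessment (verbal feedback plus a 1-5 score against a user-defined rubric) and pairwise ranking (choosing the better of two responses under a user-defined criterion). The templates are illustrative assumptions, not the exact prompts released with Prometheus 2.

```python
def direct_assessment_prompt(instruction: str, response: str, rubric: str) -> str:
    """Ask an evaluator LM for verbal feedback plus a 1-5 score against a custom rubric
    (hypothetical template; the actual Prometheus 2 prompts are in the released code)."""
    return (
        "You are an evaluator. Assess the response to the instruction below.\n"
        f"### Instruction:\n{instruction}\n"
        f"### Response:\n{response}\n"
        f"### Evaluation criteria:\n{rubric}\n"
        "Write feedback, then give a score from 1 to 5."
    )


def pairwise_ranking_prompt(instruction: str, response_a: str, response_b: str, criterion: str) -> str:
    """Ask an evaluator LM which of two responses better satisfies a custom criterion."""
    return (
        "You are an evaluator. Compare the two responses to the instruction below.\n"
        f"### Instruction:\n{instruction}\n"
        f"### Response A:\n{response_a}\n"
        f"### Response B:\n{response_b}\n"
        f"### Evaluation criteria:\n{criterion}\n"
        "Write feedback, then answer which response is better: A or B."
    )
```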
Training a Unified Evaluator LM
The paper presents a new recipe for training a unified evaluator LM by merging the weights of models trained on direct assessment and pairwise ranking. It introduces a new fine-grained pairwise ranking feedback dataset, called the PREFERENCE COLLECTION, which includes over 1,000 evaluation criteria beyond basic qualities such as helpfulness and harmlessness. The paper demonstrates that weight merging can result in an evaluator LM that not only functions in both assessment formats but also outperforms single-format trained LMs.
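The following is a minimal sketch of the weight-merging idea, assuming two evaluator checkpoints fine-tuned from the same base model (one on direct assessment data, one on pairwise ranking data). The helper name and the single linear mixing coefficient `alpha` are illustrative; the paper also examines other merging strategies.

```python
def merge_evaluators(direct_state: dict, pairwise_state: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate the parameters of two evaluator LMs that share a base
    architecture: one fine-tuned for direct assessment, one for pairwise ranking.
    Values are assumed to be torch tensors (or numpy arrays) keyed by parameter name."""
    merged = {}
    for name, w_direct in direct_state.items():
        w_pairwise = pairwise_state[name]
        merged[name] = alpha * w_direct + (1.0 - alpha) * w_pairwise
    return merged


# Usage sketch (hypothetical model objects sharing the same base architecture):
# merged_state = merge_evaluators(direct_model.state_dict(), pairwise_model.state_dict())
# unified_model.load_state_dict(merged_state)
```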
Experimental Results and Analysis
The research paper presents detailed experimental results, including benchmarks and metrics employed to assess the evaluation capabilities of evaluator LMs, such as direct assessment results, pairwise ranking results, and consistency across evaluation formats. It also discusses the effectiveness of weight merging compared to joint training, as well as the significance of merging models trained with different formats. Additionally, the paper explores the impact of training on direct assessment feedback data on pairwise ranking accuracy and vice versa.
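As a rough illustration of these metrics, the sketch below computes a correlation between evaluator and reference scores for direct assessment and a simple agreement rate for pairwise verdicts. The function names and inputs are hypothetical stand-ins for the benchmark-specific evaluation described in the paper.

```python
from scipy.stats import pearsonr


def direct_assessment_correlation(evaluator_scores, human_scores):
    """Pearson correlation between evaluator-assigned and reference (human or GPT-4) scores;
    the paper also reports other rank correlations for direct assessment."""
    r, _ = pearsonr(evaluator_scores, human_scores)
    return r


def pairwise_agreement(evaluator_choices, reference_choices):
    """Fraction of test pairs where the evaluator picks the same response
    ('A' or 'B') as the reference judgment."""
    matches = sum(e == r for e, r in zip(evaluator_choices, reference_choices))
    return matches / len(reference_choices)
```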
In conclusion, the paper introduces Prometheus 2 as an open-source language model specialized in evaluating other language models, addressing the limitations of existing open evaluator LMs. It demonstrates the effectiveness of weight merging and the significance of unifying different evaluation formats in training a robust unified evaluator LM. The paper emphasizes the importance of using open-source language models for fair and accessible evaluations, encouraging further research in this area.
Reference: https://arxiv.org/abs/2405.015...