Key Points

1. Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains, but improving these models traditionally relies on costly human data.

2. Recent self-rewarding mechanisms have shown that LLMs can improve by judging their own responses instead of relying on human labelers.

3. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training.

4. The paper introduces a novel Meta-Rewarding step into the self-improvement process, in which the model judges its own judgments and uses that feedback to refine its judgment skills.

5. This unsupervised approach improves the model's ability to judge and follow instructions, as demonstrated by significant win rate improvements on the AlpacaEval 2 and Arena-Hard benchmarks.

6. The meta-judge enables the creation of preference pairs over judgments, in addition to preference pairs over actor responses, so that both the acting and judging skills of the model can be trained (see the sketch after this list).

7. The paper also introduces a length-control mechanism to address the issue of response length explosion when training with AI feedback.

8. Experimental results show that Meta-Rewarding outperforms strong baselines such as Self-Rewarding and SPPO, without relying on additional human feedback or a large external reward model.

9. The findings point to the potential of self-improving models without human supervision and highlight the importance of improving a language model's acting and judging skills simultaneously.
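
To make the training loop concrete, below is a minimal Python sketch of one Meta-Rewarding iteration, assuming the same model is reached through hypothetical callables: generate (actor role), judge_score and judge_text (judge role), and meta_judge (meta-judge role). The prompts, parsing, and selection rules are illustrative assumptions rather than the paper's exact implementation; the resulting actor and judge preference pairs would then feed a DPO-style preference-optimization step.

    # Hypothetical sketch of one Meta-Rewarding iteration; not the authors' code.
    # `generate`, `judge_score`, `judge_text`, and `meta_judge` stand in for calls
    # to the same LLM acting as actor, judge, and meta-judge respectively.
    from dataclasses import dataclass
    from statistics import mean
    from typing import Callable, List, Tuple


    @dataclass
    class PreferencePair:
        prompt: str
        chosen: str
        rejected: str


    def meta_rewarding_iteration(
        prompts: List[str],
        generate: Callable[[str], str],              # actor: prompt -> response
        judge_score: Callable[[str, str], float],    # judge: (prompt, response) -> 0-5 score
        judge_text: Callable[[str, str], str],       # judge: full written judgment
        meta_judge: Callable[[str, str, str], int],  # meta-judge: 0 if first judgment wins, else 1
        k_responses: int = 4,
        n_judgments: int = 3,
    ) -> Tuple[List[PreferencePair], List[PreferencePair]]:
        """Build actor and judge preference pairs from the model's own outputs."""
        actor_pairs: List[PreferencePair] = []
        judge_pairs: List[PreferencePair] = []
        for prompt in prompts:
            # 1) Actor: sample several candidate responses.
            responses = [generate(prompt) for _ in range(k_responses)]

            # 2) Judge: score each response several times and average to reduce noise.
            scores = [mean(judge_score(prompt, r) for _ in range(n_judgments))
                      for r in responses]

            # 3) Actor preference pair: highest- vs. lowest-scoring response.
            best = max(range(k_responses), key=lambda i: scores[i])
            worst = min(range(k_responses), key=lambda i: scores[i])
            if scores[best] > scores[worst]:
                actor_pairs.append(PreferencePair(prompt, responses[best], responses[worst]))

            # 4) Meta-judge: compare two written judgments of the same response and
            #    keep the preferred one as the "chosen" judgment.
            judgment_a = judge_text(prompt, responses[best])
            judgment_b = judge_text(prompt, responses[best])
            if meta_judge(prompt, judgment_a, judgment_b) == 0:
                judge_pairs.append(PreferencePair(prompt, judgment_a, judgment_b))
            else:
                judge_pairs.append(PreferencePair(prompt, judgment_b, judgment_a))

        # Both pair sets would then be used for DPO-style preference training.
        return actor_pairs, judge_pairs

The sketch only captures the overall data flow; in the actual method the judge and meta-judge use specific evaluation prompts and additional filtering when selecting chosen and rejected examples.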

Summary

The paper introduces a novel approach called "Meta-Rewarding" that improves the judgment capabilities of large language models (LLMs) through a self-rewarding mechanism. Existing methods have focused on improving model responses rather than judgments, leading to rapid saturation during iterative training. The researchers therefore add a Meta-Rewarding step to the self-improvement process: the model judges its own judgments and uses that feedback to refine its judgment skills. This unsupervised approach aims to improve the model's ability both to judge and to follow instructions, and the results show significant gains. On the AlpacaEval 2 benchmark, the model's length-controlled win rate increased from 22.9% to 39.4%, approaching the level of the powerful GPT-4 model. The model also showed substantial gains on Arena-Hard, a benchmark that targets complex and challenging questions.
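
As an illustration of what the judge role might look like in practice (a sketch under assumptions, not the paper's exact prompt or parser), the snippet below extracts a 0-5 score from a written judgment that ends with a line such as "Score: 4" and averages several sampled judgments of the same response. Such a function could serve as the judge_score callable in the earlier sketch.

    # Hypothetical self-judging helper; the "Score: N" format and regex are assumptions.
    import re
    from statistics import mean
    from typing import Callable, List, Optional

    SCORE_RE = re.compile(r"[Ss]core:\s*([0-5])")


    def parse_score(judgment: str) -> Optional[int]:
        """Extract a 0-5 score from a judge's written evaluation, if present."""
        match = SCORE_RE.search(judgment)
        return int(match.group(1)) if match else None


    def self_judge_score(
        prompt: str,
        response: str,
        judge_text: Callable[[str, str], str],  # LLM call in the judge role (assumed)
        n_samples: int = 3,
    ) -> float:
        """Sample several judgments of the same response and average the parsed scores."""
        scores: List[int] = []
        for _ in range(n_samples):
            parsed = parse_score(judge_text(prompt, response))
            if parsed is not None:
                scores.append(parsed)
        return mean(scores) if scores else 0.0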

Results

Importantly, the meta-judge training helped the model improve its judging capabilities. Evaluations show that the model's judgments became better correlated with both human preferences and a strong AI judge (GPT-4). This suggests the potential for self-improving models without human supervision.

Addressing Length Bias

The paper also addresses length bias in the judging process. A length-control mechanism is introduced to ensure a balance between comprehensiveness and conciseness in the model's responses; a simple illustrative selection rule is sketched below.
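
One simple way such a mechanism could work (an illustrative assumption, not necessarily the paper's exact rule) is to prefer the shortest response among those whose judge score is within a small tolerance of the best score when building the "chosen" side of a preference pair:

    # Hypothetical length-control rule; the tolerance value and tie-breaking are assumptions.
    from typing import List, Tuple


    def length_controlled_choice(
        scored_responses: List[Tuple[str, float]],  # (response_text, judge_score)
        tolerance: float = 0.1,
    ) -> str:
        """Pick the shortest response whose score is close to the maximum score."""
        best_score = max(score for _, score in scored_responses)
        near_best = [text for text, score in scored_responses
                     if score >= best_score - tolerance]
        return min(near_best, key=len)


    # A slightly lower-scoring but much shorter answer is preferred, which
    # discourages the length explosion seen when training on AI feedback.
    candidates = [("A very long, heavily padded answer that restates the question...", 4.7),
                  ("A concise, direct answer.", 4.65)]
    print(length_controlled_choice(candidates))  # -> "A concise, direct answer."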

Conclusion

Overall, the results demonstrate the effectiveness of the Meta-Rewarding approach in enhancing an LLM's judgment and instruction-following abilities in an unsupervised manner. This work suggests promising directions for achieving "super alignment," where models can improve themselves beyond human-level capabilities.

Reference: https://arxiv.org/abs/2407.19594