Key Points

1. Optimizing chain-of-thought solutions only for answer correctness can make them less legible to humans.

2. To mitigate this loss of legibility, the paper proposes a training algorithm inspired by the Prover-Verifier Game framework from prior work.

3. The algorithm iteratively trains small verifiers to predict solution correctness, "helpful" provers to produce correct solutions that the verifier accepts, and "sneaky" provers to produce incorrect solutions that fool the verifier (a minimal sketch of the resulting incentives follows this list).

4. The helpful prover's accuracy and the verifier's robustness to adversarial attacks increase over the course of training.

5. Legibility training transfers to time-constrained humans tasked with verifying solution correctness. Over the course of training, human accuracy increases when checking the helpful prover's solutions, and decreases when checking the sneaky prover's solutions.

6. Training for checkability by small verifiers is proposed as a plausible technique for increasing the legibility of large language models to humans, which could help with alignment of superhuman models.

7. Legibility training against small verifiers reveals a tension between achieving high performance and retaining legibility in the solutions.

8. The verifier becomes more robust to adversarial attacks over the course of training, as the sneaky prover produces increasingly subtle flaws.

9. The helpful prover becomes increasingly legible to time-constrained humans over the course of training, while maintaining high accuracy.
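To make the incentives behind point 3 concrete, here is a minimal sketch of a role-conditioned prover reward. It is an illustration rather than the paper's actual reward formulation, and every name in it (`prover_reward`, `verifier_score`, the hand-written `episodes`) is a hypothetical stand-in: the prover is rewarded for convincing the verifier only when the solution's correctness matches its assigned role.

```python
# Illustrative only: a simplified, role-conditioned prover reward.
# The helpful prover earns reward for correct solutions the verifier
# accepts; the sneaky prover earns reward for incorrect solutions that
# nonetheless convince the verifier.

def prover_reward(role: str, is_correct: bool, verifier_score: float) -> float:
    """Reward for one sampled solution.

    role           -- "helpful" or "sneaky", assigned per episode
    is_correct     -- ground-truth correctness of the final answer
    verifier_score -- the small verifier's acceptance score in [0, 1]
    """
    wants_correct = (role == "helpful")
    if is_correct == wants_correct:
        # Solution matches the assigned role: reward scales with how
        # convincing the verifier finds it.
        return verifier_score
    # Solution contradicts the role (e.g. the helpful prover answered
    # incorrectly): no positive reward, however convincing it looks.
    return 0.0

# Hand-written episodes showing the intended incentives.
episodes = [
    ("helpful", True,  0.9),  # correct and convincing           -> high reward
    ("helpful", False, 0.8),  # wrong but convincing             -> no reward
    ("sneaky",  False, 0.7),  # subtle flaw fools the verifier   -> rewarded
    ("sneaky",  True,  0.6),  # sneaky prover answered correctly -> no reward
]
for role, correct, score in episodes:
    print(f"{role:7s} correct={correct!s:5s} score={score:.1f} "
          f"reward={prover_reward(role, correct, score):.1f}")
```

Under this framing, the sneaky role acts as an adversarial data generator: the flawed-but-convincing solutions it is rewarded for finding are exactly what forces the verifier to become more robust in later rounds (key points 4 and 8).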

Summary

1. The authors observe that directly optimizing LLM outputs for correctness can lead to high performance but poor legibility: the solutions become difficult for time-constrained humans to evaluate. Retaining legibility instead requires giving up some of that performance, a cost referred to as the "legibility tax".

2. To address this, the authors propose a training algorithm inspired by the Prover-Verifier Game framework. A smaller "verifier" model is iteratively trained to predict the correctness of solutions, while a larger prover model is trained in two roles: a "helpful" role that generates correct, convincing solutions the verifier accepts, and a "sneaky" role that generates incorrect but convincing solutions intended to fool the verifier (a toy sketch of the verifier update follows this list).

3. The authors find that through this checkability training, the verifier becomes more robust to adversarial attacks from the sneaky prover over successive rounds, while the helpful prover produces solutions that remain checkable by the verifier and become increasingly legible to time-constrained human judges.

4. The authors argue that this type of legibility training against smaller verifier models is a promising approach for increasing the legibility and alignment of outputs from superhuman LLMs. By making the reasoning behind an LLM's outputs more transparent and verifiable, such training can help build trust and facilitate human oversight even as model capabilities exceed those of humans.

5. The authors also discuss the importance of balancing the tradeoff between model performance and legibility, and suggest potential future directions like using unsupervised signals for legibility and separating the model's internal reasoning from the final output.
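At the level of the verifier update described in summary point 2, a toy version can be sketched under heavy simplifying assumptions: the verifier is reduced to a binary classifier over (problem, solution) pairs labeled with ground-truth correctness, and random embeddings stand in for the features a small language model would actually produce. None of the sizes, losses, or names below come from the paper.

```python
# Toy sketch: train a "verifier" to predict whether a solution is correct.
# Random embeddings stand in for language-model features of
# (problem, solution) pairs sampled from the helpful and sneaky provers.
import torch
import torch.nn as nn

torch.manual_seed(0)

embeddings = torch.randn(256, 128)               # stand-in solution features
labels = torch.randint(0, 2, (256, 1)).float()   # 1 = correct solution

verifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(verifier.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(verifier(embeddings), labels)  # predict correctness
    loss.backward()
    optimizer.step()

# Sigmoid of the verifier's output is an acceptance score in [0, 1],
# i.e. the quantity the prover's reward depends on.
acceptance = torch.sigmoid(verifier(embeddings))
```

In the iterative scheme summarized above, each round would retrain such a verifier on solutions sampled from the provers of earlier rounds, so the sneaky prover's near-misses become hard negative examples and the verifier's robustness grows (summary point 3).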

In summary, this work proposes a novel training approach inspired by Prover-Verifier Games to improve the legibility and alignment of LLM outputs, which has important implications for building trust and facilitating human oversight as these models become increasingly capable.

Reference: https://arxiv.org/abs/2407.13692