Key Points

1. Efficient exploration in gathering human feedback substantially improves large language models, enabling high levels of performance with far fewer queries.

2. Large language models already demonstrate remarkable capabilities, yet reinforcement learning from human feedback (RLHF) significantly improves their behavior with only tens of thousands of interactions.

3. Active exploration through double Thompson sampling accelerates learning and achieves higher win rates than both passive exploration and other active exploration algorithms.

4. Conventional practice in RLHF sends prompts paired with responses to human raters; a reward model is then fit to the accumulated feedback and used to steer subsequent responses toward what raters have preferred so far (see the reward-model sketch after this list).

5. Active exploration algorithms such as Boltzmann exploration and infomax dramatically improve performance over passive exploration, with the strongest gains coming from methods that leverage the uncertainty estimates offered by an epistemic neural network.

6. The advantage that efficient exploration confers on large language models grows with the volume of human feedback.

7. The quality of uncertainty estimates is assessed in terms of dyadic joint negative-log loss, and double Thompson sampling (TS) tends to converge on better responses than the alternatives because it makes fuller use of these uncertainty estimates.

8. Further research directions include exploring alternative architectures for epistemic neural networks, improving reward model architectures, and investigating efficient exploration in multiturn dialog scenarios.

9. The study is the first to demonstrate substantial benefits of active exploration in tuning large language models, and it suggests there is much room for further work in this area.

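To make point 4 concrete, the following is a minimal sketch of fitting a reward model to pairwise preference feedback with a Bradley-Terry style logistic loss. The PyTorch usage, the embedding-based model, and all names here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (prompt, response) embedding to a scalar reward.
    Embedding dimension and MLP width are illustrative assumptions."""
    def __init__(self, embed_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(model: RewardModel,
                    emb_a: torch.Tensor,
                    emb_b: torch.Tensor,
                    prefers_a: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry logistic loss on pairwise preferences:
    P(rater prefers a over b) = sigmoid(r(a) - r(b))."""
    logits = model(emb_a) - model(emb_b)
    return F.binary_cross_entropy_with_logits(logits, prefers_a.float())

# Usage: one gradient step on a batch of rated response pairs (stand-in data).
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
emb_a, emb_b = torch.randn(32, 512), torch.randn(32, 512)  # stand-in embeddings
prefers_a = torch.randint(0, 2, (32,))                      # 1 if rater picked a
loss = preference_loss(model, emb_a, emb_b, prefers_a)
loss.backward()
optimizer.step()
```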
Summary

The research paper presents evidence of the substantial benefit of efficient exploration in gathering human feedback to improve large language models (LLMs). The experiments use an agent that sequentially generates queries for human raters and fits a reward model to the feedback received. The best-performing agent uses double Thompson sampling with uncertainty estimates from an epistemic neural network. The results demonstrate that efficient exploration enables high levels of performance with far fewer queries, and both uncertainty estimation and the choice of exploration scheme are found to play critical roles.
The study compares passive exploration with several active exploration algorithms: Boltzmann exploration, infomax, and double Thompson sampling. The empirical results show that active exploration, especially double Thompson sampling, accelerates learning and achieves higher win rates than passive exploration. In double Thompson sampling, each of the two responses shown to a rater is chosen to maximize a reward function sampled independently from the epistemic neural network, so the selected pair reflects the model's epistemic uncertainty (a sketch follows below). Furthermore, the paper indicates that the advantage of efficient exploration scales with the volume of human feedback, potentially accelerating the attainment of superhuman creativity by decades.
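
As a rough illustration of how double Thompson sampling can select a pair of responses to show a rater, here is a minimal sketch. The callable sample_reward_fn, the retry cap, and the random fallback are assumptions for illustration, not the paper's exact agent.

```python
import numpy as np

def double_thompson_sampling(candidate_embs, sample_reward_fn, max_attempts=10, rng=None):
    """Select two distinct responses for a rater. Each response maximizes a
    reward function drawn independently from the epistemic neural network,
    so the chosen pair reflects epistemic uncertainty about which response
    is best. `sample_reward_fn(rng)` is an assumed callable returning one
    sampled reward function (embedding -> scalar)."""
    rng = rng or np.random.default_rng()

    def best_under_sampled_reward():
        reward_fn = sample_reward_fn(rng)
        scores = np.array([reward_fn(e) for e in candidate_embs])
        return int(np.argmax(scores))

    first = best_under_sampled_reward()
    for _ in range(max_attempts):
        second = best_under_sampled_reward()
        if second != first:
            return first, second
    # Fallback (an assumption): pick a random distinct candidate if the
    # sampled reward functions keep agreeing on the same response.
    others = [i for i in range(len(candidate_embs)) if i != first]
    return first, int(rng.choice(others))

# Usage with toy candidate embeddings and a toy linear sampled reward function.
embs = [np.random.randn(512) for _ in range(8)]
sample_fn = lambda rng: (lambda e, w=rng.standard_normal(512): float(w @ e))
i, j = double_thompson_sampling(embs, sample_fn)
```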

Scaling of Efficient Exploration with Feedback Volume
The findings also show that the advantage of efficient exploration grows with the volume of feedback, reducing data requirements by roughly an order of magnitude as the feedback data grows. The study assesses the quality of uncertainty estimates in terms of dyadic joint negative-log loss (sketched below) and examines how the rewards that models assign to responses evolve, demonstrating the benefits of active exploration, particularly when it uses uncertainty estimates offered by an epistemic neural network.
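
As a rough sketch of how a dyadic joint negative-log loss can be estimated for an epistemic neural network, the code below marginalizes the joint likelihood of both preference labels in each dyad over sampled epistemic indices. The callable prob_fn, the Gaussian epistemic index, and the sample count are assumptions for illustration.

```python
import numpy as np

def dyadic_joint_nll(prob_fn, dyads, n_index_samples=100, rng=None):
    """Estimate dyadic joint negative-log loss: for each pair (dyad) of
    preference queries, average the joint probability assigned to both
    observed labels over epistemic indices z drawn from a reference
    distribution, then take the negative log. `prob_fn(query, z)` is an
    assumed callable giving the probability of the observed label for
    `query` under epistemic index `z`."""
    rng = rng or np.random.default_rng()
    losses = []
    for query_1, query_2 in dyads:
        joint_probs = []
        for _ in range(n_index_samples):
            z = rng.standard_normal()  # illustrative reference distribution
            joint_probs.append(prob_fn(query_1, z) * prob_fn(query_2, z))
        losses.append(-np.log(np.mean(joint_probs) + 1e-12))
    return float(np.mean(losses))
```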

Areas for Further Research
The paper concludes by highlighting areas for further research, including alternative epistemic neural network architectures, improved reward model architectures, and algorithms for efficient exploration in multiturn dialog. Overall, the research paper provides compelling evidence of the substantial benefits of efficient exploration in improving large language models and opens up promising avenues for future work in this area.

Reference: https://arxiv.org/abs/2402.003...