Key Points
1. As language models advance, there is growing interest in using them to build intelligent agents that automate the complete scientific discovery process, a prospect that has generated both excitement and skepticism about such agents' true capabilities.
2. The authors argue that to achieve automated scientific discovery, agents must be able to complete every essential task in the workflow. They therefore call for rigorous evaluation of agents on individual workflow tasks before any claims of end-to-end automation are made.
3. To this end, the authors proposed ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery.
4. To ensure the scientific authenticity and real-world relevance of the benchmark, the authors extracted 102 tasks from 44 peer-reviewed papers and invited 9 domain experts to verify them.
5. The authors unified the target output of each task into a self-contained Python program file and adopted multiple evaluation metrics to examine the generated programs, their execution results, and their cost (see the sketch after this list).
6. Each task was manually verified in multiple rounds to ensure annotation quality and scientific plausibility. The authors also proposed two effective strategies to mitigate data contamination.
7. The authors evaluated 5 open-source and proprietary language models, each with 3 agent frameworks: direct prompting, OpenHands CodeAct, and self-debugging. Even given 3 attempts per task, the best-performing agent could solve only 32.4% of the tasks independently and 34.3% with expert-provided knowledge.
8. These results highlight the limited capabilities of current language agents for generating code for data-driven discovery, let alone automating end-to-end scientific research.
9. Despite this modest performance, the authors argue that language agents have significant potential to augment the productivity of human scientists: for each task in the benchmark, trained annotators take 2.5-3 hours on average to adapt a program from publicly available source code, whereas language agents can typically generate a meaningful program draft within 10 minutes.
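To make the graded evaluation in point 5 concrete, the following is a minimal sketch of how one generated program might be checked on several axes (execution, result correctness, similarity to the reference code, and cost). All function and field names are illustrative assumptions, not the benchmark's actual interface.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskResult:
    """Illustrative record for one task (field names are hypothetical)."""
    program_valid: bool     # did the generated program run to completion?
    task_success: bool      # does its saved output satisfy the task-specific check?
    code_similarity: float  # similarity of the code to the annotated reference program
    api_cost_usd: float     # LLM API spend incurred while generating the program

def evaluate_task(program_path: str,
                  output_path: str,
                  check_output: Callable[[str], bool],
                  score_against_reference: Callable[[str], float],
                  api_cost_usd: float) -> TaskResult:
    """Grade a generated, self-contained Python program on several axes."""
    # 1) Execution check: the program must run end-to-end on its own.
    proc = subprocess.run([sys.executable, program_path],
                          capture_output=True, timeout=600)
    program_valid = proc.returncode == 0

    # 2) Result check: a task-specific function inspects the program's saved
    #    output (e.g., a figure or a CSV of predictions).
    task_success = program_valid and check_output(output_path)

    # 3) Program check: compare the generated code with the expert-annotated
    #    reference program via some code-similarity metric.
    code_similarity = score_against_reference(program_path)

    return TaskResult(program_valid, task_success, code_similarity, api_cost_usd)
```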
Summary
The results show that the capabilities of current language agents remain limited: the best-performing agent solves only 32.4% of the tasks independently and 34.3% with expert-provided knowledge. These findings underscore how limited current language agents are at generating code for data-driven discovery, let alone at automating end-to-end scientific research.
The study built ScienceAgentBench around three key design principles: 1) ensuring the scientific authenticity of tasks through co-design with subject-matter experts; 2) adopting rigorous, multi-level evaluation by unifying task outputs into self-contained Python programs and applying multiple evaluation metrics; 3) applying multi-stage quality control, including expert verification and measures to mitigate data contamination. The study also comprehensively evaluated five open-source and proprietary language models under three agent frameworks (direct prompting, OpenHands CodeAct, and self-debugging).
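As an illustration of the self-debugging framework mentioned above, here is a minimal sketch of a generate-run-repair loop with the three-attempt budget reported in the evaluation. The `llm` callable, prompt wording, and helper names are assumptions for illustration, not the paper's actual implementation.

```python
import subprocess
import sys
import tempfile
from typing import Callable

def self_debug(llm: Callable[[str], str], task_instruction: str,
               max_attempts: int = 3) -> str:
    """Draft a program, run it, and feed any error back to the model for revision."""
    prompt = task_instruction
    program = ""
    for _ in range(max_attempts):
        program = llm(prompt)  # `llm` maps a prompt to Python source code

        # Write the candidate program to a temporary file and execute it.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=600)
        if proc.returncode == 0:
            return program  # ran cleanly; hand it off to the graded evaluation

        # On failure, ask the model to repair its own code using the traceback.
        prompt = (f"{task_instruction}\n\nYour previous program failed with:\n"
                  f"{proc.stderr}\n\nReturn a corrected, self-contained program.")

    return program  # best effort after exhausting the attempt budget
```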
The results show that even with the best agent frameworks and models, language agents have serious limitations in completing data-driven discovery tasks independently, underscoring that they still have a long way to go toward end-to-end scientific research automation. In summary, ScienceAgentBench provides a high-quality benchmark for objectively evaluating and continually improving future language agents, deepening our understanding of their strengths and weaknesses and helping build more useful assistants for scientists.
Reference: https://arxiv.org/abs/2410.05080