Key Points
- The paper introduces GPQA, a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry, written by domain experts and designed to assess both human and AI performance.
- The questions are high quality and extremely difficult: expert assessment estimates that 74% have objectively correct answers, while skilled non-experts with unrestricted internet access average only 34% accuracy despite spending over 30 minutes per question.
- State-of-the-art AI systems also struggle: the strongest GPT-4-based baseline reaches 39% accuracy, underscoring the dataset's difficulty and its suitability for scalable oversight experiments.
- The dataset creation process involves multiple stages of question writing, expert validation, question revision, and non-expert validation, with pay structured to incentivize accuracy and high-quality feedback.
- Two subsets are defined: the main set of 448 questions and the diamond set of 198 questions, selected from the full extended set according to expert and non-expert validation results and stringent criteria for objectivity and difficulty.
- On the extended set, the study reports first expert validator accuracy of 66.5% ± 4.0%, second expert validator accuracy of 64.8% ± 4.0%, and non-expert validator accuracy of 34.1% ± 2.3% (see the error-bar sketch after this list).
- The dataset is intended for scalable oversight experiments aimed at developing protocols for supervising superhuman AI systems; the authors acknowledge limitations including its small size, the use of specialized non-experts, possible biases, and uncertain relevance to truly superhuman systems.
- GPQA is expected to remain useful for scalable oversight experiments as model capabilities improve, since it is substantially harder than existing QA benchmarks.
- The study is supported by financial and in-kind contributions from various sources, and the authors acknowledge the valuable input of the workers involved in creating the dataset.
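The accuracy figures above are sample proportions reported with error estimates. The sketch below shows one simple way to compute such an estimate, a plain binomial standard error; this is illustrative only, the paper's own error bars may use a different estimator, and the counts in the usage example are hypothetical.

```python
import math

def accuracy_with_se(n_correct: int, n_total: int) -> tuple[float, float]:
    """Return sample accuracy and a simple binomial standard error.

    Illustrative only: the paper's reported error bars may be computed
    differently, so these values need not match the figures quoted above.
    """
    p = n_correct / n_total
    se = math.sqrt(p * (1 - p) / n_total)
    return p, se

# Hypothetical example: 363 correct answers out of 546 attempts (about 66.5%).
acc, se = accuracy_with_se(363, 546)
print(f"accuracy = {acc:.1%} +/- {se:.1%}")
```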
Summary
The research paper presents a new dataset called GPQA, consisting of 448 multiple-choice questions in physics, chemistry, and biology. The questions are designed to be extremely difficult, challenging both highly skilled non-experts and state-of-the-art AI systems. The authors validated the questions by having experts and non-experts attempt to answer them, aiming to determine whether they are "Google-proof": expert validators reached 65% accuracy, showing that the questions are hard even for specialists, while skilled non-experts achieved only 34% accuracy despite averaging over 30 minutes per question with unrestricted web access.
This level of difficulty makes the dataset suitable for scalable oversight research. The authors also evaluated baseline AI models, with the strongest GPT-4-based system achieving 39% accuracy. The paper discusses the challenges and methodology of scalable oversight experiments, as well as the dataset's limitations as a small, specialized benchmark, and argues that GPQA can support realistic scalable oversight experiments to help develop supervision protocols for superhuman AI systems.
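As a concrete illustration of this kind of multiple-choice baseline, the sketch below runs a zero-shot evaluation with a chat model through the OpenAI API. It is a minimal sketch, not the paper's actual setup: the `questions` structure, the sample item, the prompt wording, and the naive answer-extraction rule are all assumptions, and the paper's own baselines use different prompting configurations.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical dataset format: each item has a question, four labelled
# choices, and the letter of the correct answer.
questions = [
    {
        "question": "Which quantity is conserved in an elastic collision?",
        "choices": {"A": "Kinetic energy", "B": "Heat", "C": "Entropy", "D": "Charge"},
        "answer": "A",
    },
]

def ask(item: dict, model: str = "gpt-4") -> str:
    """Pose one multiple-choice question and return the letter the model picks."""
    options = "\n".join(f"({k}) {v}" for k, v in item["choices"].items())
    prompt = (
        f"{item['question']}\n{options}\n"
        "Answer with the letter of the single best option."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = reply.choices[0].message.content
    match = re.search(r"[ABCD]", text)  # naive answer extraction
    return match.group(0) if match else ""

correct = sum(ask(item) == item["answer"] for item in questions)
print(f"accuracy: {correct / len(questions):.1%}")
```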
The paper highlights the importance of high-quality data collection for natural language tasks and its value in evaluating AI systems' capabilities. It also acknowledges the dataset's limitations and emphasizes the need for future work on overseeing superhuman systems and developing oversight protocols.
The authors conclude by crediting the workers who contributed their expertise to creating the dataset and by thanking the project's financial and in-kind supporters.
Reference: https://arxiv.org/abs/2311.12022