Key Points

1. The challenge aims to predict interactions between atoms in molecules, specifically the scalar coupling constant: the magnetic interaction between a pair of atoms in a molecule. This interaction is crucial for understanding the molecular composition of tissues and the structure and dynamics of proteins and other molecules, with applications in environmental, pharmaceutical, and materials science. [25]

2. The competition is hosted by CHemistry And Mathematics in Phase Space (CHAMPS), a research program spanning the University of Bristol, Cardiff University, Imperial College, and the University of Leeds. Winning teams will have an opportunity to partner with this multi-university research program on an academic publication. [25]

3. The predictive analytics challenge is to develop an algorithm that accurately predicts scalar coupling constants given only a 3D molecular structure as input. The goal is a fast, reliable method for predicting these interactions, enabling scientists to gain structural insights faster and more cheaply. [25, 26]

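The most basic feature derivable from a 3D structure alone is the distance between the two coupled atoms. The sketch below computes it, assuming the competition's tabular layout: a pairs table with `molecule_name`, `atom_index_0`, and `atom_index_1` columns, and a structures table with per-atom `x`, `y`, `z` coordinates keyed by `molecule_name` and `atom_index`. These column names are assumptions about the provided files, not part of this summary.

```python
import numpy as np
import pandas as pd

def pairwise_distances(pairs: pd.DataFrame, structures: pd.DataFrame) -> pd.Series:
    """Euclidean distance between the two coupled atoms in each pair.

    `pairs` is assumed to have molecule_name, atom_index_0, atom_index_1;
    `structures` is assumed to have molecule_name, atom_index, x, y, z.
    """
    # Index coordinates by (molecule, atom) so each pair endpoint can be looked up.
    coords = structures.set_index(["molecule_name", "atom_index"])[["x", "y", "z"]]
    p0 = coords.loc[list(zip(pairs["molecule_name"], pairs["atom_index_0"]))].to_numpy()
    p1 = coords.loc[list(zip(pairs["molecule_name"], pairs["atom_index_1"]))].to_numpy()
    return pd.Series(np.linalg.norm(p0 - p1, axis=1), index=pairs.index, name="dist")
```

A distance-plus-coupling-type baseline like this is only a starting point; competitive solutions typically build much richer geometric features on top of the same coordinate table.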
4. Submissions are evaluated on the Log of the Mean Absolute Error (Log MAE): the MAE is computed separately for each scalar coupling type, its logarithm is taken, and the results are averaged across types. Lower scores are better, with the minimum (best) possible score for perfect predictions being approximately -20.7232. [4]

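The metric can be reproduced in a few lines. One detail is an assumption: the per-type MAE must be floored at some small value for perfect predictions to yield a finite score, and a floor of 1e-9 is inferred here from the stated best score of about -20.7232 (which equals ln 1e-9).

```python
import numpy as np
import pandas as pd

def group_log_mae(y_true, y_pred, types, floor=1e-9):
    """Log of the MAE per scalar coupling type, averaged across types.

    The per-type MAE is clamped at `floor` (assumed to be 1e-9) so that a
    perfect submission scores log(1e-9), roughly -20.7233.
    """
    df = pd.DataFrame({"true": y_true, "pred": y_pred, "type": types})
    per_type_mae = (df["true"] - df["pred"]).abs().groupby(df["type"]).mean()
    return float(np.log(per_type_mae.clip(lower=floor)).mean())
```

Because the average is taken across types rather than across rows, rare coupling types carry the same weight as common ones, so a model cannot ignore them.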
5. The competition specifies deadlines (in UTC) for entry, disclosure of pre-trained models and external data, team mergers, and final submission. Prizes are awarded to the top finishers, with a 1st place prize of $12,500. [14, 34]

6. The competition provides several data sets for training and testing, covering molecular structures, electric dipole moments, magnetic shielding tensors, Mulliken charges, potential energies, and scalar coupling contributions. Participants must predict the scalar coupling constant for each atom pair in the test molecules, following the specified file formats and content requirements for submissions.


Summary

This paper introduces MLE-bench, a benchmark for evaluating the machine learning engineering capabilities of AI agents. The benchmark consists of 75 curated Kaggle competitions covering a diverse range of machine learning tasks, including natural language processing, computer vision, and signal processing. These competitions were selected to be representative of contemporary machine learning engineering work and to provide a challenging test of real-world skills such as training models, preparing datasets, and running experiments. The authors establish human baselines for each competition using the publicly available Kaggle leaderboards. They then evaluate several frontier language models, including OpenAI's o1-preview and GPT-4o, on MLE-bench using open-source agent scaffolds. The best-performing setup, o1-preview with the AIDE scaffold, achieves at least the level of a Kaggle bronze medal in 16.9% of the competitions on average.

Resource Scaling and Pre-training Impact
The paper also investigates how agent performance scales with resources, including the number of attempts allowed per competition and the amount of time provided per competition, and finds that performance improves significantly when agents are given more attempts or more time. Additionally, the authors examine possible contamination from pre-training and find no evidence that results are systematically inflated by memorization of competition details or winning solutions.

Open-sourcing MLE-bench
Finally, the authors open-source the MLE-bench code to facilitate future research into understanding the machine learning engineering capabilities of AI agents. They emphasize the importance of benchmarks like MLE-bench in evaluating the risks and potential benefits of AI systems capable of autonomous machine learning research and engineering.

Reference: https://arxiv.org/abs/2410.07095