RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:66772-66832, 2025.
Abstract
Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, few evaluations of AI R&D capabilities exist, and none are highly realistic while offering a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, V1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-$k$ with varying time budgets and agent designs, and find that the best AI agents achieve a score 4$\times$ higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2$\times$ the score of the top AI agent when both are given 32 total hours (across different attempts).
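
One way to read the best-of-$k$ comparison is as an estimate of the expected maximum score over $k$ attempts drawn from a pool of recorded runs within the stated time budget. As an illustrative formulation (not necessarily the exact aggregation used in the paper), a standard unbiased estimator of this quantity from $n$ recorded attempts with sorted scores $s_{(1)} \le \dots \le s_{(n)}$ is

$$\widehat{\mathrm{best\text{-}of\text{-}}k} \;=\; \binom{n}{k}^{-1} \sum_{i=k}^{n} \binom{i-1}{k-1}\, s_{(i)},$$

i.e. each score is weighted by the probability that it is the maximum of a uniformly random size-$k$ subset of the $n$ attempts.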