MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

Qian Huang; Jian Vora; Percy Liang; Jure Leskovec

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

Qian Huang, Jian Vora, Percy Liang, Jure Leskovec

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:20271-20309, 2024.

Abstract

A central aspect of machine learning research is experimentation, the process of designing and running experiments, analyzing the results, and iterating towards some positive outcome (e.g., improving accuracy). Could agents driven by powerful language models perform machine learning experimentation effectively? To answer this question, we introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM. For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs. We then construct an agent that can perform ML experimentation based on ReAct framework. We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate. It can build compelling ML models over many tasks in MLAgentBench with 37.5% average success rate. Our agents also display highly interpretable plans and actions. However, the success rates vary considerably; they span from 100% on well-established older datasets to as low as 0% on recent Kaggle challenges created potentially after the underlying LM was trained. Finally, we identify several key challenges for LM-based agents such as long-term planning and reducing hallucination.

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-huang24y,
  title = 	 {{MLA}gent{B}ench: Evaluating Language Agents on Machine Learning Experimentation},
  author =       {Huang, Qian and Vora, Jian and Liang, Percy and Leskovec, Jure},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {20271--20309},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/huang24y/huang24y.pdf},
  url = 	 {https://proceedings.mlr.press/v235/huang24y.html},
  abstract = 	 {A central aspect of machine learning research is experimentation, the process of designing and running experiments, analyzing the results, and iterating towards some positive outcome (e.g., improving accuracy). Could agents driven by powerful language models perform machine learning experimentation effectively? To answer this question, we introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM. For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs. We then construct an agent that can perform ML experimentation based on ReAct framework. We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate. It can build compelling ML models over many tasks in MLAgentBench with 37.5% average success rate. Our agents also display highly interpretable plans and actions. However, the success rates vary considerably; they span from 100% on well-established older datasets to as low as 0% on recent Kaggle challenges created potentially after the underlying LM was trained. Finally, we identify several key challenges for LM-based agents such as long-term planning and reducing hallucination.}
}

Endnote

%0 Conference Paper
%T MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
%A Qian Huang
%A Jian Vora
%A Percy Liang
%A Jure Leskovec
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp	
%F pmlr-v235-huang24y
%I PMLR
%P 20271--20309
%U https://proceedings.mlr.press/v235/huang24y.html
%V 235
%X A central aspect of machine learning research is experimentation, the process of designing and running experiments, analyzing the results, and iterating towards some positive outcome (e.g., improving accuracy). Could agents driven by powerful language models perform machine learning experimentation effectively? To answer this question, we introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM. For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs. We then construct an agent that can perform ML experimentation based on ReAct framework. We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate. It can build compelling ML models over many tasks in MLAgentBench with 37.5% average success rate. Our agents also display highly interpretable plans and actions. However, the success rates vary considerably; they span from 100% on well-established older datasets to as low as 0% on recent Kaggle challenges created potentially after the underlying LM was trained. Finally, we identify several key challenges for LM-based agents such as long-term planning and reducing hallucination.

APA


Huang, Q., Vora, J., Liang, P. & Leskovec, J.. (2024). MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:20271-20309 Available from https://proceedings.mlr.press/v235/huang24y.html.

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

Abstract

Cite this Paper

Related Material