ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges

Roshan Kenia; Xiaoman Zhang; Pranav Rajpurkar

ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges

Roshan Kenia, Xiaoman Zhang, Pranav Rajpurkar

Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:4288-4315, 2026.

Abstract

Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high-dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX-MLE , a benchmark of 20 challenges derived from high-impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML-agent benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state-of-the-art agents (AIDE, ML-Master, R$&$D-Agent) with different LLM backends (GPT-5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations. ReX-MLE exposes these bottlenecks and provides a foundation for developing domain-aware autonomous AI systems.

Cite this Paper

BibTeX

@InProceedings{pmlr-v315-kenia26a,
  title = 	 {ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges},
  author =       {Kenia, Roshan and Zhang, Xiaoman and Rajpurkar, Pranav},
  booktitle = 	 {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages = 	 {4288--4315},
  year = 	 {2026},
  editor = 	 {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume = 	 {315},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {08--10 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v315/main/assets/kenia26a/kenia26a.pdf},
  url = 	 {https://proceedings.mlr.press/v315/kenia26a.html},
  abstract = 	 {Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high-dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX-MLE , a benchmark of 20 challenges derived from high-impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML-agent benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state-of-the-art agents (AIDE, ML-Master, R$&$D-Agent) with different LLM backends (GPT-5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations. ReX-MLE exposes these bottlenecks and provides a foundation for developing domain-aware autonomous AI systems.}
}

Endnote

%0 Conference Paper
%T ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges
%A Roshan Kenia
%A Xiaoman Zhang
%A Pranav Rajpurkar
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng	
%F pmlr-v315-kenia26a
%I PMLR
%P 4288--4315
%U https://proceedings.mlr.press/v315/kenia26a.html
%V 315
%X Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high-dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX-MLE , a benchmark of 20 challenges derived from high-impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML-agent benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state-of-the-art agents (AIDE, ML-Master, R$&$D-Agent) with different LLM backends (GPT-5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations. ReX-MLE exposes these bottlenecks and provides a foundation for developing domain-aware autonomous AI systems.

APA

Kenia, R., Zhang, X. & Rajpurkar, P.. (2026). ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:4288-4315 Available from https://proceedings.mlr.press/v315/kenia26a.html.

Related Material

Download PDF