Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:4874-4910, 2025.
Abstract
Large language models (LLMs) show potential as computer agents, enhancing productivity and software accessibility in multi-modal tasks. However, measuring agent performance in sufficiently realistic and complex environments is increasingly challenging because: (i) most benchmarks are limited to specific modalities or domains (e.g., text-only, web navigation, Q&A) and (ii) full benchmark evaluations are slow (on the order of hours or days) given the multi-step, sequential nature of the tasks. To address these challenges, we introduce Windows Agent Arena: a general environment focusing exclusively on the Windows operating system (OS), where agents can operate freely within a real OS and use the same applications and tools available to human users when performing tasks. We create 150+ diverse tasks across representative domains that require agentic abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized, enabling a full benchmark evaluation in as little as 20 minutes. Our work not only speeds up the development and evaluation cycle of multi-modal agents, but also highlights and analyzes shortfalls in the agentic abilities of several multi-modal LLMs within the Windows computing environment, with the best agent achieving only a 19.5% success rate compared to a human success rate of 74.5%.