Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:69772-69805, 2025.

Abstract

Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions, and incorporates structured reasoning via inner monologue. To enable this, we construct the Aguvis data collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, making it the first fully autonomous vision-based GUI agent that operates without relying on closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.
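To picture the two-stage pipeline the abstract describes (GUI grounding first, then planning and reasoning), the following is a minimal, hypothetical Python sketch. Every name in it (Agent, train_stage, the placeholder batches) is an illustrative assumption, not the authors' released training recipe.

# Illustrative sketch of a two-stage curriculum as described in the abstract:
# stage 1 trains GUI grounding, stage 2 trains planning and reasoning.
# All names below are hypothetical placeholders, not the Aguvis codebase.

from dataclasses import dataclass, field

@dataclass
class Agent:
    """Stand-in for a vision-language model checkpoint."""
    seen_examples: list = field(default_factory=list)

    def update(self, batch):
        # A real implementation would compute a loss on screenshots and
        # target actions, then take an optimizer step; here we only record
        # the examples to keep the sketch self-contained and runnable.
        self.seen_examples.extend(batch)

def train_stage(agent, dataset, epochs=1):
    """Run one training stage over the given dataset."""
    for _ in range(epochs):
        for batch in dataset:
            agent.update(batch)
    return agent

# Stage 1 data: grounding only -- map instructions to on-screen coordinates.
grounding_data = [["click(x=0.41, y=0.27)"]]                      # placeholder batch
# Stage 2 data: inner-monologue reasoning paired with low-level actions.
reasoning_data = [["Thought: open settings", "click(x=0.90, y=0.05)"]]

agent = Agent()
agent = train_stage(agent, grounding_data)   # GUI grounding stage
agent = train_stage(agent, reasoning_data)   # planning and reasoning stage

The point of the separation is that the second stage builds on a model that can already localize targets on raw screenshots, so the reasoning data can focus on when and why to act rather than where.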

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-xu25ae,
  title     = {Aguvis: Unified Pure Vision Agents for Autonomous {GUI} Interaction},
  author    = {Xu, Yiheng and Wang, Zekun and Wang, Junli and Lu, Dunjie and Xie, Tianbao and Saha, Amrita and Sahoo, Doyen and Yu, Tao and Xiong, Caiming},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {69772--69805},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/xu25ae/xu25ae.pdf},
  url       = {https://proceedings.mlr.press/v267/xu25ae.html},
  abstract  = {Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis data collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.}
}
Endnote
%0 Conference Paper
%T Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
%A Yiheng Xu
%A Zekun Wang
%A Junli Wang
%A Dunjie Lu
%A Tianbao Xie
%A Amrita Saha
%A Doyen Sahoo
%A Tao Yu
%A Caiming Xiong
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-xu25ae
%I PMLR
%P 69772--69805
%U https://proceedings.mlr.press/v267/xu25ae.html
%V 267
%X Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis data collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.
APA
Xu, Y., Wang, Z., Wang, J., Lu, D., Xie, T., Saha, A., Sahoo, D., Yu, T., & Xiong, C. (2025). Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:69772-69805. Available from https://proceedings.mlr.press/v267/xu25ae.html.