UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:45817-45851, 2025.

Abstract

Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, which are critical for many professional and everyday tasks, remain underexplored due to data-collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer-use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse-grained tasks (Element Grounding, Layout Grounding, and Action Prediction) with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer-use agents. With UI-Vision, we aim to advance the development of more capable agents for real-world desktop tasks.
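
To make the evaluation setup concrete, below is a minimal Python sketch of how an element-grounding prediction might be scored against one of these annotations. The record fields and the point-in-bounding-box criterion are illustrative assumptions (a common convention for GUI grounding), not UI-Vision's official schema or metric definitions; the paper specifies the actual formats and metrics.

from dataclasses import dataclass

@dataclass
class ElementAnnotation:
    """Hypothetical annotation record; field names are illustrative,
    not UI-Vision's actual schema."""
    label: str                               # UI element label, e.g. "Save button"
    bbox: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to [0, 1]

def element_grounding_hit(pred_xy: tuple[float, float], gt: ElementAnnotation) -> bool:
    """A predicted click point counts as a hit if it falls inside the
    ground-truth bounding box (a common GUI-grounding criterion)."""
    x, y = pred_xy
    x_min, y_min, x_max, y_max = gt.bbox
    return x_min <= x <= x_max and y_min <= y <= y_max

# Example: score a model's predicted point for "click the Save button".
gt = ElementAnnotation(label="Save button", bbox=(0.02, 0.01, 0.06, 0.04))
print(element_grounding_hit((0.04, 0.02), gt))  # True: point lies inside the box
print(element_grounding_hit((0.50, 0.50), gt))  # False: point misses the element

Layout Grounding and Action Prediction would need richer structures (groups of elements, and action records covering clicks, drags, and keyboard input), but the same point-in-region idea underlies grounding-style scoring.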

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-nayak25a,
  title     = {{UI}-Vision: A Desktop-centric {GUI} Benchmark for Visual Perception and Interaction},
  author    = {Nayak, Shravan and Jian, Xiangru and Lin, Kevin Qinghong and Rodriguez, Juan A. and Kalsi, Montek and Chapados, Nicolas and \"{O}zsu, M. Tamer and Agrawal, Aishwarya and Vazquez, David and Pal, Christopher and Taslakian, Perouz and Gella, Spandana and Rajeswar, Sai},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {45817--45851},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/nayak25a/nayak25a.pdf},
  url       = {https://proceedings.mlr.press/v267/nayak25a.html}
}
APA
Nayak, S., Jian, X., Lin, K. Q., Rodriguez, J. A., Kalsi, M., Chapados, N., Özsu, M. T., Agrawal, A., Vazquez, D., Pal, C., Taslakian, P., Gella, S., & Rajeswar, S. (2025). UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:45817-45851. Available from https://proceedings.mlr.press/v267/nayak25a.html.
