Cradle: Empowering Foundation Agents towards General Computer Control

Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, Börje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:58658-58725, 2025.

Abstract

Despite their success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules (Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory), Cradle is able to understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning and information retrieval, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities: Skylines, Stardew Valley, and Dealer’s Life 2), five software applications (Chrome, Outlook, Feishu, Meitu, and CapCut), and a comprehensive benchmark, OSWorld. With a unified interface to interact with any software, Cradle greatly extends the reach of foundation agents, thus paving the way for generalist agents.
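
As a rough illustration of the GCC interface described in the abstract (screenshots in, keyboard and mouse actions out), the following minimal Python sketch shows what such an interaction loop might look like. It is not taken from the Cradle codebase: every name here (KeyPress, MouseClick, Environment, run_episode, FakeEnv, toy_policy) is hypothetical, and the six modules are collapsed into a single placeholder policy function.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Union


@dataclass(frozen=True)
class KeyPress:
    """Hypothetical low-level keyboard action."""
    key: str


@dataclass(frozen=True)
class MouseClick:
    """Hypothetical low-level mouse action at screen coordinates."""
    x: int
    y: int
    button: str = "left"


Action = Union[KeyPress, MouseClick]


class Environment(Protocol):
    """Assumed GCC-style interface: no software-specific API, only pixels and input events."""
    def screenshot(self) -> bytes: ...
    def execute(self, action: Action) -> None: ...


# A policy maps one screenshot to a list of actions; an empty list means "stop".
Policy = Callable[[bytes], List[Action]]


def run_episode(env: Environment, policy: Policy, max_steps: int) -> List[Action]:
    """Observe a screenshot, ask the policy for actions, execute them, and repeat."""
    log: List[Action] = []
    for _ in range(max_steps):
        frame = env.screenshot()
        actions = policy(frame)
        if not actions:
            break
        for action in actions:
            env.execute(action)
            log.append(action)
    return log


class FakeEnv:
    """Toy stand-in for real software: the 'screenshot' encodes how many actions have run."""
    def __init__(self) -> None:
        self.executed: List[Action] = []

    def screenshot(self) -> bytes:
        return str(len(self.executed)).encode()

    def execute(self, action: Action) -> None:
        self.executed.append(action)


def toy_policy(frame: bytes) -> List[Action]:
    """Placeholder for the LMM-driven planning step; purely illustrative."""
    if frame == b"0":
        return [MouseClick(10, 20)]
    if frame == b"1":
        return [KeyPress("enter")]
    return []


if __name__ == "__main__":
    print(run_episode(FakeEnv(), toy_policy, max_steps=5))
```

The point of the sketch is only that the environment exposes nothing beyond screenshot and execute, so the same loop could in principle drive any application; how Cradle actually structures planning, reflection, and memory is described in the paper itself.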

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-tan25h,
  title = {Cradle: Empowering Foundation Agents towards General Computer Control},
  author = {Tan, Weihao and Zhang, Wentao and Xu, Xinrun and Xia, Haochong and Ding, Ziluo and Li, Boyu and Zhou, Bohan and Yue, Junpeng and Jiang, Jiechuan and Li, Yewen and An, Ruyi and Qin, Molei and Zong, Chuqiao and Zheng, Longtao and Wu, Yujie and Chai, Xiaoqiang and Bi, Yifei and Xie, Tianbao and Gu, Pengjie and Li, Xiyun and Zhang, Ceyao and Tian, Long and Wang, Chaojie and Wang, Xinrun and Karlsson, B\"{o}rje F. and An, Bo and Yan, Shuicheng and Lu, Zongqing},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {58658--58725},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/tan25h/tan25h.pdf},
  url = {https://proceedings.mlr.press/v267/tan25h.html},
  abstract = {Despite their success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory, Cradle is able to understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning and information retrieval, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities:Skylines, Stardew Valley and Dealer’s Life 2), five software applications (Chrome, Outlook, Feishu, Meitu and CapCut), and a comprehensive benchmark, OSWorld. With a unified interface to interact with any software, Cradle greatly extends the reach of foundation agents thus paving the way for generalist agents.}
}
Endnote
%0 Conference Paper
%T Cradle: Empowering Foundation Agents towards General Computer Control
%A Weihao Tan
%A Wentao Zhang
%A Xinrun Xu
%A Haochong Xia
%A Ziluo Ding
%A Boyu Li
%A Bohan Zhou
%A Junpeng Yue
%A Jiechuan Jiang
%A Yewen Li
%A Ruyi An
%A Molei Qin
%A Chuqiao Zong
%A Longtao Zheng
%A Yujie Wu
%A Xiaoqiang Chai
%A Yifei Bi
%A Tianbao Xie
%A Pengjie Gu
%A Xiyun Li
%A Ceyao Zhang
%A Long Tian
%A Chaojie Wang
%A Xinrun Wang
%A Börje F. Karlsson
%A Bo An
%A Shuicheng Yan
%A Zongqing Lu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-tan25h
%I PMLR
%P 58658--58725
%U https://proceedings.mlr.press/v267/tan25h.html
%V 267
%X Despite their success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory, Cradle is able to understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning and information retrieval, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities:Skylines, Stardew Valley and Dealer’s Life 2), five software applications (Chrome, Outlook, Feishu, Meitu and CapCut), and a comprehensive benchmark, OSWorld. With a unified interface to interact with any software, Cradle greatly extends the reach of foundation agents thus paving the way for generalist agents.
APA
Tan, W., Zhang, W., Xu, X., Xia, H., Ding, Z., Li, B., Zhou, B., Yue, J., Jiang, J., Li, Y., An, R., Qin, M., Zong, C., Zheng, L., Wu, Y., Chai, X., Bi, Y., Xie, T., Gu, P., Li, X., Zhang, C., Tian, L., Wang, C., Wang, X., Karlsson, B. F., An, B., Yan, S., & Lu, Z. (2025). Cradle: Empowering Foundation Agents towards General Computer Control. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:58658-58725. Available from https://proceedings.mlr.press/v267/tan25h.html.

Related Material