MCU: An Evaluation Framework for Open-Ended Game Agents

Xinyue Zheng, Haowei Lin, Kaichen He, Zihao Wang, Qiang Fu, Haobo Fu, Zilong Zheng, Yitao Liang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:78221-78259, 2025.

Abstract

Developing AI agents capable of interacting with open-world environments to solve diverse tasks is a compelling challenge. However, evaluating such open-ended agents remains difficult, with current benchmarks facing scalability limitations. To address this, we introduce Minecraft Universe (MCU), a comprehensive evaluation framework set within the open-world video game Minecraft. MCU incorporates three key components: (1) an expanding collection of 3,452 composable atomic tasks that encompasses 11 major categories and 41 subcategories of challenges; (2) a task composition mechanism capable of generating infinite diverse tasks with varying difficulty; and (3) a general evaluation framework that achieves 91.5% alignment with human ratings for open-ended task assessment. Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. These findings highlight the necessity of MCU as a robust benchmark to drive progress in AI agent development within open-ended environments. Our evaluation code and scripts are available at https://github.com/CraftJarvis/MCU.
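To make the abstract's task-composition idea concrete, below is a minimal, purely illustrative Python sketch of how atomic tasks might be combined into composite tasks of tunable difficulty. The AtomicTask and compose_task names are hypothetical and are not the MCU API; see the GitHub repository above for the actual implementation.

import random
from dataclasses import dataclass

@dataclass
class AtomicTask:
    """One self-contained challenge, e.g. "craft a wooden pickaxe"."""
    name: str
    category: str    # e.g. one of the benchmark's major task categories
    difficulty: int  # hypothetical 1-5 rating

def compose_task(pool, n, seed=None):
    """Sample n atomic tasks from the pool to form one composite task.

    Because any subset of atomic tasks can be combined, a pool of
    thousands of atomic tasks yields a combinatorially large space of
    composite tasks whose overall difficulty grows with n.
    """
    rng = random.Random(seed)
    return rng.sample(pool, n)

pool = [
    AtomicTask("chop a tree", "gather", 1),
    AtomicTask("craft a wooden pickaxe", "craft", 2),
    AtomicTask("mine three stone blocks", "mine", 2),
]
composite = compose_task(pool, n=2, seed=0)
print([t.name for t in composite])  # two atomic tasks forming one composite task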

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zheng25j,
  title     = {{MCU}: An Evaluation Framework for Open-Ended Game Agents},
  author    = {Zheng, Xinyue and Lin, Haowei and He, Kaichen and Wang, Zihao and Fu, Qiang and Fu, Haobo and Zheng, Zilong and Liang, Yitao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {78221--78259},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zheng25j/zheng25j.pdf},
  url       = {https://proceedings.mlr.press/v267/zheng25j.html}
}
APA
Zheng, X., Lin, H., He, K., Wang, Z., Fu, Q., Fu, H., Zheng, Z. & Liang, Y. (2025). MCU: An Evaluation Framework for Open-Ended Game Agents. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:78221-78259. Available from https://proceedings.mlr.press/v267/zheng25j.html.
