RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

Pranav Atreya, Karl Pertsch, Tony Lee, Moo Jin Kim, Arhan Jain, Artur Kuramshin, Cyrus Neary, Edward S. Hu, Kanav Arora, Kirsty Ellis, Luca Macesanu, Matthew Leonard, Meedeum Cho, Ozgur Aslan, Shivin Dass, Tony Wang, Xingfang Yuan, Abhishek Gupta, Dinesh Jayaraman, Glen Berseth, Kostas Daniilidis, Roberto Martín-Martín, Youngwoon Lee, Percy Liang, Chelsea Finn, Sergey Levine
Proceedings of The 9th Conference on Robot Learning, PMLR 305:336-364, 2025.

Abstract

Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized "robot challenges", and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.
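The core aggregation step described above turns double-blind pairwise preferences into a single ordering of policies. As a minimal illustrative sketch, the snippet below fits a Bradley-Terry preference model using the classic minorization-maximization (Zermelo) updates. Note the assumptions: Bradley-Terry is a standard choice for this kind of aggregation but is not confirmed as the paper's exact method by this page, and the policy names and results in the usage example are made up.

    # Minimal sketch: ranking policies from pairwise preference feedback
    # via a Bradley-Terry model fit with minorization-maximization updates.
    # Assumption: the actual RoboArena aggregation method may differ.
    from collections import defaultdict

    def bradley_terry_ranking(comparisons, iters=100):
        """comparisons: list of (policy_a, policy_b, winner) tuples,
        where winner is policy_a, policy_b, or None for a tie."""
        wins = defaultdict(float)          # per-policy win counts
        pair_counts = defaultdict(float)   # comparisons per unordered pair
        policies = set()
        for a, b, winner in comparisons:
            policies.update((a, b))
            pair_counts[frozenset((a, b))] += 1.0
            if winner is None:             # common heuristic: a tie counts
                wins[a] += 0.5             # as half a win for each side
                wins[b] += 0.5
            else:
                wins[winner] += 1.0

        strength = {p: 1.0 for p in policies}
        for _ in range(iters):
            new = {}
            for p in policies:
                # MM update: wins_p / sum_q n_pq / (strength_p + strength_q)
                denom = sum(
                    pair_counts[frozenset((p, q))] / (strength[p] + strength[q])
                    for q in policies if q != p
                )
                new[p] = wins[p] / denom if denom > 0 else strength[p]
            total = sum(new.values())
            # normalize to fix the scale; floor avoids division by zero
            # for policies that never win
            strength = {p: max(s / total, 1e-9) for p, s in new.items()}
        return sorted(strength.items(), key=lambda kv: -kv[1])

    # Hypothetical double-blind pairwise results from distributed evaluators:
    results = [("policy_a", "policy_b", "policy_a"),
               ("policy_a", "policy_c", None),
               ("policy_b", "policy_c", "policy_c")]
    for policy, score in bradley_terry_ranking(results):
        print(f"{policy}: {score:.3f}")

A convenient property of this style of aggregation is that comparisons need not come from the same tasks or environments: each episode pair contributes one preference, and the fitted strengths induce a global ranking.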

Cite this Paper

BibTeX
@InProceedings{pmlr-v305-atreya25a,
  title     = {RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies},
  author    = {Atreya, Pranav and Pertsch, Karl and Lee, Tony and Kim, Moo Jin and Jain, Arhan and Kuramshin, Artur and Neary, Cyrus and Hu, Edward S. and Arora, Kanav and Ellis, Kirsty and Macesanu, Luca and Leonard, Matthew and Cho, Meedeum and Aslan, Ozgur and Dass, Shivin and Wang, Tony and Yuan, Xingfang and Gupta, Abhishek and Jayaraman, Dinesh and Berseth, Glen and Daniilidis, Kostas and Mart\'{i}n-Mart\'{i}n, Roberto and Lee, Youngwoon and Liang, Percy and Finn, Chelsea and Levine, Sergey},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {336--364},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/atreya25a/atreya25a.pdf},
  url       = {https://proceedings.mlr.press/v305/atreya25a.html}
}
Endnote
%0 Conference Paper
%T RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies
%A Pranav Atreya
%A Karl Pertsch
%A Tony Lee
%A Moo Jin Kim
%A Arhan Jain
%A Artur Kuramshin
%A Cyrus Neary
%A Edward S. Hu
%A Kanav Arora
%A Kirsty Ellis
%A Luca Macesanu
%A Matthew Leonard
%A Meedeum Cho
%A Ozgur Aslan
%A Shivin Dass
%A Tony Wang
%A Xingfang Yuan
%A Abhishek Gupta
%A Dinesh Jayaraman
%A Glen Berseth
%A Kostas Daniilidis
%A Roberto Martín-Martín
%A Youngwoon Lee
%A Percy Liang
%A Chelsea Finn
%A Sergey Levine
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-atreya25a
%I PMLR
%P 336--364
%U https://proceedings.mlr.press/v305/atreya25a.html
%V 305
APA
Atreya, P., Pertsch, K., Lee, T., Kim, M.J., Jain, A., Kuramshin, A., Neary, C., Hu, E.S., Arora, K., Ellis, K., Macesanu, L., Leonard, M., Cho, M., Aslan, O., Dass, S., Wang, T., Yuan, X., Gupta, A., Jayaraman, D., Berseth, G., Daniilidis, K., Martín-Martín, R., Lee, Y., Liang, P., Finn, C., & Levine, S. (2025). RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:336-364. Available from https://proceedings.mlr.press/v305/atreya25a.html.
