Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection

Abrar Anwar, Rohan Gupta, Zain Merchant, Sayan Ghosh, Willie Neiswanger, Jesse Thomason
Proceedings of The 9th Conference on Robot Learning, PMLR 305:1636-1653, 2025.

Abstract

Evaluating learned robot control policies to determine their performance costs the experimenter time and effort. As robots become more capable of accomplishing diverse tasks, evaluating across all of these tasks becomes more difficult because it is impractical to test every policy on every task multiple times. Rather than considering the average performance of a policy on a task, we consider the distribution of performance over time. In a multi-task policy evaluation setting, we actively model the distribution of robot performance across multiple tasks and policies as we sequentially execute experiments. We show that natural language is a useful prior for modeling relationships between tasks, since tasks often share similarities that can reveal relationships in policy behavior. We leverage this formulation to reduce experimenter effort by using a cost-aware information gain heuristic to efficiently select informative trials. We conduct experiments on existing evaluation data from real robots and simulations and find a 50% reduction in estimates of the mean performance given a fixed cost budget. We encourage the use of our surrogate model as a scalable approach to track progress in evaluation.
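
To illustrate the kind of selection procedure the abstract describes, below is a minimal, hypothetical Python sketch, not the paper's implementation. It assumes a single policy, a Gaussian-process-style surrogate whose task kernel is built from placeholder language embeddings of task descriptions, and a greedy acquisition that scores reduction in predictive uncertainty per unit cost. The functions embed, rbf_kernel, posterior_variance, and select_next_trial, along with the example tasks and costs, are illustrative assumptions.

    # Hypothetical sketch of cost-aware active trial selection for
    # multi-task policy evaluation (single policy, GP-style surrogate).
    import numpy as np

    def embed(description, dim=32):
        # Placeholder embedding; a real system would use a pretrained language model.
        seed = int.from_bytes(description.encode(), "little") % (2**32)
        v = np.random.default_rng(seed).normal(size=dim)
        return v / np.linalg.norm(v)

    def rbf_kernel(E, lengthscale=1.0):
        # Task-similarity kernel over embedding vectors (rows of E).
        d2 = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    def posterior_variance(K, observed, noise=0.1):
        # Predictive variance at every task given trials already run on `observed` tasks.
        if not observed:
            return np.diag(K).copy()
        Koo = K[np.ix_(observed, observed)] + noise * np.eye(len(observed))
        Kxo = K[:, observed]
        return np.diag(K) - np.einsum("ij,jk,ik->i", Kxo, np.linalg.inv(Koo), Kxo)

    def select_next_trial(K, observed, costs):
        # Score each candidate task by a simple proxy for information gain:
        # the summed reduction in log predictive variance across all tasks,
        # divided by the cost of running that trial.
        var_now = posterior_variance(K, observed)
        scores = []
        for t in range(K.shape[0]):
            var_after = posterior_variance(K, observed + [t])
            gain = 0.5 * np.sum(np.log(var_now + 1e-9) - np.log(var_after + 1e-9))
            scores.append(gain / costs[t])
        return int(np.argmax(scores))

    tasks = ["pick up the red block", "pick up the blue block", "open the top drawer"]
    E = np.stack([embed(t) for t in tasks])
    K = rbf_kernel(E)
    costs = np.array([1.0, 1.0, 3.0])  # e.g., drawer trials take longer to reset
    print(select_next_trial(K, observed=[0], costs=costs))

In this toy setup, the acquisition trades off how much a trial would tighten the surrogate's estimates against how expensive that trial is to run; the surrogate and acquisition used in the paper may differ in form.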

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-anwar25a,
  title     = {Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection},
  author    = {Anwar, Abrar and Gupta, Rohan and Merchant, Zain and Ghosh, Sayan and Neiswanger, Willie and Thomason, Jesse},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {1636--1653},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/anwar25a/anwar25a.pdf},
  url       = {https://proceedings.mlr.press/v305/anwar25a.html},
  abstract  = {Evaluating learned robot control policies to determine their performance costs the experimenter time and effort. As robots become more capable in accomplishing diverse tasks, evaluating across all these tasks becomes more difficult as it is impractical to test every policy on every task multiple times. Rather than considering the average performance of a policy on a task, we consider the distribution of performance over time. In a multi-task policy evaluation setting, we actively model the distribution of robot performance across multiple tasks and policies as we sequentially execute experiments. We show that natural language is a useful prior in modeling relationships between tasks because they often share similarities that can reveal potential relationships in policy behavior. We leverage this formulation to reduce experimenter effort by using a cost-aware information gain heuristic to efficiently select informative trials. We conduct experiments on existing evaluation data from real robots and simulations and find a 50% reduction in estimates of the mean performance given a fixed cost budget. We encourage the use of our surrogate model as a scalable approach to track progress in evaluation.}
}
APA
Anwar, A., Gupta, R., Merchant, Z., Ghosh, S., Neiswanger, W. & Thomason, J. (2025). Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:1636-1653. Available from https://proceedings.mlr.press/v305/anwar25a.html.
