Contrast Sets for Evaluating Language-Guided Robot Policies

Abrar Anwar; Rohan Gupta; Jesse Thomason

Proceedings of The 8th Conference on Robot Learning, PMLR 270:2205-2219, 2025.

Abstract

Robot evaluations in language-guided, real world settings are time-consuming and often sample only a small space of potential instructions across complex scenes. In this work, we introduce contrast sets for robotics as an approach to make small, but specific, perturbations to otherwise independent, identically distributed (i.i.d.) test instances. We investigate the relationship between experimenter effort to carry out an evaluation and the resulting estimated test performance as well as the insights that can be drawn from performance on perturbed instances. We use contrast sets to characterize policies at reduced experimenter effort in both a simulated manipulation task and a physical robot vision-and-language navigation task. We encourage the use of contrast set evaluations as a more informative alternative to small scale, i.i.d. demonstrations on physical robots, and as a scalable alternative to industry-scale real world evaluations.

Cite this Paper

BibTeX

@InProceedings{pmlr-v270-anwar25a,
  title = {Contrast Sets for Evaluating Language-Guided Robot Policies},
  author = {Anwar, Abrar and Gupta, Rohan and Thomason, Jesse},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages = {2205--2219},
  year = {2025},
  editor = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume = {270},
  series = {Proceedings of Machine Learning Research},
  month = {06--09 Nov},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/anwar25a/anwar25a.pdf},
  url = {https://proceedings.mlr.press/v270/anwar25a.html},
  abstract = 	 {Robot evaluations in language-guided, real world settings are time-consuming and often sample only a small space of potential instructions across complex scenes. In this work, we introduce contrast sets for robotics as an approach to make small, but specific, perturbations to otherwise independent, identically distributed (i.i.d.) test instances. We investigate the relationship between experimenter effort to carry out an evaluation and the resulting estimated test performance as well as the insights that can be drawn from performance on perturbed instances. We use contrast sets to characterize policies at reduced experimenter effort in both a simulated manipulation task and a physical robot vision-and-language navigation task. We encourage the use of contrast set evaluations as a more informative alternative to small scale, i.i.d. demonstrations on physical robots, and as a scalable alternative to industry-scale real world evaluations.}
}

Endnote

%0 Conference Paper
%T Contrast Sets for Evaluating Language-Guided Robot Policies
%A Abrar Anwar
%A Rohan Gupta
%A Jesse Thomason
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-anwar25a
%I PMLR
%P 2205--2219
%U https://proceedings.mlr.press/v270/anwar25a.html
%V 270
%X Robot evaluations in language-guided, real world settings are time-consuming and often sample only a small space of potential instructions across complex scenes. In this work, we introduce contrast sets for robotics as an approach to make small, but specific, perturbations to otherwise independent, identically distributed (i.i.d.) test instances. We investigate the relationship between experimenter effort to carry out an evaluation and the resulting estimated test performance as well as the insights that can be drawn from performance on perturbed instances. We use contrast sets to characterize policies at reduced experimenter effort in both a simulated manipulation task and a physical robot vision-and-language navigation task. We encourage the use of contrast set evaluations as a more informative alternative to small scale, i.i.d. demonstrations on physical robots, and as a scalable alternative to industry-scale real world evaluations.

APA

Anwar, A., Gupta, R., & Thomason, J. (2025). Contrast Sets for Evaluating Language-Guided Robot Policies. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:2205-2219. Available from https://proceedings.mlr.press/v270/anwar25a.html.
