RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models

Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, Marco Pavone
Proceedings of The 9th Conference on Robot Learning, PMLR 305:3200-3217, 2025.

Abstract

Vision-Language-Action (VLA) models, pre-trained on large-scale imitation learning datasets, have demonstrated remarkable capabilities in visuomotor control. However, these models exhibit diverse failure modes in unstructured real-world environments, limiting the widespread adoption of VLAs in robotics. Efforts to enhance the robustness and generalization of VLAs have gradually shifted from the pre-training to the post-training phase. Yet, the potential of scaling test-time compute remains underexplored. In this paper, we investigate test-time scaling for robotics through the lens of sampling and verification. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on this insight, we propose a synthetic data generation pipeline for training a Vision-Language Model (VLM)-based action verifier, and show that scaling the synthetic dataset consistently improves verification and downstream accuracy. We then introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbations and majority voting to construct an action proposal distribution, and then uses the VLM-based verifier to select the optimal action. Through extensive evaluations across simulated and real-world environments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and an 8% higher average success rate on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone.
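
The abstract names only the family of the fit, not its parameters. As a sketch (the constants a and b below are assumptions, not values taken from the paper), an exponentiated power law relating the best-of-k action error E(k) to the number of sampled actions k could be written as:

    % Hypothetical parameterization: the paper reports an exponentiated
    % power law, but the abstract does not give the fitted form; a and b
    % are assumed per-model constants.
    \log E(k) \approx a \, k^{b}, \qquad a < 0, \; 0 < b < 1

Under these assumed sign constraints, error falls monotonically as more actions are sampled, with diminishing returns in k, which is what makes verifier-guided selection over a modest candidate pool attractive.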
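
One plausible reading of the deployment pipeline described above is sketched below in Python. The callables vla_sample and verifier_score, the use of an element-wise median as the majority vote, and all sample counts and noise scales are illustrative assumptions, not the paper's actual interfaces:

    import numpy as np

    def robomonkey_select(vla_sample, verifier_score, image, instruction,
                          n_init=4, n_proposals=16, sigma=0.01, rng=None):
        """Sketch of a sample-perturb-vote-verify loop (assumed interfaces).

        vla_sample(image, instruction) -> 1-D action array (hypothetical)
        verifier_score(image, instruction, action) -> float (hypothetical)
        """
        rng = rng or np.random.default_rng()

        # 1. Draw a small set of candidate actions from the VLA.
        samples = np.stack([vla_sample(image, instruction)
                            for _ in range(n_init)])

        # 2. Majority voting: use the element-wise median as the consensus
        #    action (the paper's exact voting rule is not in the abstract).
        center = np.median(samples, axis=0)

        # 3. Gaussian perturbation: expand the consensus into a wider
        #    proposal distribution around the voted action.
        proposals = center + rng.normal(
            0.0, sigma, size=(n_proposals,) + center.shape)

        # 4. Verification: score each proposal with the VLM-based verifier
        #    and return the highest-scoring action for execution.
        scores = [verifier_score(image, instruction, a) for a in proposals]
        return proposals[int(np.argmax(scores))]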

Cite this Paper

BibTeX
@InProceedings{pmlr-v305-kwok25a,
  title     = {RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models},
  author    = {Kwok, Jacky and Agia, Christopher and Sinha, Rohan and Foutter, Matt and Li, Shulu and Stoica, Ion and Mirhoseini, Azalia and Pavone, Marco},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {3200--3217},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/kwok25a/kwok25a.pdf},
  url       = {https://proceedings.mlr.press/v305/kwok25a.html}
}
Endnote
%0 Conference Paper
%T RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models
%A Jacky Kwok
%A Christopher Agia
%A Rohan Sinha
%A Matt Foutter
%A Shulu Li
%A Ion Stoica
%A Azalia Mirhoseini
%A Marco Pavone
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-kwok25a
%I PMLR
%P 3200--3217
%U https://proceedings.mlr.press/v305/kwok25a.html
%V 305
APA
Kwok, J., Agia, C., Sinha, R., Foutter, M., Li, S., Stoica, I., Mirhoseini, A. & Pavone, M. (2025). RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:3200-3217. Available from https://proceedings.mlr.press/v305/kwok25a.html.
