ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models

Puhao Li, Yingying Wu, Ziheng Xi, Wanlin Li, Yuzhe Huang, Zhiyuan Zhang, Yinghan Chen, Jianan Wang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang
Proceedings of The 9th Conference on Robot Learning, PMLR 305:1898-1913, 2025.

Abstract

Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To achieve this, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10-20 demonstrations — a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success. Additional experiments highlight ControlVLA’s extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.
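The zero-initialized conditioning described in the abstract can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch illustration of the ControlNet-style idea: a frozen pre-trained policy block is augmented with a cross-attention branch over object-centric tokens whose output projection starts at zero, so the pre-trained behavior is untouched at the start of fine-tuning and the new condition is learned gradually. The class and variable names (ObjectCentricControlBlock, obj_proj, out_proj), the use of cross-attention, and the choice of what stays frozen are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn as nn


    def zero_init(module: nn.Module) -> nn.Module:
        """Zero all parameters so the new branch contributes nothing at step 0."""
        for p in module.parameters():
            nn.init.zeros_(p)
        return module


    class ObjectCentricControlBlock(nn.Module):
        """A frozen pre-trained policy block plus a zero-initialized
        cross-attention branch conditioned on object-centric tokens (sketch)."""

        def __init__(self, pretrained_block: nn.Module, dim: int, obj_dim: int, heads: int = 8):
            super().__init__()
            self.pretrained_block = pretrained_block
            for p in self.pretrained_block.parameters():    # assumption: prior knowledge kept frozen
                p.requires_grad_(False)

            self.obj_proj = nn.Linear(obj_dim, dim)         # lift object tokens into the policy's width
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.out_proj = zero_init(nn.Linear(dim, dim))  # zero-initialized projection (ControlNet-style)

        def forward(self, x: torch.Tensor, obj_tokens: torch.Tensor) -> torch.Tensor:
            # x: (B, T, dim) hidden states of the pre-trained VLA policy
            # obj_tokens: (B, K, obj_dim) object-centric features, e.g. per-object embeddings
            h = self.pretrained_block(x)
            kv = self.obj_proj(obj_tokens)
            attn_out, _ = self.cross_attn(h, kv, kv)
            return h + self.out_proj(attn_out)              # residual is exactly zero before fine-tuning


    # Usage sketch (assumption: only the new branch is trained during few-shot adaptation).
    block = ObjectCentricControlBlock(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        dim=512, obj_dim=256,
    )
    out = block(torch.randn(2, 16, 512), torch.randn(2, 4, 256))
    print(out.shape)  # torch.Size([2, 16, 512])

Because the output projection is zeroed, the conditioned block reproduces the pre-trained block's output at initialization, which is the property that lets the adaptation start from the general-purpose policy rather than overwrite it.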

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-li25c,
  title     = {ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models},
  author    = {Li, Puhao and Wu, Yingying and Xi, Ziheng and Li, Wanlin and Huang, Yuzhe and Zhang, Zhiyuan and Chen, Yinghan and Wang, Jianan and Zhu, Song-Chun and Liu, Tengyu and Huang, Siyuan},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {1898--1913},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/li25c/li25c.pdf},
  url       = {https://proceedings.mlr.press/v305/li25c.html},
  abstract  = {Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To achieve this, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10-20 demonstrations — a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success. Additional experiments highlight ControlVLA’s extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.}
}
Endnote
%0 Conference Paper
%T ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models
%A Puhao Li
%A Yingying Wu
%A Ziheng Xi
%A Wanlin Li
%A Yuzhe Huang
%A Zhiyuan Zhang
%A Yinghan Chen
%A Jianan Wang
%A Song-Chun Zhu
%A Tengyu Liu
%A Siyuan Huang
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-li25c
%I PMLR
%P 1898--1913
%U https://proceedings.mlr.press/v305/li25c.html
%V 305
%X Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To achieve this, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10-20 demonstrations — a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success. Additional experiments highlight ControlVLA’s extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.
APA
Li, P., Wu, Y., Xi, Z., Li, W., Huang, Y., Zhang, Z., Chen, Y., Wang, J., Zhu, S., Liu, T. & Huang, S. (2025). ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:1898-1913. Available from https://proceedings.mlr.press/v305/li25c.html.