Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models

Rokas Bendikas, Daniel Dijkman, Markus Peschl, Sanjay Haresh, Pietro Mazzaglia
Proceedings of The 9th Conference on Robot Learning, PMLR 305:3869-3887, 2025.

Abstract

Vision-Language-Action (VLA) models offer a pivotal approach to learning robotic manipulation at scale by repurposing large pre-trained Vision-Language Models (VLMs) to output robotic actions. However, adapting VLMs for robotic domains comes with an unnecessarily high computational cost, which we attribute to the tokenization scheme of visual inputs. In this work, we aim to enable efficient VLA training by proposing Oat-VLA, an Object-Agent-centric Tokenization for VLAs. Building on the insights of object-centric representation learning, our method introduces an inductive bias towards scene objects and the agent’s own visual information. As a result, we find that Oat-VLA can drastically reduce the number of visual tokens to just a few without sacrificing performance. We reveal that Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite and outperforms OpenVLA in diverse real-world pick-and-place tasks.
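The paper itself is not reproduced on this page, so the following is only a minimal sketch of the idea the abstract describes: instead of feeding every vision-transformer patch token to the VLM, pool patch features into one token per scene object plus one token for the agent's end-effector. The function name `pool_tokens`, the masked-average pooling, and the assumption that object and agent masks come from an off-the-shelf segmenter are illustrative choices, not the authors' implementation.

```python
# Hypothetical sketch of object-agent-centric token pooling (not Oat-VLA's code).
# Assumes ViT patch embeddings and binary masks from an external segmenter.
import torch

def pool_tokens(patch_feats, object_masks, agent_mask):
    """
    patch_feats:  (N, D) patch embeddings for an image with N patches
    object_masks: (K, N) binary masks, one per detected object
    agent_mask:   (N,)   binary mask covering the robot's end-effector
    returns:      (K + 1, D) one token per object plus one agent token
    """
    tokens = []
    for mask in object_masks:
        w = mask.float() / mask.float().sum().clamp(min=1.0)
        tokens.append(w @ patch_feats)          # masked average pooling
    w = agent_mask.float() / agent_mask.float().sum().clamp(min=1.0)
    tokens.append(w @ patch_feats)              # agent token
    return torch.stack(tokens)                  # (K + 1, D)

# Toy usage: 256 patches (a 16x16 grid), 768-dim features, 3 objects.
feats = torch.randn(256, 768)
objs = torch.rand(3, 256) > 0.9
agent = torch.rand(256) > 0.95
vis_tokens = pool_tokens(feats, objs, agent)
print(vis_tokens.shape)  # torch.Size([4, 768]) -- a few tokens instead of 256
```

In this toy example, 256 patch tokens collapse to 4 visual tokens, which is the kind of reduction the abstract credits with faster convergence.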

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-bendikas25a,
  title     = {Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models},
  author    = {Bendikas, Rokas and Dijkman, Daniel and Peschl, Markus and Haresh, Sanjay and Mazzaglia, Pietro},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {3869--3887},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/bendikas25a/bendikas25a.pdf},
  url       = {https://proceedings.mlr.press/v305/bendikas25a.html},
  abstract  = {Vision-Language-Action (VLA) models offer a pivotal approach to learning robotic manipulation at scale by repurposing large pre-trained Vision-Language Models (VLMs) to output robotic actions. However, adapting VLMs for robotic domains comes with an unnecessarily high computational cost, which we attribute to the tokenization scheme of visual inputs. In this work, we aim to enable efficient VLA training by proposing Oat-VLA, an Object-Agent-centric Tokenization for VLAs. Building on the insights of object-centric representation learning, our method introduces an inductive bias towards scene objects and the agent’s own visual information. As a result, we find that Oat-VLA can drastically reduce the number of visual tokens to just a few without sacrificing performance. We reveal that Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite and outperforms OpenVLA in diverse real-world pick-and-place tasks.}
}
Endnote
%0 Conference Paper
%T Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models
%A Rokas Bendikas
%A Daniel Dijkman
%A Markus Peschl
%A Sanjay Haresh
%A Pietro Mazzaglia
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-bendikas25a
%I PMLR
%P 3869--3887
%U https://proceedings.mlr.press/v305/bendikas25a.html
%V 305
%X Vision-Language-Action (VLA) models offer a pivotal approach to learning robotic manipulation at scale by repurposing large pre-trained Vision-Language Models (VLMs) to output robotic actions. However, adapting VLMs for robotic domains comes with an unnecessarily high computational cost, which we attribute to the tokenization scheme of visual inputs. In this work, we aim to enable efficient VLA training by proposing Oat-VLA, an Object-Agent-centric Tokenization for VLAs. Building on the insights of object-centric representation learning, our method introduces an inductive bias towards scene objects and the agent’s own visual information. As a result, we find that Oat-VLA can drastically reduce the number of visual tokens to just a few without sacrificing performance. We reveal that Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite and outperforms OpenVLA in diverse real-world pick-and-place tasks.
APA
Bendikas, R., Dijkman, D., Peschl, M., Haresh, S. & Mazzaglia, P. (2025). Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:3869-3887. Available from https://proceedings.mlr.press/v305/bendikas25a.html.