Mechanistic Interpretability for Steering Vision-Language-Action Models

Bear Häon, Kaylene Caswell Stocking, Ian Chuang, Claire Tomlin
Proceedings of The 9th Conference on Robot Learning, PMLR 305:2743-2762, 2025.

Abstract

Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents that can quickly adapt to new tasks, modalities, and environments. However, methods for interpreting and steering VLAs fall far short of classical robotics pipelines, which are grounded in explicit models of kinematics, dynamics, and control. This lack of mechanistic insight is a central challenge for deploying learned policies in real-world robotics, where robustness and explainability are critical. Motivated by advances in mechanistic interpretability for large language models, we introduce the first framework for interpreting and steering VLAs via their internal representations, enabling direct intervention in model behavior at inference time. We project feedforward activations within transformer layers onto the token embedding basis, identifying sparse semantic directions - such as speed and direction - that are causally linked to action selection. Leveraging these findings, we introduce a general-purpose activation steering method that modulates behavior in real time, without fine-tuning, reward signals, or environment interaction. We evaluate this method on two recent open-source VLAs, Pi0 and OpenVLA, and demonstrate zero-shot behavioral control in simulation (LIBERO) and on a physical robot (UR5). This work demonstrates that interpretable components of embodied VLAs can be systematically harnessed for control—establishing a new paradigm for transparent and steerable foundation models in robotics.
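
To make the two ideas in the abstract concrete (reading a feedforward activation in the token embedding basis, then adding a scaled semantic direction back into a layer's activations at inference time), the sketch below shows roughly how such an analysis could look in PyTorch. This is not the authors' implementation: the module paths, layer index, steering token ("slow"), and scale alpha are illustrative assumptions for a generic HuggingFace-style VLA language backbone such as OpenVLA's.

# Minimal sketch (not the paper's code) of logit-lens-style projection and
# activation steering, assuming a decoder-style transformer that exposes its
# input embedding matrix and per-layer MLP modules.
import torch

@torch.no_grad()
def project_onto_token_basis(activation, embedding_matrix, top_k=10):
    """Express a hidden/MLP activation in the token embedding basis and
    return the top-k most aligned vocabulary rows.

    activation:       (d_model,) activation vector from a transformer layer
    embedding_matrix: (vocab_size, d_model) token embedding table
    """
    # Cosine similarity of the activation against every token embedding.
    sims = torch.nn.functional.cosine_similarity(
        embedding_matrix, activation.unsqueeze(0), dim=-1
    )
    scores, token_ids = sims.topk(top_k)
    return token_ids, scores  # e.g. tokens such as "slow", "fast", "left", ...

def make_steering_hook(direction, alpha=4.0):
    """Return a forward hook that nudges a layer's output along a chosen
    semantic direction (e.g. the embedding of a speed-related token),
    leaving the rest of the forward pass untouched."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# Hypothetical usage with an OpenVLA-style policy `model` (names are
# illustrative, not the actual API):
#   emb = model.get_input_embeddings().weight                  # (V, d_model)
#   direction = emb[tokenizer.convert_tokens_to_ids("slow")]
#   handle = model.language_model.model.layers[20].mlp.register_forward_hook(
#       make_steering_hook(direction, alpha=4.0))
#   ...run inference; the hook biases actions along the chosen direction...
#   handle.remove()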

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-haon25a,
  title     = {Mechanistic Interpretability for Steering Vision-Language-Action Models},
  author    = {H\"{a}on, Bear and Stocking, Kaylene Caswell and Chuang, Ian and Tomlin, Claire},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {2743--2762},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/haon25a/haon25a.pdf},
  url       = {https://proceedings.mlr.press/v305/haon25a.html}
}
Endnote
%0 Conference Paper
%T Mechanistic Interpretability for Steering Vision-Language-Action Models
%A Bear Häon
%A Kaylene Caswell Stocking
%A Ian Chuang
%A Claire Tomlin
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-haon25a
%I PMLR
%P 2743--2762
%U https://proceedings.mlr.press/v305/haon25a.html
%V 305
APA
Häon, B., Stocking, K. C., Chuang, I., & Tomlin, C. (2025). Mechanistic Interpretability for Steering Vision-Language-Action Models. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:2743-2762. Available from https://proceedings.mlr.press/v305/haon25a.html.