Elucidating the Design Space of Torque-aware Vision-Language-Action Models
Proceedings of The 9th Conference on Robot Learning, PMLR 305:4019-4037, 2025.
Abstract
Many robotic manipulation tasks require sensing and responding to force signals such as torque to assess whether the task has been successfully completed and to enable closed-loop control. However, current Vision-Language-Action (VLA) models lack the ability to integrate such subtle physical feedback. In this work, we explore Torque-aware VLA models, aiming to bridge this gap by systematically studying the design space for incorporating torque signals into existing VLA architectures. We identify and evaluate several strategies, leading to three key findings. First, introducing torque adapters into the decoder consistently outperforms inserting them into the encoder. This is because torque signals align more closely with the decoder’s input, and the decoder is more sensitive to variations in its input. Second, torque history proves to be a critical signal. We find that the most effective way to incorporate it is by summarizing the entire history into a single token, as this preserves the original input pattern of the decoder. Third, inspired by joint prediction and planning paradigms in autonomous driving, we propose predicting torque as an auxiliary output, which further improves performance. This strategy encourages the model to build a physically grounded internal representation of interaction dynamics. Extensive quantitative and qualitative experiments across contact-rich manipulation benchmarks validate our findings. Code, models, and datasets will be released.
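To make the three design choices concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' released code: a bottleneck torque adapter on the decoder side, a torque history compressed into a single token appended to the decoder's usual input, and an auxiliary torque prediction head alongside the action head. All module names, dimensions, and shapes are assumptions for illustration only.

```python
# Hypothetical sketch of the design space described in the abstract (assumed
# names and shapes; the actual architecture may differ).
import torch
import torch.nn as nn


class TorqueHistoryTokenizer(nn.Module):
    """Summarize a torque history (B, T, J) into a single token (B, 1, D)."""
    def __init__(self, num_joints: int, dim: int):
        super().__init__()
        self.proj = nn.Linear(num_joints, dim)
        self.pool = nn.AdaptiveAvgPool1d(1)  # collapse the time axis

    def forward(self, torque_hist: torch.Tensor) -> torch.Tensor:
        x = self.proj(torque_hist)        # (B, T, D)
        x = self.pool(x.transpose(1, 2))  # (B, D, 1)
        return x.transpose(1, 2)          # (B, 1, D)


class TorqueAwareDecoderBlock(nn.Module):
    """Decoder block with a bottleneck torque adapter on the residual path."""
    def __init__(self, dim: int, n_heads: int = 8, adapter_dim: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.adapter = nn.Sequential(  # finding 1: adapter in the decoder
            nn.Linear(dim, adapter_dim), nn.GELU(), nn.Linear(adapter_dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.norm1(tokens)
        h, _ = self.attn(h, h, h)
        tokens = tokens + h + self.adapter(h)  # residual adapter path
        return tokens + self.ff(self.norm2(tokens))


class TorqueAwareActionHead(nn.Module):
    """Predicts actions and, as an auxiliary output, torque (finding 3)."""
    def __init__(self, dim: int, action_dim: int, num_joints: int, num_layers: int = 2):
        super().__init__()
        self.torque_token = TorqueHistoryTokenizer(num_joints, dim)
        self.blocks = nn.ModuleList(
            TorqueAwareDecoderBlock(dim) for _ in range(num_layers)
        )
        self.action_head = nn.Linear(dim, action_dim)
        self.torque_head = nn.Linear(dim, num_joints)  # auxiliary torque output

    def forward(self, vl_tokens: torch.Tensor, torque_hist: torch.Tensor):
        # Finding 2: append a single torque-history token to the decoder's
        # usual input tokens, preserving its original input pattern.
        tokens = torch.cat([vl_tokens, self.torque_token(torque_hist)], dim=1)
        for block in self.blocks:
            tokens = block(tokens)
        pooled = tokens.mean(dim=1)
        return self.action_head(pooled), self.torque_head(pooled)


if __name__ == "__main__":
    model = TorqueAwareActionHead(dim=256, action_dim=7, num_joints=7)
    vl_tokens = torch.randn(2, 32, 256)   # tokens from a VLA backbone (assumed)
    torque_hist = torch.randn(2, 50, 7)   # 50 past torque readings, 7 joints
    actions, torque_pred = model(vl_tokens, torque_hist)
    print(actions.shape, torque_pred.shape)  # (2, 7) and (2, 7)
```

In this reading, the auxiliary torque head would be trained with an extra regression loss on future or current joint torques, mirroring the joint prediction-and-planning framing borrowed from autonomous driving.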