HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers

Jianke Zhang, Yanjiang Guo, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen
Proceedings of The 8th Conference on Robot Learning, PMLR 270:933-946, 2025.

Abstract

Large Vision-Language-Action (VLA) models, which leverage powerful pre-trained Vision-Language Model (VLM) backends, have shown promise in robotic control thanks to their impressive generalization ability. However, this success comes at a cost: their reliance on VLM backends with billions of parameters leads to high computational cost and inference latency, limiting evaluation mainly to quasi-static tasks and hindering performance in dynamic tasks that require rapid interaction. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible trade-off between control frequency and performance. HiRT runs the VLM at a low frequency to capture temporally invariant features, while a high-frequency vision-based policy, guided by these slowly updated features, enables real-time interaction. Experimental results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, we achieve a 58% reduction in inference latency while maintaining comparable success rates. Additionally, on novel dynamic manipulation benchmarks that are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
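
To make the slow/fast division of labor concrete, the sketch below illustrates the general pattern the abstract describes: a large, slow semantic encoder (standing in for the VLM backend) refreshes a latent conditioning vector at a few Hz in a background thread, while a lightweight visuomotor policy reads the most recent latent together with the current camera frame at a much higher control rate. All class names, rates, dimensions, and the conditioning mechanism here are illustrative assumptions, not HiRT's actual implementation; see the paper for the real architecture and training details.

```python
"""Minimal sketch of a hierarchical slow/fast control loop in the spirit of HiRT.

Everything below is a hypothetical placeholder: a VLM-like encoder is queried at a
low rate to produce a latent summary of the scene and instruction, while a small
policy consumes the latest frame plus that (possibly stale) latent at a high rate.
"""
import time
import threading
import numpy as np


class SlowSemanticEncoder:
    """Stand-in for a large VLM backend: slow, called infrequently."""

    def encode(self, image: np.ndarray, instruction: str) -> np.ndarray:
        time.sleep(0.3)                      # emulate ~3 Hz inference latency
        return np.random.randn(256)          # latent conditioning vector


class FastVisuomotorPolicy:
    """Stand-in for a small vision-based policy: fast, called every control step."""

    def act(self, image: np.ndarray, latent: np.ndarray) -> np.ndarray:
        return np.tanh(np.random.randn(7))   # e.g. a 7-DoF action


def control_loop(steps: int = 50, fast_hz: float = 20.0) -> None:
    encoder, policy = SlowSemanticEncoder(), FastVisuomotorPolicy()
    instruction = "pick up the moving cup"
    latest_latent = encoder.encode(np.zeros((224, 224, 3)), instruction)
    lock = threading.Lock()

    def slow_worker() -> None:
        nonlocal latest_latent
        while True:
            image = np.zeros((224, 224, 3))  # placeholder for the current camera frame
            latent = encoder.encode(image, instruction)
            with lock:
                latest_latent = latent       # asynchronously refresh the latent

    threading.Thread(target=slow_worker, daemon=True).start()

    for step in range(steps):
        image = np.zeros((224, 224, 3))      # placeholder for the current camera frame
        with lock:
            latent = latest_latent           # fast loop reads the most recent latent
        action = policy.act(image, latent)
        print(f"step {step:02d} action[:3] = {action[:3]}")
        time.sleep(1.0 / fast_hz)            # run the fast policy at ~20 Hz


if __name__ == "__main__":
    control_loop()
```

Because the fast loop only reads the most recently published latent rather than waiting for the encoder, lowering or raising the encoder's rate trades semantic freshness against reaction time, which is the frequency/performance trade-off the abstract refers to.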

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-zhang25b,
  title     = {HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers},
  author    = {Zhang, Jianke and Guo, Yanjiang and Chen, Xiaoyu and Wang, Yen-Jen and Hu, Yucheng and Shi, Chengming and Chen, Jianyu},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {933--946},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/zhang25b/zhang25b.pdf},
  url       = {https://proceedings.mlr.press/v270/zhang25b.html}
}
Endnote
%0 Conference Paper
%T HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
%A Jianke Zhang
%A Yanjiang Guo
%A Xiaoyu Chen
%A Yen-Jen Wang
%A Yucheng Hu
%A Chengming Shi
%A Jianyu Chen
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-zhang25b
%I PMLR
%P 933--946
%U https://proceedings.mlr.press/v270/zhang25b.html
%V 270
APA
Zhang, J., Guo, Y., Chen, X., Wang, Y., Hu, Y., Shi, C. & Chen, J. (2025). HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:933-946. Available from https://proceedings.mlr.press/v270/zhang25b.html.
