HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers

Jianke Zhang, Yanjiang Guo, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen
Proceedings of The 8th Conference on Robot Learning, PMLR 270:933-946, 2025.

Abstract

Large Vision-Language-Action (VLA) models, which leverage powerful pre-trained Vision-Language Model (VLM) backends, have shown promise in robotic control thanks to their impressive generalization ability. However, this success comes at a cost: their reliance on VLM backends with billions of parameters leads to high computational cost and inference latency, limiting evaluation mainly to quasi-static tasks and hindering performance in dynamic tasks that require rapid interaction. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible trade-off between control frequency and performance. HiRT runs the VLM at a low frequency to capture temporally invariant features, while a high-frequency vision-based policy, guided by these slowly updated features, enables real-time interaction. Experimental results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, we achieve a 58% reduction in inference latency while maintaining comparable success rates. Additionally, on novel dynamic manipulation benchmarks that are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
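
To make the slow/fast division of labor concrete, the sketch below illustrates the general pattern the abstract describes: a large, slow semantic encoder (standing in for the VLM backend) refreshes a latent conditioning vector at a few Hz in a background thread, while a lightweight visuomotor policy reads the most recent latent together with the current camera frame at a much higher control rate. All class names, rates, dimensions, and the conditioning mechanism here are illustrative assumptions, not HiRT's actual implementation; see the paper for the real architecture and training details.

```python
"""Minimal sketch of a hierarchical slow/fast control loop in the spirit of HiRT.

Everything below is a hypothetical placeholder: a VLM-like encoder is queried at a
low rate to produce a latent summary of the scene and instruction, while a small
policy consumes the latest frame plus that (possibly stale) latent at a high rate.
"""
import time
import threading
import numpy as np


class SlowSemanticEncoder:
    """Stand-in for a large VLM backend: slow, called infrequently."""

    def encode(self, image: np.ndarray, instruction: str) -> np.ndarray:
        time.sleep(0.3)                      # emulate ~3 Hz inference latency
        return np.random.randn(256)          # latent conditioning vector


class FastVisuomotorPolicy:
    """Stand-in for a small vision-based policy: fast, called every control step."""

    def act(self, image: np.ndarray, latent: np.ndarray) -> np.ndarray:
        return np.tanh(np.random.randn(7))   # e.g. a 7-DoF action


def control_loop(steps: int = 50, fast_hz: float = 20.0) -> None:
    encoder, policy = SlowSemanticEncoder(), FastVisuomotorPolicy()
    instruction = "pick up the moving cup"
    latest_latent = encoder.encode(np.zeros((224, 224, 3)), instruction)
    lock = threading.Lock()

    def slow_worker() -> None:
        nonlocal latest_latent
        while True:
            image = np.zeros((224, 224, 3))  # placeholder for the current camera frame
            latent = encoder.encode(image, instruction)
            with lock:
                latest_latent = latent       # asynchronously refresh the latent

    threading.Thread(target=slow_worker, daemon=True).start()

    for step in range(steps):
        image = np.zeros((224, 224, 3))      # placeholder for the current camera frame
        with lock:
            latent = latest_latent           # fast loop reads the most recent latent
        action = policy.act(image, latent)
        print(f"step {step:02d} action[:3] = {action[:3]}")
        time.sleep(1.0 / fast_hz)            # run the fast policy at ~20 Hz


if __name__ == "__main__":
    control_loop()
```

Because the fast loop only reads the most recently published latent rather than waiting for the encoder, lowering or raising the encoder's rate trades semantic freshness against reaction time, which is the frequency/performance trade-off the abstract refers to.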

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-zhang25b,
  title     = {HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers},
  author    = {Zhang, Jianke and Guo, Yanjiang and Chen, Xiaoyu and Wang, Yen-Jen and Hu, Yucheng and Shi, Chengming and Chen, Jianyu},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {933--946},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/zhang25b/zhang25b.pdf},
  url       = {https://proceedings.mlr.press/v270/zhang25b.html}
}
Endnote
%0 Conference Paper
%T HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
%A Jianke Zhang
%A Yanjiang Guo
%A Xiaoyu Chen
%A Yen-Jen Wang
%A Yucheng Hu
%A Chengming Shi
%A Jianyu Chen
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-zhang25b
%I PMLR
%P 933--946
%U https://proceedings.mlr.press/v270/zhang25b.html
%V 270
APA
Zhang, J., Guo, Y., Chen, X., Wang, Y., Hu, Y., Shi, C. & Chen, J. (2025). HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:933-946. Available from https://proceedings.mlr.press/v270/zhang25b.html.
