Embodied Learning of Reward for Musculoskeletal Control with Vision Language Models

Saraswati Soedarmadji, Yunyue Wei, Chen Zhang, Yisong Yue, Yanan Sui
Proceedings of The 8th Annual Learning for Dynamics and Control Conference, PMLR 331:1675-1693, 2026.

Abstract

Discovering effective reward functions remains a fundamental challenge in motor control of high-dimensional musculoskeletal systems. While humans can describe movement goals explicitly such as "walking forward with an upright posture," the underlying control strategies that realize these goals are largely implicit, making it difficult to directly design rewards from high-level goals and natural language descriptions. We introduce Motion from Vision-Language Representation (MoVLR), a framework that leverages vision-language models (VLMs) to bridge the gap between goal specification and movement control. Rather than relying on handcrafted rewards, MoVLR iteratively explores the reward space through iterative interaction between control optimization and VLM feedback, aligning control policies with physically coordinated behaviors. Our approach transforms language and vision-based assessments into structured guidance for embodied learning, enabling the discovery and refinement of reward functions for high-dimensional musculoskeletal locomotion and manipulation. These results suggest that VLMs can effectively ground abstract motion descriptions in the implicit principles governing physiological motor control.

Cite this Paper


BibTeX
@InProceedings{pmlr-v331-soedarmadji26a, title = {Embodied Learning of Reward for Musculoskeletal Control with Vision Language Models}, author = {Soedarmadji, Saraswati and Wei, Yunyue and Zhang, Chen and Yue, Yisong and Sui, Yanan}, booktitle = {Proceedings of The 8th Annual Learning for Dynamics and Control Conference}, pages = {1675--1693}, year = {2026}, editor = {Sukhatme, Gaurav and Lindemann, Lars and Tu, Stephen and Wierman, Adam and Atanasov, Nikolay}, volume = {331}, series = {Proceedings of Machine Learning Research}, month = {17--19 Jun}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v331/main/assets/soedarmadji26a/soedarmadji26a.pdf}, url = {https://proceedings.mlr.press/v331/soedarmadji26a.html}, abstract = {Discovering effective reward functions remains a fundamental challenge in motor control of high-dimensional musculoskeletal systems. While humans can describe movement goals explicitly such as "walking forward with an upright posture," the underlying control strategies that realize these goals are largely implicit, making it difficult to directly design rewards from high-level goals and natural language descriptions. We introduce Motion from Vision-Language Representation (MoVLR), a framework that leverages vision-language models (VLMs) to bridge the gap between goal specification and movement control. Rather than relying on handcrafted rewards, MoVLR iteratively explores the reward space through iterative interaction between control optimization and VLM feedback, aligning control policies with physically coordinated behaviors. Our approach transforms language and vision-based assessments into structured guidance for embodied learning, enabling the discovery and refinement of reward functions for high-dimensional musculoskeletal locomotion and manipulation. These results suggest that VLMs can effectively ground abstract motion descriptions in the implicit principles governing physiological motor control.} }
Endnote
%0 Conference Paper %T Embodied Learning of Reward for Musculoskeletal Control with Vision Language Models %A Saraswati Soedarmadji %A Yunyue Wei %A Chen Zhang %A Yisong Yue %A Yanan Sui %B Proceedings of The 8th Annual Learning for Dynamics and Control Conference %C Proceedings of Machine Learning Research %D 2026 %E Gaurav Sukhatme %E Lars Lindemann %E Stephen Tu %E Adam Wierman %E Nikolay Atanasov %F pmlr-v331-soedarmadji26a %I PMLR %P 1675--1693 %U https://proceedings.mlr.press/v331/soedarmadji26a.html %V 331 %X Discovering effective reward functions remains a fundamental challenge in motor control of high-dimensional musculoskeletal systems. While humans can describe movement goals explicitly such as "walking forward with an upright posture," the underlying control strategies that realize these goals are largely implicit, making it difficult to directly design rewards from high-level goals and natural language descriptions. We introduce Motion from Vision-Language Representation (MoVLR), a framework that leverages vision-language models (VLMs) to bridge the gap between goal specification and movement control. Rather than relying on handcrafted rewards, MoVLR iteratively explores the reward space through iterative interaction between control optimization and VLM feedback, aligning control policies with physically coordinated behaviors. Our approach transforms language and vision-based assessments into structured guidance for embodied learning, enabling the discovery and refinement of reward functions for high-dimensional musculoskeletal locomotion and manipulation. These results suggest that VLMs can effectively ground abstract motion descriptions in the implicit principles governing physiological motor control.
APA
Soedarmadji, S., Wei, Y., Zhang, C., Yue, Y. & Sui, Y.. (2026). Embodied Learning of Reward for Musculoskeletal Control with Vision Language Models. Proceedings of The 8th Annual Learning for Dynamics and Control Conference, in Proceedings of Machine Learning Research 331:1675-1693 Available from https://proceedings.mlr.press/v331/soedarmadji26a.html.

Related Material