CASPER: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models

Huihan Liu, Rutav Shah, Shuijing Liu, Jack Pittenger, Mingyo Seo, Yuchen Cui, Yonatan Bisk, Roberto Martín-Martín, Yuke Zhu
Proceedings of The 9th Conference on Robot Learning, PMLR 305:2462-2483, 2025.

Abstract

Assistive teleoperation, where control is shared between a human and a robot, enables efficient and intuitive human-robot collaboration in diverse and unstructured environments. A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intentions from user control inputs and to assist users with correct actions. Existing methods are either confined to simple, predefined scenarios or restricted to task-specific data distributions at training, limiting their support for real-world assistance. We introduce Casper, an assistive teleoperation system that leverages commonsense knowledge embedded in pre-trained visual language models (VLMs) for real-time intent inference and flexible skill execution. Casper incorporates an open-world perception module for a generalized understanding of novel objects and scenes, a VLM-powered intent inference mechanism that leverages commonsense reasoning to interpret snippets of teleoperated user input, and a skill library that expands the scope of prior assistive teleoperation systems to support diverse, long-horizon mobile manipulation tasks. Extensive empirical evaluation, including human studies and system ablations, demonstrates that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines. More information is available at https://casper-corl25.github.io/
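The abstract describes a pipeline that combines open-world perception, VLM-based intent inference over short snippets of teleoperated input, and a skill library for execution. As an illustrative sketch only, not drawn from the paper, the following Python fragment shows one plausible shape such an intent-inference step could take; every name in it (query_vlm, CANDIDATE_SKILLS, infer_intent, the JSON reply format, the 0.8 confidence threshold) is a hypothetical assumption rather than Casper's actual interface.

# Hypothetical sketch of VLM-based intent inference from a teleoperation
# snippet. Names and formats are placeholders, not the authors' implementation.
import json

# Candidate skills a skill library might expose (assumed for illustration).
CANDIDATE_SKILLS = ["pick(object)", "place(object, location)",
                    "open(container)", "navigate_to(location)"]

def build_prompt(scene_objects, trajectory_snippet):
    """Describe detected objects and the user's recent motion, then ask the
    VLM to pick the most likely intended skill with a confidence score."""
    waypoints = "; ".join(f"({x:.2f}, {y:.2f}, {z:.2f})"
                          for x, y, z in trajectory_snippet)
    return (
        "Detected objects: " + ", ".join(scene_objects) + "\n"
        "Recent end-effector waypoints (x, y, z): " + waypoints + "\n"
        "Candidate skills: " + ", ".join(CANDIDATE_SKILLS) + "\n"
        "Using commonsense reasoning, reply with JSON containing "
        '"skill", "arguments", and "confidence" (0-1).'
    )

def infer_intent(query_vlm, scene_objects, trajectory_snippet, threshold=0.8):
    """Query the VLM; hand control to a skill only when it is confident,
    otherwise keep the human in direct teleoperation."""
    reply = query_vlm(build_prompt(scene_objects, trajectory_snippet))
    try:
        intent = json.loads(reply)
    except json.JSONDecodeError:
        return None  # a real system would re-prompt or fall back here
    if intent.get("confidence", 0.0) >= threshold:
        return intent  # dispatch to the skill library
    return None        # stay in direct teleoperation

A caller would pass its own query_vlm function (wrapping whatever VLM endpoint is in use) along with the current object detections and a short window of end-effector waypoints; returning None keeps the user in full manual control.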

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-liu25d,
  title     = {CASPER: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models},
  author    = {Liu, Huihan and Shah, Rutav and Liu, Shuijing and Pittenger, Jack and Seo, Mingyo and Cui, Yuchen and Bisk, Yonatan and Mart\'{i}n-Mart\'{i}n, Roberto and Zhu, Yuke},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {2462--2483},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/liu25d/liu25d.pdf},
  url       = {https://proceedings.mlr.press/v305/liu25d.html},
  abstract  = {Assistive teleoperation, where control is shared between a human and a robot, enables efficient and intuitive human-robot collaboration in diverse and unstructured environments. A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intentions from user control inputs and to assist users with correct actions. Existing methods are either confined to simple, predefined scenarios or restricted to task-specific data distributions at training, limiting their support for real-world assistance. We introduce Casper, an assistive teleoperation system that leverages commonsense knowledge embedded in pre-trained visual language models (VLMs) for real-time intent inference and flexible skill execution. Casper incorporates an open-world perception module for a generalized understanding of novel objects and scenes, a VLM-powered intent inference mechanism that leverages commonsense reasoning to interpret snippets of teleoperated user input, and a skill library that expands the scope of prior assistive teleoperation systems to support diverse, long-horizon mobile manipulation tasks. Extensive empirical evaluation, including human studies and system ablations, demonstrates that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines. More information is available at https://casper-corl25.github.io/}
}
Endnote
%0 Conference Paper
%T CASPER: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models
%A Huihan Liu
%A Rutav Shah
%A Shuijing Liu
%A Jack Pittenger
%A Mingyo Seo
%A Yuchen Cui
%A Yonatan Bisk
%A Roberto Martín-Martín
%A Yuke Zhu
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-liu25d
%I PMLR
%P 2462--2483
%U https://proceedings.mlr.press/v305/liu25d.html
%V 305
%X Assistive teleoperation, where control is shared between a human and a robot, enables efficient and intuitive human-robot collaboration in diverse and unstructured environments. A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intentions from user control inputs and to assist users with correct actions. Existing methods are either confined to simple, predefined scenarios or restricted to task-specific data distributions at training, limiting their support for real-world assistance. We introduce Casper, an assistive teleoperation system that leverages commonsense knowledge embedded in pre-trained visual language models (VLMs) for real-time intent inference and flexible skill execution. Casper incorporates an open-world perception module for a generalized understanding of novel objects and scenes, a VLM-powered intent inference mechanism that leverages commonsense reasoning to interpret snippets of teleoperated user input, and a skill library that expands the scope of prior assistive teleoperation systems to support diverse, long-horizon mobile manipulation tasks. Extensive empirical evaluation, including human studies and system ablations, demonstrates that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines. More information is available at https://casper-corl25.github.io/
APA
Liu, H., Shah, R., Liu, S., Pittenger, J., Seo, M., Cui, Y., Bisk, Y., Martín-Martín, R. & Zhu, Y. (2025). CASPER: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:2462-2483. Available from https://proceedings.mlr.press/v305/liu25d.html.