KISA: A Unified Keyframe Identifier and Skill Annotator for Long-Horizon Robotics Demonstrations

Longxin Kou, Fei Ni, Yan Zheng, Jinyi Liu, Yifu Yuan, Zibin Dong, Jianye Hao
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:25441-25474, 2024.

Abstract

Robotic manipulation tasks often span long horizons and encapsulate multiple subtasks requiring different skills. Learning policies directly from long-horizon demonstrations is challenging without intermediate keyframe guidance and corresponding skill annotations. Existing approaches to keyframe identification often struggle to offer reliable decomposition due to low accuracy and fail to capture the semantic relevance between keyframes and skills. To this end, we propose a unified Keyframe Identifier and Skill Annotator (KISA) that leverages pretrained visual-language representations for the precise and interpretable decomposition of unlabeled demonstrations. Specifically, we develop a simple yet effective temporal enhancement module that enriches frame-level representations with expanded receptive fields to capture semantic dynamics at the video level. We further propose coarse contrastive learning and fine-grained monotonic encouragement to strengthen the alignment between visual representations of keyframes and language representations of skills. Experimental results across three benchmarks demonstrate that KISA outperforms competitive baselines in both the accuracy and interpretability of keyframe identification. Moreover, KISA exhibits robust generalization and the flexibility to incorporate various pretrained representations.
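As a rough illustration of the ingredients the abstract names, the PyTorch sketch below pairs a dilated temporal convolution stack (widening each frame embedding's receptive field) with a symmetric InfoNCE loss between pooled video features and skill-text features, plus one plausible form of the monotonic term that nudges frames to progress through skills in order. All module names, dimensions, and the exact form of each loss are illustrative assumptions, not the authors' released implementation.

# Illustrative sketch only -- shapes, names, and loss forms are
# assumptions, not the authors' released KISA implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalEnhancement(nn.Module):
    """Widen the temporal receptive field of per-frame embeddings with
    stacked dilated 1D convolutions, keeping the sequence length fixed."""
    def __init__(self, dim=512, kernel_size=3, dilations=(1, 2, 4)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size,
                      padding=d * (kernel_size - 1) // 2, dilation=d)
            for d in dilations
        ])

    def forward(self, frames):
        # frames: (batch, time, dim) features from a frozen encoder, e.g. CLIP
        x = frames.transpose(1, 2)       # (batch, dim, time) for Conv1d
        for conv in self.convs:
            x = x + F.relu(conv(x))      # residual keeps frame-level semantics
        return x.transpose(1, 2)         # back to (batch, time, dim)

def coarse_contrastive_loss(video_emb, skill_emb, temperature=0.07):
    """Symmetric InfoNCE between pooled video clips and skill-text
    embeddings; matched (video, skill) pairs sit on the diagonal."""
    v = F.normalize(video_emb, dim=-1)
    s = F.normalize(skill_emb, dim=-1)
    logits = v @ s.t() / temperature
    targets = torch.arange(len(v), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def monotonic_encouragement(frame_emb, skill_emb):
    """One plausible fine-grained term: compute each frame's soft skill
    index and penalize any decrease between consecutive frames, so a
    demonstration is encouraged to move through its skills in order."""
    sim = F.normalize(frame_emb, dim=-1) @ F.normalize(skill_emb, dim=-1).t()
    idx = torch.arange(sim.size(1), device=sim.device, dtype=sim.dtype)
    soft_idx = (sim.softmax(dim=-1) * idx).sum(-1)   # (time,)
    return F.relu(soft_idx[:-1] - soft_idx[1:]).mean()

# Hypothetical usage: enhance frozen frame features, then combine losses.
# The 0.1 weight is a guess, not a value reported in the paper.
# enhancer = TemporalEnhancement(dim=512)
# frames = enhancer(clip_features)                       # (B, T, 512)
# loss = coarse_contrastive_loss(frames.mean(1), skill_text_emb) \
#        + 0.1 * monotonic_encouragement(frames[0], skill_text_emb)

In this sketch the frame features would come from a frozen visual-language encoder, with only the lightweight temporal module (and any projection heads) trained.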

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-kou24b, title = {{KISA}: A Unified Keyframe Identifier and Skill Annotator for Long-Horizon Robotics Demonstrations}, author = {Kou, Longxin and Ni, Fei and Zheng, Yan and Liu, Jinyi and Yuan, Yifu and Dong, Zibin and Hao, Jianye}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {25441--25474}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/kou24b/kou24b.pdf}, url = {https://proceedings.mlr.press/v235/kou24b.html}, abstract = {Robotic manipulation tasks often span long horizons and encapsulate multiple subtasks requiring different skills. Learning policies directly from long-horizon demonstrations is challenging without intermediate keyframe guidance and corresponding skill annotations. Existing approaches to keyframe identification often struggle to offer reliable decomposition due to low accuracy and fail to capture the semantic relevance between keyframes and skills. To this end, we propose a unified Keyframe Identifier and Skill Annotator (KISA) that leverages pretrained visual-language representations for the precise and interpretable decomposition of unlabeled demonstrations. Specifically, we develop a simple yet effective temporal enhancement module that enriches frame-level representations with expanded receptive fields to capture semantic dynamics at the video level. We further propose coarse contrastive learning and fine-grained monotonic encouragement to strengthen the alignment between visual representations of keyframes and language representations of skills. Experimental results across three benchmarks demonstrate that KISA outperforms competitive baselines in both the accuracy and interpretability of keyframe identification. Moreover, KISA exhibits robust generalization and the flexibility to incorporate various pretrained representations.} }
Endnote
%0 Conference Paper %T KISA: A Unified Keyframe Identifier and Skill Annotator for Long-Horizon Robotics Demonstrations %A Longxin Kou %A Fei Ni %A Yan Zheng %A Jinyi Liu %A Yifu Yuan %A Zibin Dong %A Jianye Hao %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-kou24b %I PMLR %P 25441--25474 %U https://proceedings.mlr.press/v235/kou24b.html %V 235 %X Robotic manipulation tasks often span long horizons and encapsulate multiple subtasks requiring different skills. Learning policies directly from long-horizon demonstrations is challenging without intermediate keyframe guidance and corresponding skill annotations. Existing approaches to keyframe identification often struggle to offer reliable decomposition due to low accuracy and fail to capture the semantic relevance between keyframes and skills. To this end, we propose a unified Keyframe Identifier and Skill Annotator (KISA) that leverages pretrained visual-language representations for the precise and interpretable decomposition of unlabeled demonstrations. Specifically, we develop a simple yet effective temporal enhancement module that enriches frame-level representations with expanded receptive fields to capture semantic dynamics at the video level. We further propose coarse contrastive learning and fine-grained monotonic encouragement to strengthen the alignment between visual representations of keyframes and language representations of skills. Experimental results across three benchmarks demonstrate that KISA outperforms competitive baselines in both the accuracy and interpretability of keyframe identification. Moreover, KISA exhibits robust generalization and the flexibility to incorporate various pretrained representations.
APA
Kou, L., Ni, F., Zheng, Y., Liu, J., Yuan, Y., Dong, Z. & Hao, J. (2024). KISA: A Unified Keyframe Identifier and Skill Annotator for Long-Horizon Robotics Demonstrations. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:25441-25474. Available from https://proceedings.mlr.press/v235/kou24b.html.