VinT-6D: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception

Zhaoliang Wan, Yonggen Ling, Senlin Yi, Lu Qi, Wang Wei Lee, Minglei Lu, Sicheng Yang, Xiao Teng, Peng Lu, Xu Yang, Ming-Hsuan Yang, Hui Cheng
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:49921-49940, 2024.

Abstract

This paper addresses the scarcity of large-scale datasets for accurate object-in-hand pose estimation, which is crucial for robotic in-hand manipulation within the "Perception-Planning-Control" paradigm. Specifically, we introduce VinT-6D, the first extensive multi-modal dataset integrating vision, touch, and proprioception, to enhance robotic manipulation. VinT-6D comprises 2 million VinT-Sim and 0.1 million VinT-Real entries, collected via simulations in MuJoCo and Blender and a custom-designed real-world platform. The dataset is tailored for robotic hands, providing whole-hand tactile perception together with high-quality, well-aligned data. To the best of our knowledge, VinT-Real is the largest real-world collection of its kind, given the difficulty of gathering such data in real environments, and it therefore narrows the simulation-to-real gap left by previous works. Built upon VinT-6D, we present a benchmark method that achieves significant performance improvements by fusing multi-modal information. The project is available at https://VinT-6D.github.io/.
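To make the dataset description concrete, the following is a minimal, hypothetical Python/PyTorch sketch of how an aligned vision-touch-proprioception sample might be represented and fed to a simple late-fusion pose regressor. The names VinTEntry, LateFusionPoseNet, and the fields rgb, tactile, joint_angles, and obj_pose are illustrative assumptions, not the actual VinT-6D schema or the paper's benchmark architecture; consult the project page for the real data format.

# Illustrative sketch only: field names, shapes, and the toy fusion network
# below are assumptions for exposition, not the dataset's actual schema.
from dataclasses import dataclass

import numpy as np
import torch
import torch.nn as nn


@dataclass
class VinTEntry:
    """One hypothetical object-in-hand sample with aligned modalities."""
    rgb: np.ndarray           # (H, W, 3) camera image
    tactile: np.ndarray       # (T,) flattened whole-hand tactile readings
    joint_angles: np.ndarray  # (J,) hand proprioception
    obj_pose: np.ndarray      # (4, 4) ground-truth object pose in the camera frame


class LateFusionPoseNet(nn.Module):
    """Toy late-fusion regressor: encode each modality, concatenate, regress pose."""

    def __init__(self, tactile_dim: int, joint_dim: int, feat_dim: int = 128):
        super().__init__()
        self.vision_enc = nn.Sequential(  # stand-in for a CNN/ViT backbone
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim),
        )
        self.touch_enc = nn.Sequential(nn.Linear(tactile_dim, feat_dim), nn.ReLU())
        self.proprio_enc = nn.Sequential(nn.Linear(joint_dim, feat_dim), nn.ReLU())
        # 3 translation + 6D rotation parameters = 9 outputs
        self.head = nn.Linear(3 * feat_dim, 9)

    def forward(self, rgb, tactile, joints):
        fused = torch.cat(
            [self.vision_enc(rgb), self.touch_enc(tactile), self.proprio_enc(joints)],
            dim=-1,
        )
        return self.head(fused)  # predicted pose parameters


# Smoke test: build one dummy entry and run a forward pass.
entry = VinTEntry(
    rgb=np.zeros((128, 128, 3), dtype=np.float32),
    tactile=np.zeros(64, dtype=np.float32),
    joint_angles=np.zeros(16, dtype=np.float32),
    obj_pose=np.eye(4, dtype=np.float32),
)
net = LateFusionPoseNet(tactile_dim=64, joint_dim=16)
rgb = torch.from_numpy(entry.rgb).permute(2, 0, 1).unsqueeze(0)   # (1, 3, 128, 128)
touch = torch.from_numpy(entry.tactile).unsqueeze(0)              # (1, 64)
joints = torch.from_numpy(entry.joint_angles).unsqueeze(0)        # (1, 16)
print(net(rgb, touch, joints).shape)  # torch.Size([1, 9])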

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-wan24d,
  title     = {{V}in{T}-6{D}: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception},
  author    = {Wan, Zhaoliang and Ling, Yonggen and Yi, Senlin and Qi, Lu and Lee, Wang Wei and Lu, Minglei and Yang, Sicheng and Teng, Xiao and Lu, Peng and Yang, Xu and Yang, Ming-Hsuan and Cheng, Hui},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {49921--49940},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/wan24d/wan24d.pdf},
  url       = {https://proceedings.mlr.press/v235/wan24d.html},
  abstract  = {This paper addresses the scarcity of large-scale datasets for accurate object-in-hand pose estimation, which is crucial for robotic in-hand manipulation within the "Perception-Planning-Control" paradigm. Specifically, we introduce VinT-6D, the first extensive multi-modal dataset integrating vision, touch, and proprioception, to enhance robotic manipulation. VinT-6D comprises 2 million VinT-Sim and 0.1 million VinT-Real entries, collected via simulations in MuJoCo and Blender and a custom-designed real-world platform. The dataset is tailored for robotic hands, providing whole-hand tactile perception together with high-quality, well-aligned data. To the best of our knowledge, VinT-Real is the largest real-world collection of its kind, given the difficulty of gathering such data in real environments, and it therefore narrows the simulation-to-real gap left by previous works. Built upon VinT-6D, we present a benchmark method that achieves significant performance improvements by fusing multi-modal information. The project is available at https://VinT-6D.github.io/.}
}
Endnote
%0 Conference Paper
%T VinT-6D: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception
%A Zhaoliang Wan
%A Yonggen Ling
%A Senlin Yi
%A Lu Qi
%A Wang Wei Lee
%A Minglei Lu
%A Sicheng Yang
%A Xiao Teng
%A Peng Lu
%A Xu Yang
%A Ming-Hsuan Yang
%A Hui Cheng
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-wan24d
%I PMLR
%P 49921--49940
%U https://proceedings.mlr.press/v235/wan24d.html
%V 235
%X This paper addresses the scarcity of large-scale datasets for accurate object-in-hand pose estimation, which is crucial for robotic in-hand manipulation within the "Perception-Planning-Control" paradigm. Specifically, we introduce VinT-6D, the first extensive multi-modal dataset integrating vision, touch, and proprioception, to enhance robotic manipulation. VinT-6D comprises 2 million VinT-Sim and 0.1 million VinT-Real entries, collected via simulations in MuJoCo and Blender and a custom-designed real-world platform. The dataset is tailored for robotic hands, providing whole-hand tactile perception together with high-quality, well-aligned data. To the best of our knowledge, VinT-Real is the largest real-world collection of its kind, given the difficulty of gathering such data in real environments, and it therefore narrows the simulation-to-real gap left by previous works. Built upon VinT-6D, we present a benchmark method that achieves significant performance improvements by fusing multi-modal information. The project is available at https://VinT-6D.github.io/.
APA
Wan, Z., Ling, Y., Yi, S., Qi, L., Lee, W.W., Lu, M., Yang, S., Teng, X., Lu, P., Yang, X., Yang, M.-H., & Cheng, H. (2024). VinT-6D: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:49921-49940. Available from https://proceedings.mlr.press/v235/wan24d.html.