A Touch, Vision, and Language Dataset for Multimodal Alignment

Letian Fu; Gaurav Datta; Huang Huang; William Chung-Ho Panitch; Jaimyn Drake; Joseph Ortiz; Mustafa Mukadam; Mike Lambeta; Roberto Calandra; Ken Goldberg

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:14080-14101, 2024.

Abstract

Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) tactile-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code, checkpoints and data are available on https://tactile-vlm.github.io.

Cite this Paper

BibTeX

@InProceedings{pmlr-v235-fu24b,
  title = 	 {A Touch, Vision, and Language Dataset for Multimodal Alignment},
  author =       {Fu, Letian and Datta, Gaurav and Huang, Huang and Panitch, William Chung-Ho and Drake, Jaimyn and Ortiz, Joseph and Mukadam, Mustafa and Lambeta, Mike and Calandra, Roberto and Goldberg, Ken},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {14080--14101},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/fu24b/fu24b.pdf},
  url = 	 {https://proceedings.mlr.press/v235/fu24b.html},
  abstract = 	 {Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) tactile-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code, checkpoints and data are available on https://tactile-vlm.github.io.}
}

Endnote

%0 Conference Paper
%T A Touch, Vision, and Language Dataset for Multimodal Alignment
%A Letian Fu
%A Gaurav Datta
%A Huang Huang
%A William Chung-Ho Panitch
%A Jaimyn Drake
%A Joseph Ortiz
%A Mustafa Mukadam
%A Mike Lambeta
%A Roberto Calandra
%A Ken Goldberg
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-fu24b
%I PMLR
%P 14080--14101
%U https://proceedings.mlr.press/v235/fu24b.html
%V 235
%X Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) tactile-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code, checkpoints and data are available on https://tactile-vlm.github.io.

APA

Fu, L., Datta, G., Huang, H., Panitch, W.C., Drake, J., Ortiz, J., Mukadam, M., Lambeta, M., Calandra, R. & Goldberg, K. (2024). A Touch, Vision, and Language Dataset for Multimodal Alignment. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:14080-14101. Available from https://proceedings.mlr.press/v235/fu24b.html.
