Sparsh: Self-supervised touch representations for vision-based tactile sensing

Carolina Higuera, Akash Sharma, Chaithanya Krishna Bodduluri, Taosha Fan, Patrick Lancaster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, Mustafa Mukadam
Proceedings of The 8th Conference on Robot Learning, PMLR 270:885-915, 2025.

Abstract

In this work, we introduce general-purpose touch representations for the increasingly accessible class of vision-based tactile sensors. Such sensors have led to many recent advances in robot manipulation as they markedly complement vision, yet solutions today often rely on task- and sensor-specific handcrafted perception models. Collecting real data at scale with task-centric ground-truth labels, like contact forces and slip, is a challenge further compounded by sensors of various form factors differing in aspects like lighting and gel markings. To tackle this, we turn to self-supervised learning (SSL), which has demonstrated remarkable performance in computer vision. We present Sparsh, a family of SSL models that can support various vision-based tactile sensors, alleviating the need for custom labels through pre-training on 460k+ tactile images with masking and self-distillation in pixel and latent spaces. We also build TacBench to facilitate standardized benchmarking across sensors and models; it comprises six tasks ranging from comprehending tactile properties to enabling physical perception and manipulation planning. In evaluations, we find that SSL pre-training for touch representations outperforms task- and sensor-specific end-to-end training by 95.1% on average over TacBench, and Sparsh (DINO) and Sparsh (IJEPA) are the most competitive, indicating the merits of learning in latent space for tactile images. Project page: https://sparsh-ssl.github.io
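The abstract describes pre-training a touch encoder with SSL and then evaluating the frozen representations on downstream tasks such as force estimation. The sketch below illustrates that general pattern only: a frozen encoder with a trainable linear probe on top. The encoder stub, embedding dimension, and 3-axis force head are illustrative assumptions and do not reproduce Sparsh's actual architecture, weights, or API.

# Minimal sketch (assumptions): probing frozen SSL touch representations with a
# lightweight task head. FrozenTouchEncoder is a stand-in for a pretrained
# ViT-style backbone; in practice one would load real pretrained weights here.
import torch
import torch.nn as nn

class FrozenTouchEncoder(nn.Module):
    """Placeholder for a pretrained touch encoder (e.g., a DINO/IJEPA-style ViT)."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),  # patchify stub
            nn.Flatten(start_dim=2),                              # (B, D, N_patches)
        )
        for p in self.parameters():
            p.requires_grad = False  # representations stay frozen downstream

    def forward(self, x):
        tokens = self.backbone(x)      # (B, D, N)
        return tokens.mean(dim=-1)     # globally pooled representation (B, D)

class LinearProbe(nn.Module):
    """Task head trained on top of frozen representations (e.g., force regression)."""
    def __init__(self, embed_dim=768, out_dim=3):
        super().__init__()
        self.head = nn.Linear(embed_dim, out_dim)

    def forward(self, z):
        return self.head(z)

if __name__ == "__main__":
    encoder = FrozenTouchEncoder()
    probe = LinearProbe(out_dim=3)                  # hypothetical 3-axis force target
    tactile_batch = torch.randn(8, 3, 224, 224)     # batch of RGB tactile images
    with torch.no_grad():
        z = encoder(tactile_batch)                  # frozen features
    pred_forces = probe(z)                          # only the probe is trainable
    print(pred_forces.shape)                        # torch.Size([8, 3])

Under this protocol, only the small probe is optimized per task, which is what makes the comparison between different pre-trained representations (and against end-to-end baselines) standardized.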

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-higuera25a,
  title     = {Sparsh: Self-supervised touch representations for vision-based tactile sensing},
  author    = {Higuera, Carolina and Sharma, Akash and Bodduluri, Chaithanya Krishna and Fan, Taosha and Lancaster, Patrick and Kalakrishnan, Mrinal and Kaess, Michael and Boots, Byron and Lambeta, Mike and Wu, Tingfan and Mukadam, Mustafa},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {885--915},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/higuera25a/higuera25a.pdf},
  url       = {https://proceedings.mlr.press/v270/higuera25a.html},
  abstract  = {In this work, we introduce general purpose touch representations for the increasingly accessible class of vision-based tactile sensors. Such sensors have led to many recent advances in robot manipulation as they markedly complement vision, yet solutions today often rely on task and sensor specific handcrafted perception models. Collecting real data at scale with task centric ground truth labels, like contact forces and slip, is a challenge further compounded by sensors of various form factor differing in aspects like lighting and gel markings. To tackle this, we turn to self-supervised learning (SSL) that has demonstrated remarkable performance in computer vision. We present Sparsh, a family of SSL models that can support various vision-based tactile sensors, alleviating the need for custom labels through pre-training on 460k+ tactile images with masking and self-distillation in pixel and latent spaces. We also build TacBench, to facilitate standardized benchmarking across sensors and models, comprising of six tasks ranging from comprehending tactile properties to enabling physical perception and manipulation planning. In evaluations, we find that SSL pre-training for touch representation outperforms task and sensor-specific end-to-end training by 95.1% on average over TacBench, and Sparsh (DINO) and Sparsh (IJEPA) are the most competitive, indicating the merits of learning in latent space for tactile images. Project page: https://sparsh-ssl.github.io}
}
Endnote
%0 Conference Paper
%T Sparsh: Self-supervised touch representations for vision-based tactile sensing
%A Carolina Higuera
%A Akash Sharma
%A Chaithanya Krishna Bodduluri
%A Taosha Fan
%A Patrick Lancaster
%A Mrinal Kalakrishnan
%A Michael Kaess
%A Byron Boots
%A Mike Lambeta
%A Tingfan Wu
%A Mustafa Mukadam
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-higuera25a
%I PMLR
%P 885--915
%U https://proceedings.mlr.press/v270/higuera25a.html
%V 270
%X In this work, we introduce general purpose touch representations for the increasingly accessible class of vision-based tactile sensors. Such sensors have led to many recent advances in robot manipulation as they markedly complement vision, yet solutions today often rely on task and sensor specific handcrafted perception models. Collecting real data at scale with task centric ground truth labels, like contact forces and slip, is a challenge further compounded by sensors of various form factor differing in aspects like lighting and gel markings. To tackle this, we turn to self-supervised learning (SSL) that has demonstrated remarkable performance in computer vision. We present Sparsh, a family of SSL models that can support various vision-based tactile sensors, alleviating the need for custom labels through pre-training on 460k+ tactile images with masking and self-distillation in pixel and latent spaces. We also build TacBench, to facilitate standardized benchmarking across sensors and models, comprising of six tasks ranging from comprehending tactile properties to enabling physical perception and manipulation planning. In evaluations, we find that SSL pre-training for touch representation outperforms task and sensor-specific end-to-end training by 95.1% on average over TacBench, and Sparsh (DINO) and Sparsh (IJEPA) are the most competitive, indicating the merits of learning in latent space for tactile images. Project page: https://sparsh-ssl.github.io
APA
Higuera, C., Sharma, A., Bodduluri, C. K., Fan, T., Lancaster, P., Kalakrishnan, M., Kaess, M., Boots, B., Lambeta, M., Wu, T., & Mukadam, M. (2025). Sparsh: Self-supervised touch representations for vision-based tactile sensing. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:885-915. Available from https://proceedings.mlr.press/v270/higuera25a.html.
