Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment

Arushi Gupta, Rafal Dariusz Kocielnik, Jiayun Wang, Firdavs Nasriddinov, Cherine Yang, Elyssa Wong, Anima Anandkumar, Andrew Hung
Proceedings of the 4th Machine Learning for Health Symposium, PMLR 259:440-455, 2025.

Abstract

During surgical training, real-time feedback from trainers to trainees is important for preventing errors and enhancing long-term skill acquisition. Accurately predicting the effectiveness of this feedback, specifically whether it leads to a change in trainee behavior, is crucial for developing methods for improving surgical training and education. However, relying on human annotations to assess feedback effectiveness is laborious and prone to biases, underscoring the need for an automated, scalable, and objective method. Creating such an automated system poses challenges, as it requires an understanding of both the verbal feedback delivered by the trainer and the visual context of the real-time surgical scene. To address this, we propose a method that integrates information from transcribed verbal feedback and corresponding surgical video to predict feedback effectiveness. Our findings show that both transcribed feedback and surgical video are individually predictive of trainee behavior changes, and their combination achieves an AUROC of 0.70 ± 0.02, improving prediction accuracy by up to 6.6%. Additionally, we introduce self-supervised fine-tuning as a strategy for enhancing surgical video representation learning, which is scalable and further enhances prediction performance. Our results demonstrate the potential of multi-modal learning to advance the automated assessment of surgical feedback.
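
At a high level, the method fuses a text representation of the transcribed trainer feedback with a video representation of the surgical scene to predict a binary outcome (trainee behavior change), reported via AUROC. The sketch below illustrates one way such late fusion could look; it is a minimal PyTorch illustration, and the class name, embedding dimensions, and fusion head are assumptions for exposition, not the authors' published architecture.

    import torch
    import torch.nn as nn

    class FeedbackEffectivenessClassifier(nn.Module):
        """Hypothetical late-fusion head: concatenates a text embedding of
        the transcribed trainer feedback with a video embedding of the
        surgical scene, then emits a logit for 'trainee behavior change'."""

        def __init__(self, text_dim: int = 768, video_dim: int = 512,
                     hidden_dim: int = 256):
            super().__init__()
            self.fusion = nn.Sequential(
                nn.Linear(text_dim + video_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.3),
                nn.Linear(hidden_dim, 1),  # single logit for the binary label
            )

        def forward(self, text_emb: torch.Tensor,
                    video_emb: torch.Tensor) -> torch.Tensor:
            # text_emb:  (batch, text_dim), e.g. from a pretrained language encoder
            # video_emb: (batch, video_dim), e.g. from a video encoder fine-tuned
            # with a self-supervised objective, as the abstract suggests
            fused = torch.cat([text_emb, video_emb], dim=-1)
            return self.fusion(fused).squeeze(-1)

    # Toy usage with random tensors standing in for real encoder outputs.
    model = FeedbackEffectivenessClassifier()
    logits = model(torch.randn(4, 768), torch.randn(4, 512))
    probs = torch.sigmoid(logits)  # P(feedback leads to a behavior change)

Evaluating such a model with the abstract's metric would then reduce to computing AUROC over held-out predictions, e.g. with sklearn.metrics.roc_auc_score(labels, probs.detach().numpy()).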

Cite this Paper


BibTeX
@InProceedings{pmlr-v259-gupta25a,
  title     = {Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment},
  author    = {Gupta, Arushi and Kocielnik, Rafal Dariusz and Wang, Jiayun and Nasriddinov, Firdavs and Yang, Cherine and Wong, Elyssa and Anandkumar, Anima and Hung, Andrew},
  booktitle = {Proceedings of the 4th Machine Learning for Health Symposium},
  pages     = {440--455},
  year      = {2025},
  editor    = {Hegselmann, Stefan and Zhou, Helen and Healey, Elizabeth and Chang, Trenton and Ellington, Caleb and Mhasawade, Vishwali and Tonekaboni, Sana and Argaw, Peniel and Zhang, Haoran},
  volume    = {259},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--16 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v259/main/assets/gupta25a/gupta25a.pdf},
  url       = {https://proceedings.mlr.press/v259/gupta25a.html}
}
Endnote
%0 Conference Paper
%T Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment
%A Arushi Gupta
%A Rafal Dariusz Kocielnik
%A Jiayun Wang
%A Firdavs Nasriddinov
%A Cherine Yang
%A Elyssa Wong
%A Anima Anandkumar
%A Andrew Hung
%B Proceedings of the 4th Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2025
%E Stefan Hegselmann
%E Helen Zhou
%E Elizabeth Healey
%E Trenton Chang
%E Caleb Ellington
%E Vishwali Mhasawade
%E Sana Tonekaboni
%E Peniel Argaw
%E Haoran Zhang
%F pmlr-v259-gupta25a
%I PMLR
%P 440--455
%U https://proceedings.mlr.press/v259/gupta25a.html
%V 259
APA
Gupta, A., Kocielnik, R.D., Wang, J., Nasriddinov, F., Yang, C., Wong, E., Anandkumar, A. & Hung, A. (2025). Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment. Proceedings of the 4th Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 259:440-455. Available from https://proceedings.mlr.press/v259/gupta25a.html.
