Deep Multimodal Fusion for Surgical Feedback Classification

Rafal Kocielnik, Elyssa Y. Wong, Timothy N. Chu, Lydia Lin, De-An Huang, Jiayun Wang, Anima Anandkumar, Andrew J. Hung
Proceedings of the 3rd Machine Learning for Health Symposium, PMLR 225:256-267, 2023.

Abstract

Quantification of real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvement in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., through visual cues like pointing to anatomic elements). In this work, we leverage a clinically-validated five-category classification of surgical feedback: “Anatomic”, “Technical”, “Procedural”, “Praise” and “Visual Aid”. We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale. Our automated classification of surgical feedback achieves AUCs ranging from 71.5 to 77.6, with fusion improving performance by 3.1%. We also show that high-quality manual transcriptions of feedback audio from experts improve AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future improvements. Empirically, we find that a Staged training strategy, which first pre-trains each modality separately and then trains them jointly, is more effective than training all modalities together. We also present intuitive findings on the importance of modalities for different feedback categories. This work offers an important first look at the feasibility of automated classification of real-world live surgical feedback based on text, audio, and video modalities.
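
The abstract describes a multi-label classifier over the five feedback categories, fed by text, audio, and video encoders that are first pre-trained separately and then fused and fine-tuned jointly (the "Staged" strategy). The following is a minimal PyTorch sketch of that two-stage, late-fusion recipe, not the authors' implementation: the MLP encoders stand in for whatever text/audio/video backbones the paper uses, concatenation fusion and all hyperparameters are illustrative assumptions, and the data loaders are hypothetical.

# Minimal sketch (assumptions noted above), not the paper's actual model.
import torch
import torch.nn as nn

NUM_CLASSES = 5  # Anatomic, Technical, Procedural, Praise, Visual Aid


class ModalityBranch(nn.Module):
    """One modality: a placeholder encoder plus a per-modality head used
    during the separate pre-training stage (Stage 1)."""
    def __init__(self, in_dim: int, emb_dim: int = 256):
        super().__init__()
        # In practice this would be a pretrained text/audio/video backbone;
        # a small MLP stands in for it here.
        self.encoder = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU())
        self.head = nn.Linear(emb_dim, NUM_CLASSES)

    def forward(self, x):
        z = self.encoder(x)
        return z, self.head(z)


class LateFusionClassifier(nn.Module):
    """Stage 2: concatenate the modality embeddings and train a joint head."""
    def __init__(self, branches: dict, emb_dim: int = 256):
        super().__init__()
        self.branches = nn.ModuleDict(branches)
        self.fusion_head = nn.Linear(emb_dim * len(branches), NUM_CLASSES)

    def forward(self, inputs: dict):
        # Re-use the pre-trained encoders, dropping the per-modality heads.
        embeddings = [self.branches[m].encoder(inputs[m]) for m in self.branches]
        return self.fusion_head(torch.cat(embeddings, dim=-1))


def train_staged(branches, loaders, fused_loader, epochs=1):
    """Stage 1: pre-train each branch on its own modality.
    Stage 2: fine-tune the fused model jointly.
    Multi-label targets (five non-exclusive categories) -> BCEWithLogitsLoss."""
    criterion = nn.BCEWithLogitsLoss()
    for name, branch in branches.items():
        opt = torch.optim.Adam(branch.parameters(), lr=1e-4)
        for _ in range(epochs):
            for x, y in loaders[name]:          # y: float tensor, shape (batch, NUM_CLASSES)
                _, logits = branch(x)
                loss = criterion(logits, y)
                opt.zero_grad(); loss.backward(); opt.step()

    fused = LateFusionClassifier(branches)
    opt = torch.optim.Adam(fused.parameters(), lr=1e-5)
    for _ in range(epochs):
        for inputs, y in fused_loader:          # inputs: dict of modality tensors
            loss = criterion(fused(inputs), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return fused

The sketch only illustrates the two-stage idea and the multi-label (BCE) objective implied by the five categories; the paper's actual encoders, fusion mechanism, and training details may differ.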

Cite this Paper


BibTeX
@InProceedings{pmlr-v225-kocielnik23a,
  title     = {Deep Multimodal Fusion for Surgical Feedback Classification},
  author    = {Kocielnik, Rafal and Wong, Elyssa Y. and Chu, Timothy N. and Lin, Lydia and Huang, De-An and Wang, Jiayun and Anandkumar, Anima and Hung, Andrew J.},
  booktitle = {Proceedings of the 3rd Machine Learning for Health Symposium},
  pages     = {256--267},
  year      = {2023},
  editor    = {Hegselmann, Stefan and Parziale, Antonio and Shanmugam, Divya and Tang, Shengpu and Asiedu, Mercy Nyamewaa and Chang, Serina and Hartvigsen, Tom and Singh, Harvineet},
  volume    = {225},
  series    = {Proceedings of Machine Learning Research},
  month     = {10 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v225/kocielnik23a/kocielnik23a.pdf},
  url       = {https://proceedings.mlr.press/v225/kocielnik23a.html},
  abstract  = {Quantification of real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvement in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., through visual cues like pointing to anatomic elements). In this work, we leverage a clinically-validated five-category classification of surgical feedback: “Anatomic”, “Technical”, “Procedural”, “Praise” and “Visual Aid”. We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale. Our automated classification of surgical feedback achieves AUCs ranging from 71.5 to 77.6, with fusion improving performance by 3.1\%. We also show that high-quality manual transcriptions of feedback audio from experts improve AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future improvements. Empirically, we find that a Staged training strategy, which first pre-trains each modality separately and then trains them jointly, is more effective than training all modalities together. We also present intuitive findings on the importance of modalities for different feedback categories. This work offers an important first look at the feasibility of automated classification of real-world live surgical feedback based on text, audio, and video modalities.}
}
Endnote
%0 Conference Paper
%T Deep Multimodal Fusion for Surgical Feedback Classification
%A Rafal Kocielnik
%A Elyssa Y. Wong
%A Timothy N. Chu
%A Lydia Lin
%A De-An Huang
%A Jiayun Wang
%A Anima Anandkumar
%A Andrew J. Hung
%B Proceedings of the 3rd Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2023
%E Stefan Hegselmann
%E Antonio Parziale
%E Divya Shanmugam
%E Shengpu Tang
%E Mercy Nyamewaa Asiedu
%E Serina Chang
%E Tom Hartvigsen
%E Harvineet Singh
%F pmlr-v225-kocielnik23a
%I PMLR
%P 256--267
%U https://proceedings.mlr.press/v225/kocielnik23a.html
%V 225
%X Quantification of real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvement in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., through visual cues like pointing to anatomic elements). In this work, we leverage a clinically-validated five-category classification of surgical feedback: “Anatomic”, “Technical”, “Procedural”, “Praise” and “Visual Aid”. We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale. Our automated classification of surgical feedback achieves AUCs ranging from 71.5 to 77.6, with fusion improving performance by 3.1%. We also show that high-quality manual transcriptions of feedback audio from experts improve AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future improvements. Empirically, we find that a Staged training strategy, which first pre-trains each modality separately and then trains them jointly, is more effective than training all modalities together. We also present intuitive findings on the importance of modalities for different feedback categories. This work offers an important first look at the feasibility of automated classification of real-world live surgical feedback based on text, audio, and video modalities.
APA
Kocielnik, R., Wong, E.Y., Chu, T.N., Lin, L., Huang, D., Wang, J., Anandkumar, A. & Hung, A.J. (2023). Deep Multimodal Fusion for Surgical Feedback Classification. Proceedings of the 3rd Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 225:256-267. Available from https://proceedings.mlr.press/v225/kocielnik23a.html.
