Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation

Firdavs Nasriddinov, Rafal Kocielnik, Anima Anandkumar, Andrew J. Hung
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:952-984, 2026.

Abstract

High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale, but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by 1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, 2) fine-tuning a video-to-IAT model that leverages surgical procedure and task context as well as fine-grained temporal instrument motion (crucial for representing instruments and actions over time), and 3) demonstrating how to effectively leverage the IAT triplet representation to guide GPT-4o in generating clinically grounded, natural, trainer-style feedback. On Task 1 (video-to-IAT recognition), our context injection and temporal tracking deliver consistent AUC gains: Instrument 0.67 to 0.74, Action 0.60 to 0.63, Tissue 0.74 to 0.79. On Task 2 (feedback text generation), scored against the human trainer on a 1-5 fidelity rubric (1 = opposite/unsafe, 3 = admissible, 5 = perfect match), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%) and raises the share of admissible generations (score ≥ 3) from 21% to 42%. Traditional metrics also improve: Word Error Rate (WER) is 15–31% lower and ROUGE (phrase/substring overlap) is 9–64% higher. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
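The paper's actual prompting scheme is not reproduced on this page; as a rough illustration of what "IAT conditioning" means in practice, below is a minimal Python sketch in which the IATTriplet fields, the build_prompt helper, and the prompt wording are all hypothetical stand-ins, not the authors' implementation.

    # Minimal sketch of IAT-conditioned feedback generation.
    # All names and prompt text here are illustrative assumptions,
    # not the paper's published prompt or schema.
    from dataclasses import dataclass

    @dataclass
    class IATTriplet:
        instrument: str  # normalized instrument category, e.g. "needle driver"
        action: str      # normalized action category, e.g. "retract"
        target: str      # normalized tissue/target category, e.g. "bladder neck"

    def build_prompt(triplet: IATTriplet, procedure: str, task: str) -> str:
        """Condition the language model on structured IAT context instead of raw video."""
        return (
            f"You are an expert surgical trainer observing a {procedure} ({task}).\n"
            f"The trainee is using the {triplet.instrument} to {triplet.action} "
            f"the {triplet.target}.\n"
            "Give one short, natural sentence of trainer-style feedback."
        )

    if __name__ == "__main__":
        t = IATTriplet("needle driver", "retract", "bladder neck")
        print(build_prompt(t, "robot-assisted radical prostatectomy",
                           "vesico-urethral anastomosis"))

The resulting prompt string would be passed to GPT-4o (or any text-generation model) in place of, or alongside, the video input; the key design choice the abstract argues for is that the structured triplet, not the raw pixels, carries the clinically relevant state.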
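The Task 2 numbers rest on two simple quantities: word error rate against the trainer's reference utterance, and the share of generations rated at or above the "admissible" rubric level. Here is a self-contained sketch of both computations, assuming whitespace tokenization for WER (the paper's exact tokenization and scoring protocol are not given on this page):

    # Sketch of the reported evaluation quantities: word error rate (WER)
    # between a generated sentence and the trainer reference, and the share
    # of rubric scores >= 3. Whitespace tokenization is an assumption.

    def wer(reference: str, hypothesis: str) -> float:
        """Word-level edit distance (subs + ins + dels) divided by reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming edit distance over word sequences.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    def admissible_rate(rubric_scores: list[int], threshold: int = 3) -> float:
        """Fraction of generations rated at or above the 'admissible' rubric level."""
        return sum(s >= threshold for s in rubric_scores) / len(rubric_scores)

    print(wer("pull the needle through gently", "pull the needle gently"))  # 0.2
    print(admissible_rate([1, 2, 3, 4, 5]))  # 0.6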

Cite this Paper


BibTeX
@InProceedings{pmlr-v297-nasriddinov26a,
  title     = {Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation},
  author    = {Nasriddinov, Firdavs and Kocielnik, Rafal and Anandkumar, Anima and Hung, Andrew J.},
  booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages     = {952--984},
  year      = {2026},
  editor    = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume    = {297},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/nasriddinov26a/nasriddinov26a.pdf},
  url       = {https://proceedings.mlr.press/v297/nasriddinov26a.html}
}
Endnote
%0 Conference Paper
%T Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation
%A Firdavs Nasriddinov
%A Rafal Kocielnik
%A Anima Anandkumar
%A Andrew J. Hung
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-nasriddinov26a
%I PMLR
%P 952--984
%U https://proceedings.mlr.press/v297/nasriddinov26a.html
%V 297
APA
Nasriddinov, F., Kocielnik, R., Anandkumar, A. & Hung, A. J. (2026). Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:952-984. Available from https://proceedings.mlr.press/v297/nasriddinov26a.html.