Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI

Cristian Cosentino, Hugo Georgenthum, Fabrizio Marozzo, Pietro Liò
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:350-368, 2026.

Abstract

Automatic summarization of surgical videos is critical for improving procedural documentation, supporting surgical training, and facilitating post-operative analysis. Despite recent advances in computer vision and natural language processing, most existing methods either focus on tool detection or clip-level captioning, lacking an integrated approach that produces full, clinically meaningful reports. We introduce a multimodal framework that leverages visual transformers and large language models to generate comprehensive surgical video summaries. The method unfolds in three stages: (i) extraction of frame-level features to capture tools, tissues, and surgical actions, (ii) integration of temporal context through a ViViT-based encoder combined with frame-level captions, and (iii) synthesis of clip-level descriptions into structured surgical reports using a dedicated LLM. We evaluate the framework on the CholecT50 dataset of 50 laparoscopic videos, achieving 96% precision in tool detection and a BERT score of 0.74 for temporal summarization. These results demonstrate the potential of combining computer vision and language models to advance AI-assisted reporting, offering a step toward reliable, interpretable, and efficient clinical documentation.
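The three stages described above can be pictured as a simple pipeline. The sketch below is purely illustrative and is not the authors' code: every function name, data shape, and output format is a hypothetical stand-in for the real components (frame-level feature extractor, ViViT-style temporal encoder, and report-synthesizing LLM).

```python
# Hypothetical sketch of the three-stage pipeline from the abstract.
# All names and structures are illustrative stand-ins, not the paper's implementation.
from dataclasses import dataclass
from typing import List, Dict


@dataclass
class Clip:
    frames: List[str]  # frame identifiers, standing in for actual video frames


def extract_frame_features(frame: str) -> Dict[str, object]:
    """Stage (i): per-frame recognition of tools, tissues, and actions (stubbed)."""
    return {"frame": frame, "tools": ["grasper"], "action": "retract"}


def encode_clip(features: List[Dict[str, object]]) -> str:
    """Stage (ii): temporal aggregation (ViViT-style encoder) into a clip caption (stubbed)."""
    actions = sorted({str(f["action"]) for f in features})
    return f"Clip of {len(features)} frames; actions: {', '.join(actions)}"


def synthesize_report(clip_captions: List[str]) -> str:
    """Stage (iii): a dedicated LLM would fuse clip captions into a structured report (stubbed)."""
    return "Operative report:\n" + "\n".join(f"- {c}" for c in clip_captions)


clips = [Clip(frames=[f"video1_frame{i}" for i in range(4)]) for _ in range(2)]
captions = [encode_clip([extract_frame_features(fr) for fr in c.frames]) for c in clips]
print(synthesize_report(captions))
```

The stubs only fix the interfaces between stages: frame-level features flow into a temporal encoder, whose clip-level captions are then fused into a single report, mirroring the (i)-(iii) decomposition in the abstract.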

Cite this Paper


BibTeX
@InProceedings{pmlr-v297-cosentino26a,
  title     = {Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI},
  author    = {Cosentino, Cristian and Georgenthum, Hugo and Marozzo, Fabrizio and Li{\`o}, Pietro},
  booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages     = {350--368},
  year      = {2026},
  editor    = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume    = {297},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/cosentino26a/cosentino26a.pdf},
  url       = {https://proceedings.mlr.press/v297/cosentino26a.html},
  abstract  = {Automatic summarization of surgical videos is critical for improving procedural documentation, supporting surgical training, and facilitating post-operative analysis. Despite recent advances in computer vision and natural language processing, most existing methods either focus on tool detection or clip-level captioning, lacking an integrated approach that produces full, clinically meaningful reports. We introduce a multimodal framework that leverages visual transformers and large language models to generate comprehensive surgical video summaries. The method unfolds in three stages: (i) extraction of frame-level features to capture tools, tissues, and surgical actions, (ii) integration of temporal context through a {ViViT}-based encoder combined with frame-level captions, and (iii) synthesis of clip-level descriptions into structured surgical reports using a dedicated {LLM}. We evaluate the framework on the CholecT50 dataset of 50 laparoscopic videos, achieving 96% precision in tool detection and a {BERT} score of 0.74 for temporal summarization. These results demonstrate the potential of combining computer vision and language models to advance {AI}-assisted reporting, offering a step toward reliable, interpretable, and efficient clinical documentation.}
}
Endnote
%0 Conference Paper
%T Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI
%A Cristian Cosentino
%A Hugo Georgenthum
%A Fabrizio Marozzo
%A Pietro Liò
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-cosentino26a
%I PMLR
%P 350--368
%U https://proceedings.mlr.press/v297/cosentino26a.html
%V 297
%X Automatic summarization of surgical videos is critical for improving procedural documentation, supporting surgical training, and facilitating post-operative analysis. Despite recent advances in computer vision and natural language processing, most existing methods either focus on tool detection or clip-level captioning, lacking an integrated approach that produces full, clinically meaningful reports. We introduce a multimodal framework that leverages visual transformers and large language models to generate comprehensive surgical video summaries. The method unfolds in three stages: (i) extraction of frame-level features to capture tools, tissues, and surgical actions, (ii) integration of temporal context through a ViViT-based encoder combined with frame-level captions, and (iii) synthesis of clip-level descriptions into structured surgical reports using a dedicated LLM. We evaluate the framework on the CholecT50 dataset of 50 laparoscopic videos, achieving 96% precision in tool detection and a BERT score of 0.74 for temporal summarization. These results demonstrate the potential of combining computer vision and language models to advance AI-assisted reporting, offering a step toward reliable, interpretable, and efficient clinical documentation.
APA
Cosentino, C., Georgenthum, H., Marozzo, F. & Liò, P. (2026). Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:350-368. Available from https://proceedings.mlr.press/v297/cosentino26a.html.