Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:350-368, 2026.
Abstract
Automatic summarization of surgical videos is critical for improving procedural documentation, supporting surgical training, and facilitating post-operative analysis. Despite recent advances in computer vision and natural language processing, most existing methods focus on either tool detection or clip-level captioning and lack an integrated approach that produces full, clinically meaningful reports. We introduce a multimodal framework that leverages visual transformers and large language models to generate comprehensive surgical video summaries. The method unfolds in three stages: (i) extraction of frame-level features to capture tools, tissues, and surgical actions, (ii) integration of temporal context through a ViViT-based encoder combined with frame-level captions, and (iii) synthesis of clip-level descriptions into structured surgical reports using a dedicated LLM. We evaluate the framework on the CholecT50 dataset of 50 laparoscopic videos, achieving 96% precision in tool detection and a BERT score of 0.74 for temporal summarization. These results demonstrate the potential of combining computer vision and language models to advance AI-assisted reporting, offering a step toward reliable, interpretable, and efficient clinical documentation.
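The data flow of the three stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name in it (`FrameFeatures`, `ClipDescription`, `extract_frame_features`, `describe_clips`, `write_report`, `clip_len`) is hypothetical, the frame annotations are hand-written toy inputs rather than model outputs, fixed-length windows stand in for the ViViT-based temporal encoder, and a plain template stands in for the LLM.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FrameFeatures:
    index: int
    tools: List[str]
    action: str


@dataclass
class ClipDescription:
    start: int
    end: int
    text: str


def extract_frame_features(frames: List[Dict]) -> List[FrameFeatures]:
    # Stage (i) stand-in: a real system would run a vision model on each frame.
    # Here the toy input already carries the tool and action labels.
    return [
        FrameFeatures(i, sorted(f["tools"]), f["action"])
        for i, f in enumerate(frames)
    ]


def describe_clips(features: List[FrameFeatures], clip_len: int) -> List[ClipDescription]:
    # Stage (ii) stand-in: fixed-length windows replace the temporal encoder
    # (assumption); each window is reduced to one short description.
    clips = []
    for start in range(0, len(features), clip_len):
        window = features[start:start + clip_len]
        tools = sorted({t for f in window for t in f.tools})
        # dict.fromkeys de-duplicates actions while keeping their order.
        actions = list(dict.fromkeys(f.action for f in window))
        tool_text = ", ".join(tools) if tools else "no visible tool"
        text = ", ".join(actions) + " using " + tool_text
        clips.append(ClipDescription(window[0].index, window[-1].index, text))
    return clips


def write_report(clips: List[ClipDescription]) -> str:
    # Stage (iii) stand-in: a template join replaces the LLM (assumption).
    return "\n".join(f"Frames {c.start}-{c.end}: {c.text}" for c in clips)


# Toy input: three hand-labelled frames (illustrative only).
frames = [
    {"tools": ["grasper"], "action": "retract"},
    {"tools": ["grasper", "hook"], "action": "dissect"},
    {"tools": ["clipper"], "action": "clip"},
]
report = write_report(describe_clips(extract_frame_features(frames), clip_len=2))
print(report)
```

The sketch only shows how per-frame outputs are grouped into clips and how clip descriptions are assembled into a report; the quality of a real report depends entirely on the models that replace the three stand-in functions.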