CatVLM: Enhancing Temporal Understanding in Cataract Surgery Videos with Boundary-Aware VLM

Jay N. Paranjape, Nisarg Shah, Nanthini Narayanan, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:947-959, 2026.

Abstract

Recent studies have shown the effectiveness of Vision Language Models (VLMs) for understanding and analyzing videos in the medical domain and for supporting various Question-Answer (QA) tasks. Yet, current VLMs fall short on queries that require temporal reasoning, a critical capability for surgical video understanding. In this work, we introduce CatVLM, a boundary-aware VLM designed to capture temporal dynamics in untrimmed cataract surgery videos. CatVLM performs three clinically relevant tasks that demand moment-level awareness: Video Moment Retrieval (VMR), Video Captioning (VC), and Counting. To facilitate training, we generate a bank of QA annotations for each task and propose a method to associate video clips with the timestamps at which they occur. To the best of our knowledge, this work is among the first to explicitly incorporate temporal boundary awareness into VLMs for cataract surgery, and for the medical domain more broadly. We evaluate CatVLM on two public cataract surgery datasets, establishing new baselines across all three tasks. All code, model checkpoints, and annotations will be released post-review.

Cite this Paper


BibTeX
@InProceedings{pmlr-v315-paranjape26a,
  title     = {CatVLM: Enhancing Temporal Understanding in Cataract Surgery Videos with Boundary-Aware VLM},
  author    = {Paranjape, Jay N. and Shah, Nisarg and Narayanan, Nanthini and Sikder, Shameema and Vedula, S. Swaroop and Patel, Vishal M.},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {947--959},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/paranjape26a/paranjape26a.pdf},
  url       = {https://proceedings.mlr.press/v315/paranjape26a.html},
  abstract  = {Recent studies have shown the effectiveness of Vision Language Models (VLMs) for understanding and analyzing videos in the medical domain and for supporting various Question-Answer (QA) tasks. Yet, current VLMs fall short on queries that require temporal reasoning, a critical capability for surgical video understanding. In this work, we introduce CatVLM, a boundary-aware VLM designed to capture temporal dynamics in untrimmed cataract surgery videos. CatVLM performs three clinically relevant tasks that demand moment-level awareness: Video Moment Retrieval (VMR), Video Captioning (VC), and Counting. To facilitate training, we generate a bank of QA annotations for each task and propose a method to associate video clips with the timestamps at which they occur. To the best of our knowledge, this work is among the first to explicitly incorporate temporal boundary awareness into VLMs for cataract surgery, and for the medical domain more broadly. We evaluate CatVLM on two public cataract surgery datasets, establishing new baselines across all three tasks. All code, model checkpoints, and annotations will be released post-review.}
}
Endnote
%0 Conference Paper
%T CatVLM: Enhancing Temporal Understanding in Cataract Surgery Videos with Boundary-Aware VLM
%A Jay N. Paranjape
%A Nisarg Shah
%A Nanthini Narayanan
%A Shameema Sikder
%A S. Swaroop Vedula
%A Vishal M. Patel
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-paranjape26a
%I PMLR
%P 947--959
%U https://proceedings.mlr.press/v315/paranjape26a.html
%V 315
%X Recent studies have shown the effectiveness of Vision Language Models (VLMs) for understanding and analyzing videos in the medical domain and for supporting various Question-Answer (QA) tasks. Yet, current VLMs fall short on queries that require temporal reasoning, a critical capability for surgical video understanding. In this work, we introduce CatVLM, a boundary-aware VLM designed to capture temporal dynamics in untrimmed cataract surgery videos. CatVLM performs three clinically relevant tasks that demand moment-level awareness: Video Moment Retrieval (VMR), Video Captioning (VC), and Counting. To facilitate training, we generate a bank of QA annotations for each task and propose a method to associate video clips with the timestamps at which they occur. To the best of our knowledge, this work is among the first to explicitly incorporate temporal boundary awareness into VLMs for cataract surgery, and for the medical domain more broadly. We evaluate CatVLM on two public cataract surgery datasets, establishing new baselines across all three tasks. All code, model checkpoints, and annotations will be released post-review.
APA
Paranjape, J.N., Shah, N., Narayanan, N., Sikder, S., Vedula, S.S. & Patel, V.M. (2026). CatVLM: Enhancing Temporal Understanding in Cataract Surgery Videos with Boundary-Aware VLM. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:947-959. Available from https://proceedings.mlr.press/v315/paranjape26a.html.

Related Material