Spatio-Temporal Landmark Detection via Selective Fine-Tuning of Echocardiography Foundation Models

Preetraj Bhoodoo, Sarina Thomas, Elisabeth Wetzer, Anne Schistad Solberg, Guy Ben-Yosef
Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL), PMLR 307:39-48, 2026.

Abstract

Foundation models (FMs) have shown remarkable capabilities across computer vision tasks, yet their effectiveness for complex medical downstream tasks remains underexplored. This work investigates whether state-of-the-art video-based FMs for echocardiography can perform precise spatio-temporal landmark detection without extensive fine-tuning. We evaluate two recent powerful FMs, namely EchoPrime and PanEcho, pre-trained on millions of echocardiographic video-text pairs, for left-ventricular contour detection on EchoNet-Dynamic. We compare encoder regimes (frozen, partially frozen, fully trainable) and decoder heads (MLP vs. GCN), and benchmark against strong non-FM backbones (ResNet-18 2D/3D, ViT-Base, MViTv2-Small). Frozen encoders perform poorly and variably ($\approx$78.00 Dice, ED), whereas selectively unfreezing two blocks with GCN+augmentation yields a large jump ($91.71\pm3.49$ Dice, ED), recovering most of the gap to full fine-tuning. Fully trainable EchoPrime (GCN+augmentation) achieves $93.13\pm3.11/90.95\pm3.71$ Dice (ED/ES), which is state-of-the-art for regression-based models on EchoNet. Deploying separate, fully fine-tuned models for each task quickly becomes impractical in resource-constrained settings. Our results suggest that partially fine-tuning the FM is a resource-efficient strategy that recovers most of the performance benefits of end-to-end training, while avoiding the overhead of maintaining a separate model for each task. The code is available at \href{https://github.com/preetrajb/EchoVLMLandmarks}{https://github.com/preetrajb/EchoVLMLandmarks}.
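The partially frozen regime highlighted above is straightforward to reproduce in outline. Below is a minimal PyTorch sketch, assuming a generic transformer-style video encoder that exposes an ordered list of blocks; the loader name, the .blocks attribute, the embedding size, and the ring-graph GCN head are illustrative assumptions, not the released EchoPrime or PanEcho code.

# Minimal sketch of the "partially frozen" regime described in the abstract:
# freeze a pretrained video encoder, re-enable gradients only for its last
# two blocks, and decode left-ventricular contour landmarks with a small
# GCN head. All names (loader, .blocks, dimensions) are assumptions.
import torch
import torch.nn as nn

def selectively_unfreeze(encoder: nn.Module, blocks, n_trainable: int = 2) -> None:
    """Freeze all encoder parameters, then unfreeze the last `n_trainable`
    blocks (assumes the encoder exposes an ordered block list)."""
    for p in encoder.parameters():
        p.requires_grad = False
    for block in list(blocks)[-n_trainable:]:
        for p in block.parameters():
            p.requires_grad = True

class GCNLayer(nn.Module):
    """One graph-convolution step over the contour graph: each landmark
    aggregates features from its neighbours via a fixed normalized adjacency."""
    def __init__(self, in_dim: int, out_dim: int, adj: torch.Tensor):
        super().__init__()
        self.register_buffer("adj", adj)        # (K, K), row-normalized
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, K, in_dim)
        return torch.relu(self.lin(torch.matmul(self.adj, x)))

class GCNLandmarkHead(nn.Module):
    """Maps a global video embedding to K (x, y) contour points by tiling the
    embedding per landmark and refining it with two GCN layers. A closed-ring
    contour graph is assumed here purely for illustration."""
    def __init__(self, embed_dim: int, num_landmarks: int):
        super().__init__()
        # Ring adjacency: each contour point connects to its two neighbours
        # plus a self-loop, then rows are normalized.
        eye = torch.eye(num_landmarks)
        ring = torch.roll(eye, 1, dims=0) + torch.roll(eye, -1, dims=0) + eye
        adj = ring / ring.sum(dim=1, keepdim=True)
        self.gcn1 = GCNLayer(embed_dim, 256, adj)
        self.gcn2 = GCNLayer(256, 256, adj)
        self.out = nn.Linear(256, 2)            # (x, y) per landmark
        self.num_landmarks = num_landmarks

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (B, embed_dim)
        x = z.unsqueeze(1).expand(-1, self.num_landmarks, -1)
        return self.out(self.gcn2(self.gcn1(x)))          # (B, K, 2)

# Usage under the same assumptions:
# encoder = load_pretrained_video_encoder()            # hypothetical loader
# selectively_unfreeze(encoder, encoder.blocks, 2)     # assumes .blocks exists
# head = GCNLandmarkHead(embed_dim=512, num_landmarks=40)
# trainable = [p for p in encoder.parameters() if p.requires_grad]
# optimizer = torch.optim.AdamW(trainable + list(head.parameters()), lr=1e-4)

Passing only the parameters with requires_grad=True to the optimizer keeps the frozen weights untouched while the last two blocks and the head adapt to the landmark task, which is what makes the partially frozen regime cheap to train and to store.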

Cite this Paper


BibTeX
@InProceedings{pmlr-v307-bhoodoo26a,
  title     = {Spatio-Temporal Landmark Detection via Selective Fine-Tuning of Echocardiography Foundation Models},
  author    = {Bhoodoo, Preetraj and Thomas, Sarina and Wetzer, Elisabeth and Solberg, Anne Schistad and Ben-Yosef, Guy},
  booktitle = {Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL)},
  pages     = {39--48},
  year      = {2026},
  editor    = {Kim, Hyeongji and Ramírez Rivera, Adín and Ricaud, Benjamin},
  volume    = {307},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--08 Jan},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v307/main/assets/bhoodoo26a/bhoodoo26a.pdf},
  url       = {https://proceedings.mlr.press/v307/bhoodoo26a.html},
  abstract  = {Foundation models (FMs) have shown remarkable capabilities across computer vision tasks, yet their effectiveness for complex medical downstream tasks remains underexplored. This work investigates whether state-of-the-art video-based FMs for echocardiography can perform precise spatio-temporal landmark detection without extensive fine-tuning. We evaluate two recent powerful FMs, namely EchoPrime and PanEcho, pre-trained on millions of echocardiographic video-text pairs, for left-ventricular contour detection on EchoNet-Dynamic. We compare encoder regimes (frozen, partially frozen, fully trainable) and decoder heads (MLP vs. GCN), and benchmark against strong non-FM backbones (ResNet-18 2D/3D, ViT-Base, MViTv2-Small). Frozen encoders perform poorly and variably ($\approx$78.00 Dice, ED), whereas selectively unfreezing two blocks with GCN+augmentation yields a large jump ($91.71\pm3.49$ Dice, ED), recovering most of the gap to full fine-tuning. Fully trainable EchoPrime (GCN+augmentation) achieves $93.13\pm3.11/90.95\pm3.71$ Dice (ED/ES), which is state-of-the-art for regression-based models on EchoNet. Deploying separate, fully fine-tuned models for each task quickly becomes impractical in resource-constrained settings. Our results suggest that partially fine-tuning the FM is a resource-efficient strategy that recovers most of the performance benefits of end-to-end training, while avoiding the overhead of maintaining a separate model for each task. The code is available at \href{https://github.com/preetrajb/EchoVLMLandmarks}{https://github.com/preetrajb/EchoVLMLandmarks}.}
}
Endnote
%0 Conference Paper
%T Spatio-Temporal Landmark Detection via Selective Fine-Tuning of Echocardiography Foundation Models
%A Preetraj Bhoodoo
%A Sarina Thomas
%A Elisabeth Wetzer
%A Anne Schistad Solberg
%A Guy Ben-Yosef
%B Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL)
%C Proceedings of Machine Learning Research
%D 2026
%E Hyeongji Kim
%E Adín Ramírez Rivera
%E Benjamin Ricaud
%F pmlr-v307-bhoodoo26a
%I PMLR
%P 39--48
%U https://proceedings.mlr.press/v307/bhoodoo26a.html
%V 307
%X Foundation models (FMs) have shown remarkable capabilities across computer vision tasks, yet their effectiveness for complex medical downstream tasks remains underexplored. This work investigates whether state-of-the-art video-based FMs for echocardiography can perform precise spatio-temporal landmark detection without extensive fine-tuning. We evaluate two recent powerful FMs, namely EchoPrime and PanEcho, pre-trained on millions of echocardiographic video-text pairs, for left-ventricular contour detection on EchoNet-Dynamic. We compare encoder regimes (frozen, partially frozen, fully trainable) and decoder heads (MLP vs. GCN), and benchmark against strong non-FM backbones (ResNet-18 2D/3D, ViT-Base, MViTv2-Small). Frozen encoders perform poorly and variably ($\approx$78.00 Dice, ED), whereas selectively unfreezing two blocks with GCN+augmentation yields a large jump ($91.71\pm3.49$ Dice, ED), recovering most of the gap to full fine-tuning. Fully trainable EchoPrime (GCN+augmentation) achieves $93.13\pm3.11/90.95\pm3.71$ Dice (ED/ES), which is state-of-the-art for regression-based models on EchoNet. Deploying separate, fully fine-tuned models for each task quickly becomes impractical in resource-constrained settings. Our results suggest that partially fine-tuning the FM is a resource-efficient strategy that recovers most of the performance benefits of end-to-end training, while avoiding the overhead of maintaining a separate model for each task. The code is available at \href{https://github.com/preetrajb/EchoVLMLandmarks}{https://github.com/preetrajb/EchoVLMLandmarks}.
APA
Bhoodoo, P., Thomas, S., Wetzer, E., Solberg, A.S. & Ben-Yosef, G. (2026). Spatio-Temporal Landmark Detection via Selective Fine-Tuning of Echocardiography Foundation Models. Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL), in Proceedings of Machine Learning Research 307:39-48. Available from https://proceedings.mlr.press/v307/bhoodoo26a.html.
