Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections

Max Kirchner, Alexander C. Jenke, Sebastian Bodenstedt, Fiona R. Kolbinger, Oliver L. Saldanha, Jakob N. Kather, Martin Wagner, Stefanie Speidel
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:1903-1934, 2026.

Abstract

Purpose: Data privacy regulations hinder the creation of generalizable foundation models (FMs) for surgery by preventing multi-institutional data aggregation. This study investigates federated learning (FL) as a privacy-preserving solution to collaboratively train robust surgical FMs. Methods: We introduce Federated EndoViT (FL-EndoViT), a federated framework that validates the Masked Autoencoder (MAE) pretraining strategy in a decentralized surgical setting. To ensure convergence under severe data heterogeneity, the architecture integrates adaptive Sharpness-Aware Minimization (FedSAM). Pretrained on the large-scale Endo700k dataset, FL-EndoViT is evaluated against a centralized baseline on different tasks including scene segmentation, action recognition, and phase recognition. Results: FedSAM is critical for successful pretraining, overcoming the convergence failures of standard federated methods. The resulting FL-EndoViT performs comparably to its centralized counterpart, with significant advantages in data-scarce, high-resolution segmentation and generalization to new surgical events. We also establish that full, end-to-end fine-tuning is necessary for optimal performance. Conclusion: This work validates FL with adaptive optimization as a viable paradigm for creating robust, privacy-preserving surgical FMs. Our findings provide a scalable framework for collaborative Surgical Data Science and underscore the optimizer’s critical role in handling data heterogeneity. Future work should explore video-based models to incorporate spatiotemporal dynamics.
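The abstract credits adaptive Sharpness-Aware Minimization (FedSAM) with making federated MAE pretraining converge under heterogeneous client data. As a rough illustration of that idea (a toy sketch, not the authors' implementation; all function names and the quadratic client losses are invented for this example), each client runs SAM locally and the server averages the resulting models as in FedAvg:

```python
# Toy sketch of FedSAM-style training: clients run Sharpness-Aware
# Minimization (SAM) locally, the server averages weights (FedAvg-style).
# Hypothetical code for illustration only -- not the paper's implementation.
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: perturb weights along the ascent direction,
    then descend using the gradient taken at the perturbed point."""
    g = grad_fn(w)
    norm = np.linalg.norm(g) + 1e-12
    w_adv = w + rho * g / norm        # step toward the sharpest nearby point
    return w - lr * grad_fn(w_adv)    # descend from the perturbed view

def fedsam_round(w_global, client_grad_fns, local_steps=5):
    """One federated round: each client refines the global model with
    local SAM steps; the server returns the average of client models."""
    client_models = []
    for grad_fn in client_grad_fns:
        w = w_global.copy()
        for _ in range(local_steps):
            w = sam_step(w, grad_fn)
        client_models.append(w)
    return np.mean(client_models, axis=0)

# Heterogeneous toy clients: quadratic losses with different optima,
# standing in for non-IID per-institution image collections.
optima = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.array([0.0, 2.0])]
grad_fns = [lambda w, c=c: w - c for c in optima]  # grad of 0.5*||w - c||^2

w = np.zeros(2)
for _ in range(20):
    w = fedsam_round(w, grad_fns)
```

On these toy quadratics the averaged model settles near the mean of the client optima; the point of SAM in the real setting is that seeking flat minima locally keeps the aggregated ViT stable despite inter-site heterogeneity.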

Cite this Paper


BibTeX
@InProceedings{pmlr-v315-kirchner26a,
  title     = {Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections},
  author    = {Kirchner, Max and Jenke, Alexander C. and Bodenstedt, Sebastian and Kolbinger, Fiona R. and Saldanha, Oliver L. and Kather, Jakob N. and Wagner, Martin and Speidel, Stefanie},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {1903--1934},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/kirchner26a/kirchner26a.pdf},
  url       = {https://proceedings.mlr.press/v315/kirchner26a.html},
  abstract  = {Purpose: Data privacy regulations hinder the creation of generalizable foundation models (FMs) for surgery by preventing multi-institutional data aggregation. This study investigates federated learning (FL) as a privacy-preserving solution to collaboratively train robust surgical FMs. Methods: We introduce Federated EndoViT (FL-EndoViT), a federated framework that validates the Masked Autoencoder (MAE) pretraining strategy in a decentralized surgical setting. To ensure convergence under severe data heterogeneity, the architecture integrates adaptive Sharpness-Aware Minimization (FedSAM). Pretrained on the large-scale Endo700k dataset, FL-EndoViT is evaluated against a centralized baseline on different tasks including scene segmentation, action recognition, and phase recognition. Results: FedSAM is critical for successful pretraining, overcoming the convergence failures of standard federated methods. The resulting FL-EndoViT performs comparably to its centralized counterpart, with significant advantages in data-scarce, high-resolution segmentation and generalization to new surgical events. We also establish that full, end-to-end fine-tuning is necessary for optimal performance. Conclusion: This work validates FL with adaptive optimization as a viable paradigm for creating robust, privacy-preserving surgical FMs. Our findings provide a scalable framework for collaborative Surgical Data Science and underscore the optimizer’s critical role in handling data heterogeneity. Future work should explore video-based models to incorporate spatiotemporal dynamics.}
}
Endnote
%0 Conference Paper
%T Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections
%A Max Kirchner
%A Alexander C. Jenke
%A Sebastian Bodenstedt
%A Fiona R. Kolbinger
%A Oliver L. Saldanha
%A Jakob N. Kather
%A Martin Wagner
%A Stefanie Speidel
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-kirchner26a
%I PMLR
%P 1903--1934
%U https://proceedings.mlr.press/v315/kirchner26a.html
%V 315
%X Purpose: Data privacy regulations hinder the creation of generalizable foundation models (FMs) for surgery by preventing multi-institutional data aggregation. This study investigates federated learning (FL) as a privacy-preserving solution to collaboratively train robust surgical FMs. Methods: We introduce Federated EndoViT (FL-EndoViT), a federated framework that validates the Masked Autoencoder (MAE) pretraining strategy in a decentralized surgical setting. To ensure convergence under severe data heterogeneity, the architecture integrates adaptive Sharpness-Aware Minimization (FedSAM). Pretrained on the large-scale Endo700k dataset, FL-EndoViT is evaluated against a centralized baseline on different tasks including scene segmentation, action recognition, and phase recognition. Results: FedSAM is critical for successful pretraining, overcoming the convergence failures of standard federated methods. The resulting FL-EndoViT performs comparably to its centralized counterpart, with significant advantages in data-scarce, high-resolution segmentation and generalization to new surgical events. We also establish that full, end-to-end fine-tuning is necessary for optimal performance. Conclusion: This work validates FL with adaptive optimization as a viable paradigm for creating robust, privacy-preserving surgical FMs. Our findings provide a scalable framework for collaborative Surgical Data Science and underscore the optimizer’s critical role in handling data heterogeneity. Future work should explore video-based models to incorporate spatiotemporal dynamics.
APA
Kirchner, M., Jenke, A.C., Bodenstedt, S., Kolbinger, F.R., Saldanha, O.L., Kather, J.N., Wagner, M. & Speidel, S. (2026). Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:1903-1934. Available from https://proceedings.mlr.press/v315/kirchner26a.html.
