Discriminative Self-Supervised Pre-Training for Esophagitis Detection in Upper GI Endoscopy Images
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:1137-1152, 2026.
Abstract
Early and accurate detection of esophagitis in upper gastrointestinal endoscopy is essential for guiding targeted treatment and preventing progression to more severe disease such as esophageal cancer. Although deep learning methods have shown promise in supporting esophagitis diagnosis, their performance depends heavily on large amounts of labeled data, which are scarce in this domain. Consequently, supervised models often struggle to generalize to the high visual variability and subtle lesion differences encountered in real-world endoscopic examinations. In this work, we study discriminative self-supervised pre-training as a means of leveraging large-scale unlabeled data for robust representation learning. Multiple Vision Transformer models are pre-trained using the DINO framework on 395,201 unlabeled gastrointestinal endoscopy images and subsequently fine-tuned on a curated esophagitis dataset from three clinical centers. Our results demonstrate that self-supervised pre-training on in-domain endoscopic images substantially improves esophagitis detection compared to supervised pre-training on natural image datasets such as ImageNet. Specifically, in-domain DINO pre-training yields an average gain of 6.60 percentage points in AUPRC on the downstream detection task, with the best-performing model achieving an AUPRC of 89.82%. These findings highlight the importance of in-domain self-supervised learning for reducing annotation dependency and improving model robustness in upper GI endoscopy analysis.
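The abstract reports results in AUPRC (area under the precision-recall curve), which for a ranked list of prediction scores reduces to average precision: the mean of the precision values at the ranks where positive examples occur. A minimal sketch of this computation (function name and example data are illustrative, not from the paper):

```python
def auprc(y_true, scores):
    """Average precision: mean precision at the rank of each positive.

    y_true: list of 0/1 labels; scores: list of prediction scores.
    Equivalent to the step-wise AUPRC commonly reported for detection tasks.
    """
    # Rank examples by descending score.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += tp / rank  # precision at this positive's rank
    n_pos = sum(y_true)
    return ap / n_pos if n_pos else 0.0

# Illustrative toy example: one negative ranked above two of three positives.
print(auprc([1, 0, 1, 1], [0.9, 0.8, 0.7, 0.6]))  # ≈ 0.806
```

Unlike accuracy, this metric is insensitive to the ratio of negatives ranked below all positives, which makes it a common choice for lesion-detection tasks where abnormal frames are a minority class.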