[edit]
The Self-Supervision Regime and Encoder Fit for Histopathology Image Analysis
Proceedings of the 4th Machine Learning for Health Symposium, PMLR 259:48-60, 2025.
Abstract
This study systematically compares vision transformers (ViTs) and residual neural networks (ResNets) for slide-level histopathology analysis using self-supervised learning (SSL) strategies SimCLR and DINO, alongside ImageNet-initialized counterparts. Binary and multiclass classification were evaluated on the TCGA-CRCk and PANDA datasets using the multiple instance learning paradigm. GradCAM and attention rollout were applied for visual explainability for ResNets and ViTs, respectively. The results showed that SimCLR excelled with smaller architectures, while DINO performed better with larger ones. ResNets underperformed with DINO regardless of architecture size, whereas ViTs excelled on the PANDA dataset but required scaling to perform well on TCGA-CRCk, a more complex dataset. An emerging self-supervision regime and encoder fit suggests that DINO-ViTs and SimCLR-ResNets are well-suited pairs, though the effects of SSL methods on encoders remain unclear. ViTs demonstrated superior scalability and interpretability compared to ResNets, but further research is needed to better understand the interplay between SSL methods and encoders.