Adaptive Inference for Medical Vision Transformers: Token Reduction or Early Exit?

Ji Young Byun, HyunSeo Lee, Jordan Shuff, Rengaraj Venkatesh, Nakul S. Shekhawat, Kunal S. Parikh, Rama Chellappa
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:2171-2191, 2026.

Abstract

Vision Transformers (ViTs) have demonstrated exceptional performance in medical image analysis, yet their computational demands hinder clinical deployment, particularly in time-sensitive applications. Medical imaging requires sample-adaptive optimization due to dataset heterogeneity across modalities and sample complexity; uniform strategies do not balance efficiency and accuracy well. We propose a unified adaptive inference framework that combines Token Reduction (TR) and Early Exiting (EE) through dataset-specific profiling. Our approach quantifies spatial redundancy via Jensen-Shannon Divergence (JSD) and prediction confidence at intermediate layers to train a lightweight predictor that dynamically selects inference strategies at test time. Across five medical datasets, including a real-world cataract dataset (INSIGHT), our framework achieves 71.4% average floating-point operations (FLOPs) reduction with only 0.1pp accuracy loss, substantially outperforming individual strategies (EE-only: 55.9%, TR-only: 57.7%). On PathMNIST, our adaptive inference framework simultaneously improves accuracy by 1.3pp while reducing computation by 77.2%. On INSIGHT, we maintain baseline accuracy with 69.8% FLOPs reduction, demonstrating robust real-world clinical applicability.
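The profiling signals named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the use of mean pairwise JSD over per-token distributions as the redundancy score, the specific thresholds, and the three-way routing rule (`choose_strategy`) are all assumptions made here for illustration only.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def spatial_redundancy(token_dists):
    """Mean pairwise JSD across per-token distributions (e.g. normalized
    attention rows). A LOW score means the tokens carry similar
    information, i.e. high spatial redundancy -- a candidate for
    token reduction."""
    n = len(token_dists)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(jsd(token_dists[i], token_dists[j]) for i, j in pairs) / len(pairs)


def choose_strategy(redundancy, confidence,
                    redundancy_thresh=0.05, confidence_thresh=0.9):
    """Toy router: exit early when the intermediate-layer prediction is
    already confident; otherwise prune tokens when they are highly
    redundant; otherwise run the full network."""
    if confidence >= confidence_thresh:
        return "early_exit"
    if redundancy <= redundancy_thresh:
        return "token_reduction"
    return "full_inference"
```

For example, a sample whose token distributions are nearly identical yields a near-zero redundancy score and would be routed to token reduction, while a sample with a confident intermediate prediction would exit early regardless of redundancy. In the paper, this decision is made by a learned lightweight predictor rather than fixed thresholds.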

Cite this Paper


BibTeX
@InProceedings{pmlr-v315-byun26b,
  title     = {Adaptive Inference for Medical Vision Transformers: Token Reduction or Early Exit?},
  author    = {Byun, Ji Young and Lee, HyunSeo and Shuff, Jordan and Venkatesh, Rengaraj and Shekhawat, Nakul S. and Parikh, Kunal S. and Chellappa, Rama},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {2171--2191},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/byun26b/byun26b.pdf},
  url       = {https://proceedings.mlr.press/v315/byun26b.html},
  abstract  = {Vision Transformers (ViTs) have demonstrated exceptional performance in medical image analysis, yet their computational demands hinder clinical deployment, particularly in time-sensitive applications. Medical imaging requires sample-adaptive optimization due to dataset heterogeneity across modalities and sample complexity; uniform strategies do not balance efficiency and accuracy well. We propose a unified adaptive inference framework that combines Token Reduction (TR) and Early Exiting (EE) through dataset-specific profiling. Our approach quantifies spatial redundancy via Jensen-Shannon Divergence (JSD) and prediction confidence at intermediate layers to train a lightweight predictor that dynamically selects inference strategies at test time. Across five medical datasets, including a real-world cataract dataset (INSIGHT), our framework achieves 71.4% average floating-point operations (FLOPs) reduction with only 0.1pp accuracy loss, substantially outperforming individual strategies (EE-only: 55.9%, TR-only: 57.7%). On PathMNIST, our adaptive inference framework simultaneously improves accuracy by 1.3pp while reducing computation by 77.2%. On INSIGHT, we maintain baseline accuracy with 69.8% FLOPs reduction, demonstrating robust real-world clinical applicability.}
}
Endnote
%0 Conference Paper
%T Adaptive Inference for Medical Vision Transformers: Token Reduction or Early Exit?
%A Ji Young Byun
%A HyunSeo Lee
%A Jordan Shuff
%A Rengaraj Venkatesh
%A Nakul S. Shekhawat
%A Kunal S. Parikh
%A Rama Chellappa
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-byun26b
%I PMLR
%P 2171--2191
%U https://proceedings.mlr.press/v315/byun26b.html
%V 315
%X Vision Transformers (ViTs) have demonstrated exceptional performance in medical image analysis, yet their computational demands hinder clinical deployment, particularly in time-sensitive applications. Medical imaging requires sample-adaptive optimization due to dataset heterogeneity across modalities and sample complexity; uniform strategies do not balance efficiency and accuracy well. We propose a unified adaptive inference framework that combines Token Reduction (TR) and Early Exiting (EE) through dataset-specific profiling. Our approach quantifies spatial redundancy via Jensen-Shannon Divergence (JSD) and prediction confidence at intermediate layers to train a lightweight predictor that dynamically selects inference strategies at test time. Across five medical datasets, including a real-world cataract dataset (INSIGHT), our framework achieves 71.4% average floating-point operations (FLOPs) reduction with only 0.1pp accuracy loss, substantially outperforming individual strategies (EE-only: 55.9%, TR-only: 57.7%). On PathMNIST, our adaptive inference framework simultaneously improves accuracy by 1.3pp while reducing computation by 77.2%. On INSIGHT, we maintain baseline accuracy with 69.8% FLOPs reduction, demonstrating robust real-world clinical applicability.
APA
Byun, J.Y., Lee, H., Shuff, J., Venkatesh, R., Shekhawat, N.S., Parikh, K.S. &amp; Chellappa, R. (2026). Adaptive Inference for Medical Vision Transformers: Token Reduction or Early Exit?. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:2171-2191. Available from https://proceedings.mlr.press/v315/byun26b.html.