VariViT: A Vision Transformer for Variable Image Sizes

Aswathi Varma, Suprosanna Shit, Chinmay Prabhakar, Daniel Scholz, Hongwei Bran Li, Bjoern Menze, Daniel Rueckert, Benedikt Wiestler
Proceedings of The 7th International Conference on Medical Imaging with Deep Learning, PMLR 250:1571-1583, 2024.

Abstract

Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artifacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning.
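To make the two core ideas concrete, below is a minimal, illustrative sketch of (a) interpolating a learnable 3D positional-embedding grid to a variable number of patches and (b) bucketing equally sized crops into a single mini-batch so no resizing or padding is needed. The function names, the trilinear interpolation choice, and the bucketing-by-shape heuristic are assumptions for illustration, not the authors' exact implementation.

```python
# Hypothetical sketch of positional-embedding resizing and size-bucketed batching
# for a 3D ViT with a fixed patch size but variable crop sizes.
from collections import defaultdict

import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed: torch.Tensor,
                     grid_size: tuple,
                     new_grid_size: tuple) -> torch.Tensor:
    """Interpolate a (1, D*H*W, C) positional embedding to a new 3D patch grid."""
    _, n, c = pos_embed.shape
    assert n == grid_size[0] * grid_size[1] * grid_size[2]
    # (1, N, C) -> (1, C, D, H, W) so spatial interpolation can be applied
    pe = pos_embed.transpose(1, 2).reshape(1, c, *grid_size)
    pe = F.interpolate(pe, size=new_grid_size, mode="trilinear", align_corners=False)
    # back to (1, N_new, C) token layout
    return pe.reshape(1, c, -1).transpose(1, 2)


def bucket_by_size(samples: list) -> list:
    """Group (image, label) pairs by spatial shape so each mini-batch contains
    equally sized crops and no crop has to be padded or resized."""
    buckets = defaultdict(list)
    for image, label in samples:
        buckets[tuple(image.shape[-3:])].append((image, label))
    return [
        (torch.stack([img for img, _ in group]),
         torch.tensor([lbl for _, lbl in group]))
        for group in buckets.values()
    ]


if __name__ == "__main__":
    # Embedding learned on an 8x8x8 patch grid with embedding dim 384 (illustrative numbers)
    pos_embed = torch.randn(1, 8 * 8 * 8, 384)
    # A smaller crop that tiles into a 6x5x7 patch grid
    resized = resize_pos_embed(pos_embed, (8, 8, 8), (6, 5, 7))
    print(resized.shape)  # torch.Size([1, 210, 384])
```

In this sketch, batching crops of identical size together keeps the token count uniform within a batch, which is one plausible way to realize the reported reduction in computation time relative to padding every crop to the largest size.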

Cite this Paper


BibTeX
@InProceedings{pmlr-v250-varma24a,
  title     = {VariViT: A Vision Transformer for Variable Image Sizes},
  author    = {Varma, Aswathi and Shit, Suprosanna and Prabhakar, Chinmay and Scholz, Daniel and Li, Hongwei Bran and Menze, Bjoern and Rueckert, Daniel and Wiestler, Benedikt},
  booktitle = {Proceedings of The 7th International Conference on Medical Imaging with Deep Learning},
  pages     = {1571--1583},
  year      = {2024},
  editor    = {Burgos, Ninon and Petitjean, Caroline and Vakalopoulou, Maria and Christodoulidis, Stergios and Coupe, Pierrick and Delingette, Hervé and Lartizien, Carole and Mateus, Diana},
  volume    = {250},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--05 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v250/main/assets/varma24a/varma24a.pdf},
  url       = {https://proceedings.mlr.press/v250/varma24a.html},
  abstract  = {Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artifacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning.}
}
Endnote
%0 Conference Paper
%T VariViT: A Vision Transformer for Variable Image Sizes
%A Aswathi Varma
%A Suprosanna Shit
%A Chinmay Prabhakar
%A Daniel Scholz
%A Hongwei Bran Li
%A Bjoern Menze
%A Daniel Rueckert
%A Benedikt Wiestler
%B Proceedings of The 7th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ninon Burgos
%E Caroline Petitjean
%E Maria Vakalopoulou
%E Stergios Christodoulidis
%E Pierrick Coupe
%E Hervé Delingette
%E Carole Lartizien
%E Diana Mateus
%F pmlr-v250-varma24a
%I PMLR
%P 1571--1583
%U https://proceedings.mlr.press/v250/varma24a.html
%V 250
%X Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artifacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning.
APA
Varma, A., Shit, S., Prabhakar, C., Scholz, D., Li, H.B., Menze, B., Rueckert, D. & Wiestler, B. (2024). VariViT: A Vision Transformer for Variable Image Sizes. Proceedings of The 7th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 250:1571-1583. Available from https://proceedings.mlr.press/v250/varma24a.html.
