Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation

Md Mostafijur Rahman; Radu Marculescu

Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation

Md Mostafijur Rahman, Radu Marculescu

Medical Imaging with Deep Learning, PMLR 227:1526-1544, 2024.

Abstract

Transformers have shown great success in medical image segmentation. However, transformers may exhibit a limited generalization ability due to the underlying single-scale self-attention (SA) mechanism. In this paper, we address this issue by introducing a Multi-scale hiERarchical vIsion Transformer (MERIT) backbone network, which improves the generalizability of the model by computing SA at multiple scales. We also incorporate an attention-based decoder, namely Cascaded Attention Decoding (CASCADE), for further refinement of multi-stage features generated by MERIT. Finally, we introduce an effective multi-stage feature mixing loss aggregation (MUTATION) method for better model training via implicit ensembling. Our experiments on two widely used medical image segmentation benchmarks (i.e., Synapse Multi-organ, ACDC) demonstrate the superior performance of MERIT over state-of-the-art methods. Our MERIT architecture and MUTATION loss aggregation can be used with downstream medical image and semantic segmentation tasks.

Cite this Paper

BibTeX


@InProceedings{pmlr-v227-rahman24a,
  title = 	 {Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation},
  author =       {Rahman, Md Mostafijur and Marculescu, Radu},
  booktitle = 	 {Medical Imaging with Deep Learning},
  pages = 	 {1526--1544},
  year = 	 {2024},
  editor = 	 {Oguz, Ipek and Noble, Jack and Li, Xiaoxiao and Styner, Martin and Baumgartner, Christian and Rusu, Mirabela and Heinmann, Tobias and Kontos, Despina and Landman, Bennett and Dawant, Benoit},
  volume = 	 {227},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {10--12 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v227/rahman24a/rahman24a.pdf},
  url = 	 {https://proceedings.mlr.press/v227/rahman24a.html},
  abstract = 	 {Transformers have shown great success in medical image segmentation. However, transformers may exhibit a limited generalization ability due to the underlying single-scale self-attention (SA) mechanism. In this paper, we address this issue by introducing a Multi-scale hiERarchical vIsion Transformer (MERIT) backbone network, which improves the generalizability of the model by computing SA at multiple scales. We also incorporate an attention-based decoder, namely Cascaded Attention Decoding (CASCADE), for further refinement of multi-stage features generated by MERIT. Finally, we introduce an effective multi-stage feature mixing loss aggregation (MUTATION) method for better model training via implicit ensembling. Our experiments on two widely used medical image segmentation benchmarks (i.e., Synapse Multi-organ, ACDC) demonstrate the superior performance of MERIT over state-of-the-art methods. Our MERIT architecture and MUTATION loss aggregation can be used with downstream medical image and semantic segmentation tasks.}
}

Endnote

%0 Conference Paper
%T Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation
%A Md Mostafijur Rahman
%A Radu Marculescu
%B Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ipek Oguz
%E Jack Noble
%E Xiaoxiao Li
%E Martin Styner
%E Christian Baumgartner
%E Mirabela Rusu
%E Tobias Heinmann
%E Despina Kontos
%E Bennett Landman
%E Benoit Dawant	
%F pmlr-v227-rahman24a
%I PMLR
%P 1526--1544
%U https://proceedings.mlr.press/v227/rahman24a.html
%V 227
%X Transformers have shown great success in medical image segmentation. However, transformers may exhibit a limited generalization ability due to the underlying single-scale self-attention (SA) mechanism. In this paper, we address this issue by introducing a Multi-scale hiERarchical vIsion Transformer (MERIT) backbone network, which improves the generalizability of the model by computing SA at multiple scales. We also incorporate an attention-based decoder, namely Cascaded Attention Decoding (CASCADE), for further refinement of multi-stage features generated by MERIT. Finally, we introduce an effective multi-stage feature mixing loss aggregation (MUTATION) method for better model training via implicit ensembling. Our experiments on two widely used medical image segmentation benchmarks (i.e., Synapse Multi-organ, ACDC) demonstrate the superior performance of MERIT over state-of-the-art methods. Our MERIT architecture and MUTATION loss aggregation can be used with downstream medical image and semantic segmentation tasks.

APA


Rahman, M.M. & Marculescu, R.. (2024). Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation. Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 227:1526-1544 Available from https://proceedings.mlr.press/v227/rahman24a.html.

Related Material

Download PDF