Hybrid Ladder Transformers with Efficient Parallel-Cross Attention for Medical Image Segmentation
Proceedings of The 5th International Conference on Medical Imaging with Deep Learning, PMLR 172:808-819, 2022.
Abstract
Most existing transformer-based network architectures for computer vision tasks are large (in number of parameters) and require large-scale datasets for training. However, the relatively small number of data samples in medical imaging, compared to the datasets used for general vision applications, makes it difficult to train transformers effectively for medical imaging applications. At the same time, transformer-based architectures encode long-range dependencies in the data and are able to learn more global representations, which could complement convolutional neural networks (CNNs) that primarily operate on features extracted in local image neighbourhoods. In this work, we present a hybrid transformer-based approach for the segmentation of medical images that works in conjunction with a CNN. We propose to use learnable global attention heads alongside a traditional convolutional segmentation network architecture to encode long-range dependencies. Specifically, in our proposed architecture the local information extracted by the convolution operations and the global information learned by the self-attention mechanisms are fused using bi-directional cross attention during the encoding process, resulting in what we call a hybrid ladder transformer (HyLT). We evaluate the proposed network on two different medical image segmentation datasets. The results show that it achieves better results than the relevant CNN- and transformer-based architectures.