Simple and Efficient Vision Backbone Adapter for Image Semantic Segmentation

Dingjie Peng, Wataru Kameyama
Proceedings of the 15th Asian Conference on Machine Learning, PMLR 222:1071-1086, 2024.

Abstract

Fine-tuning a model for semantic segmentation from a pretrained vision backbone is common practice in computer vision. However, few works attempt to enlarge the semantic-context learning capacity by incorporating a segmentation adapter into the backbone itself. In this paper, we present a simple but efficient segmentation adapter, termed SegAdapter, which can be plugged into a pretrained vision backbone to improve the performance of existing models for image semantic segmentation. SegAdapter has three attractive advantages: 1) it is a plug-and-play module with strong adaptability to both CNN- and Transformer-based models such as ConvNeXt and Segformer; 2) it applies a lightweight High-order Spatial Attention (HSA) to exploit intermediate features from the pretrained backbone, which extends the model depth and produces auxiliary segmentation maps that enhance the model; 3) it builds a more powerful vision backbone by incorporating semantic context into each stage, taking over part of the function of the segmentation head, so a SegAdapter-augmented model can be paired with a simply designed decode head to avoid heavy computational cost. By plugging multiple SegAdapter layers into different vision backbones, we construct a series of SegAdapter-based segmentation models. Extensive experiments show that SegAdapter can be combined with mainstream CNN and Transformer backbones to improve mIoU by a large margin while introducing minimal additional parameters and FLOPs.
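
The abstract only sketches how SegAdapter works (a per-stage adapter that applies lightweight High-order Spatial Attention to intermediate backbone features and emits auxiliary segmentation maps). Below is a minimal, hypothetical PyTorch sketch of how such an adapter could be wired; the module structure, the names (HighOrderSpatialAttention, proj_a/proj_b, reduction), and the use of multiplicative feature interactions as the "high-order" term are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical SegAdapter-style module; names and details are assumptions,
# since the abstract does not spell out the HSA formulation.
import torch
import torch.nn as nn


class HighOrderSpatialAttention(nn.Module):
    """Lightweight spatial attention built from multiplicative (second-order)
    feature interactions -- an assumed stand-in for the paper's HSA block."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        # Two cheap 1x1 projections whose element-wise product models a
        # higher-order interaction, followed by a 3x3 conv that mixes space
        # into a single-channel attention map.
        self.proj_a = nn.Conv2d(channels, hidden, kernel_size=1)
        self.proj_b = nn.Conv2d(channels, hidden, kernel_size=1)
        self.spatial = nn.Conv2d(hidden, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.spatial(self.proj_a(x) * self.proj_b(x)))
        return x * attn  # spatially re-weight the input features


class SegAdapter(nn.Module):
    """Plug-and-play adapter for one backbone stage: refines the stage's
    intermediate features with HSA and emits an auxiliary segmentation map."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.hsa = HighOrderSpatialAttention(channels)
        self.aux_head = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        refined = feat + self.hsa(feat)      # residual refinement keeps backbone features intact
        aux_logits = self.aux_head(refined)  # auxiliary map, e.g. for deep supervision
        return refined, aux_logits           # refined features feed the next stage / decode head


if __name__ == "__main__":
    # Example: adapt a stage-level feature map of a ConvNeXt/Segformer-like backbone.
    adapter = SegAdapter(channels=256, num_classes=19)
    stage_feat = torch.randn(2, 256, 32, 32)
    refined, aux = adapter(stage_feat)
    print(refined.shape, aux.shape)  # (2, 256, 32, 32) and (2, 19, 32, 32)
```

In this reading, the refined features flow on to the next backbone stage while the auxiliary logits provide extra supervision during training, which is one plausible way such an adapter could relieve the decode head of heavy computation.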

Cite this Paper


BibTeX
@InProceedings{pmlr-v222-peng24a,
  title     = {Simple and Efficient Vision Backbone Adapter for Image Semantic Segmentation},
  author    = {Peng, Dingjie and Kameyama, Wataru},
  booktitle = {Proceedings of the 15th Asian Conference on Machine Learning},
  pages     = {1071--1086},
  year      = {2024},
  editor    = {Yanıkoğlu, Berrin and Buntine, Wray},
  volume    = {222},
  series    = {Proceedings of Machine Learning Research},
  month     = {11--14 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v222/peng24a/peng24a.pdf},
  url       = {https://proceedings.mlr.press/v222/peng24a.html},
  abstract  = {Utilizing a pretrained vision backbone to finetune a model for semantic segmentation is common practice in computer vision. However, there are few works intending to enlarge the semantic context learning capacity by incorporating a segmentation adapter into the backbone. Thus, in this paper, we present a simple but efficient segmentation adapter, termed as SegAdapter, which can be plugged into the pretrained vision backbone to improve the performance of existing models for image semantic segmentation. We summarize SegAdapter with three attractive advantages: 1) SegAdapter is a plug-and-play module demonstrating strong adaptability in CNN and Transformer based models such as ConvNext and Segformer, 2) SegAdapter applies a light-weight High-order Spatial Attention (HSA) to make use of intermediate features from the pretrained backbone which extends the model depth and produces auxiliary segmentation maps for model enhancement, 3) SegAdapter builds a powerful vision backbone by incorporating the semantic context into each stage which takes on some of the functions of the segmentation head. So, SegAdapter augmented model can be used in simple designed decode head to avoid heavy computational cost. By plugging multiple SegAdapter layers into different vision backbones, we construct a series of SegAdapter-based segmentation models. We show through the extensive experiments that SegAdapter can be used with mainstream backbones like CNN and Transformer to improve mIoU performance in a large margin while introducing minimal additional parameters and FLOPs.}
}
Endnote
%0 Conference Paper
%T Simple and Efficient Vision Backbone Adapter for Image Semantic Segmentation
%A Dingjie Peng
%A Wataru Kameyama
%B Proceedings of the 15th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Berrin Yanıkoğlu
%E Wray Buntine
%F pmlr-v222-peng24a
%I PMLR
%P 1071--1086
%U https://proceedings.mlr.press/v222/peng24a.html
%V 222
%X Utilizing a pretrained vision backbone to finetune a model for semantic segmentation is common practice in computer vision. However, there are few works intending to enlarge the semantic context learning capacity by incorporating a segmentation adapter into the backbone. Thus, in this paper, we present a simple but efficient segmentation adapter, termed as SegAdapter, which can be plugged into the pretrained vision backbone to improve the performance of existing models for image semantic segmentation. We summarize SegAdapter with three attractive advantages: 1) SegAdapter is a plug-and-play module demonstrating strong adaptability in CNN and Transformer based models such as ConvNext and Segformer, 2) SegAdapter applies a light-weight High-order Spatial Attention (HSA) to make use of intermediate features from the pretrained backbone which extends the model depth and produces auxiliary segmentation maps for model enhancement, 3) SegAdapter builds a powerful vision backbone by incorporating the semantic context into each stage which takes on some of the functions of the segmentation head. So, SegAdapter augmented model can be used in simple designed decode head to avoid heavy computational cost. By plugging multiple SegAdapter layers into different vision backbones, we construct a series of SegAdapter-based segmentation models. We show through the extensive experiments that SegAdapter can be used with mainstream backbones like CNN and Transformer to improve mIoU performance in a large margin while introducing minimal additional parameters and FLOPs.
APA
Peng, D., & Kameyama, W. (2024). Simple and Efficient Vision Backbone Adapter for Image Semantic Segmentation. Proceedings of the 15th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 222:1071-1086. Available from https://proceedings.mlr.press/v222/peng24a.html.
