Simple and Efficient Vision Backbone Adapter for Image Semantic Segmentation
Proceedings of the 15th Asian Conference on Machine Learning, PMLR 222:1071-1086, 2024.
Abstract
Fine-tuning a model built on a pretrained vision backbone is common practice for semantic segmentation in computer vision. However, few works attempt to enlarge the semantic context learning capacity by incorporating a segmentation adapter into the backbone. In this paper, we present a simple but efficient segmentation adapter, termed SegAdapter, which can be plugged into a pretrained vision backbone to improve the performance of existing models for image semantic segmentation. SegAdapter offers three attractive advantages: 1) it is a plug-and-play module with strong adaptability to both CNN- and Transformer-based models such as ConvNext and Segformer; 2) it applies a lightweight High-order Spatial Attention (HSA) to exploit intermediate features from the pretrained backbone, which extends the model depth and produces auxiliary segmentation maps for model enhancement; 3) it builds a powerful vision backbone by injecting semantic context into each stage, taking over some of the functions of the segmentation head, so a SegAdapter-augmented model can work with a simply designed decode head and avoid heavy computational cost. By plugging multiple SegAdapter layers into different vision backbones, we construct a series of SegAdapter-based segmentation models. Extensive experiments show that SegAdapter improves the mIoU of mainstream CNN and Transformer backbones by a large margin while introducing minimal additional parameters and FLOPs.
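To make the plug-and-play idea concrete, below is a minimal PyTorch sketch of how such an adapter could attach to one backbone stage. The abstract does not specify the HSA formulation or the adapter's internals, so everything here is an assumption for illustration: the class names (HighOrderSpatialAttention, SegAdapter), the depthwise-gated second-order interaction standing in for HSA, the residual refinement, and the 1x1 auxiliary segmentation head.

    import torch
    import torch.nn as nn

    class HighOrderSpatialAttention(nn.Module):
        # Hypothetical stand-in for the paper's lightweight HSA: a cheap
        # depthwise spatial branch gates the input element-wise (a
        # second-order interaction), then a pointwise conv projects back.
        def __init__(self, channels: int):
            super().__init__()
            self.spatial = nn.Conv2d(channels, channels, kernel_size=3,
                                     padding=1, groups=channels)  # depthwise
            self.project = nn.Conv2d(channels, channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Multiplying the gated spatial response with x introduces the
            # higher-order spatial interaction at near-zero extra cost.
            return self.project(torch.sigmoid(self.spatial(x)) * x)

    class SegAdapter(nn.Module):
        # Assumed behavior per the abstract: refine a stage's intermediate
        # features and emit an auxiliary segmentation map for supervision.
        def __init__(self, channels: int, num_classes: int):
            super().__init__()
            self.hsa = HighOrderSpatialAttention(channels)
            self.aux_head = nn.Conv2d(channels, num_classes, kernel_size=1)

        def forward(self, x: torch.Tensor):
            x = x + self.hsa(x)      # residual refinement keeps the backbone stable
            aux = self.aux_head(x)   # auxiliary map, typically used only in training
            return x, aux

Under this reading, one SegAdapter would be inserted after each backbone stage, the refined features would flow on to the (simply designed) decode head, and the auxiliary maps would contribute extra per-stage segmentation losses; the actual design may differ from this sketch.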