Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation

Zhihua Liu, Amrutha Saseendran, Lei Tong, Xilin He, Fariba Yousefi, Nikolay Burlutskiy, Dino Oglic, Tom Diethe, Philip Alexander Teare, Huiyu Zhou, Chen Jin
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:39420-39454, 2025.

Abstract

Open-set image segmentation poses a significant challenge because existing methods often demand extensive training or fine-tuning and generally struggle to segment unified objects consistently across diverse text reference expressions. Motivated by this, we propose Segment Anyword, a novel training-free visual concept prompt learning approach for open-set language grounded segmentation that relies on token-level cross-attention maps from a frozen diffusion model to produce segmentation surrogates or mask prompts, which are then refined into targeted object masks. Initial prompts typically lack coherence and consistency as the complexity of the image-text increases, resulting in suboptimal mask fragments. To tackle this issue, we further introduce a novel linguistic-guided visual prompt regularization that binds and clusters visual prompts based on sentence dependency and syntactic structural information, enabling the extraction of robust, noise-tolerant mask prompts, and significant improvements in segmentation accuracy. The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf, which is the most complex open-set grounded segmentation task in the field.
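Below is a minimal conceptual sketch (not the authors' released code) of the two ideas summarized in the abstract: turning per-token cross-attention maps from a frozen diffusion model into coarse mask prompts, and binding modifier tokens to their noun heads via dependency parsing before fusing their maps. The names `attn_maps`, `extract_mask_prompts`, the spaCy-based grouping, and the fixed threshold are illustrative assumptions; the attention maps are assumed to be precomputed (e.g., during inversion of the input image with a frozen text-to-image diffusion model).

```python
# Hedged sketch of linguistic-guided mask-prompt extraction.
# Assumes cross-attention maps have already been obtained elsewhere.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")  # lightweight English dependency parser


def group_tokens_by_head(expression: str) -> dict[str, list[str]]:
    """Cluster each noun with its syntactic modifiers (adjectives, compounds, numerals)."""
    doc = nlp(expression)
    groups: dict[str, list[str]] = {}
    for tok in doc:
        if tok.pos_ in ("NOUN", "PROPN"):
            groups.setdefault(tok.text, [tok.text])
    for tok in doc:
        if tok.dep_ in ("amod", "compound", "nummod") and tok.head.text in groups:
            groups[tok.head.text].append(tok.text)
    return groups


def extract_mask_prompts(
    expression: str,
    attn_maps: dict[str, np.ndarray],  # token -> (H, W) cross-attention map
    threshold: float = 0.5,
) -> dict[str, np.ndarray]:
    """Average the attention maps within each linguistic group, normalize,
    and threshold them into binary mask prompts, one per referred object."""
    prompts: dict[str, np.ndarray] = {}
    for head, tokens in group_tokens_by_head(expression).items():
        maps = [attn_maps[t] for t in tokens if t in attn_maps]
        if not maps:
            continue
        fused = np.mean(maps, axis=0)
        fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)
        # Coarse mask prompt; a downstream refiner (e.g., a promptable
        # segmenter) would turn this into the final object mask.
        prompts[head] = fused > threshold
    return prompts


if __name__ == "__main__":
    # Toy usage with random maps standing in for real diffusion cross-attention.
    expr = "the small brown dog next to the red car"
    fake_maps = {t: np.random.rand(64, 64) for t in expr.split()}
    for obj, mask in extract_mask_prompts(expr, fake_maps).items():
        print(obj, mask.shape, int(mask.sum()))
```

The grouping step is what the abstract calls linguistic-guided visual prompt regularization in spirit: by fusing a noun's map with those of its dependents, fragmented or noisy per-token maps are pooled into a single, more stable prompt per referred object.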

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-liu25bj,
  title     = {Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation},
  author    = {Liu, Zhihua and Saseendran, Amrutha and Tong, Lei and He, Xilin and Yousefi, Fariba and Burlutskiy, Nikolay and Oglic, Dino and Diethe, Tom and Teare, Philip Alexander and Zhou, Huiyu and Jin, Chen},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {39420--39454},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/liu25bj/liu25bj.pdf},
  url       = {https://proceedings.mlr.press/v267/liu25bj.html},
  abstract  = {Open-set image segmentation poses a significant challenge because existing methods often demand extensive training or fine-tuning and generally struggle to segment unified objects consistently across diverse text reference expressions. Motivated by this, we propose Segment Anyword, a novel training-free visual concept prompt learning approach for open-set language grounded segmentation that relies on token-level cross-attention maps from a frozen diffusion model to produce segmentation surrogates or mask prompts, which are then refined into targeted object masks. Initial prompts typically lack coherence and consistency as the complexity of the image-text increases, resulting in suboptimal mask fragments. To tackle this issue, we further introduce a novel linguistic-guided visual prompt regularization that binds and clusters visual prompts based on sentence dependency and syntactic structural information, enabling the extraction of robust, noise-tolerant mask prompts, and significant improvements in segmentation accuracy. The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf, which is the most complex open-set grounded segmentation task in the field.}
}
Endnote
%0 Conference Paper
%T Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation
%A Zhihua Liu
%A Amrutha Saseendran
%A Lei Tong
%A Xilin He
%A Fariba Yousefi
%A Nikolay Burlutskiy
%A Dino Oglic
%A Tom Diethe
%A Philip Alexander Teare
%A Huiyu Zhou
%A Chen Jin
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-liu25bj
%I PMLR
%P 39420--39454
%U https://proceedings.mlr.press/v267/liu25bj.html
%V 267
%X Open-set image segmentation poses a significant challenge because existing methods often demand extensive training or fine-tuning and generally struggle to segment unified objects consistently across diverse text reference expressions. Motivated by this, we propose Segment Anyword, a novel training-free visual concept prompt learning approach for open-set language grounded segmentation that relies on token-level cross-attention maps from a frozen diffusion model to produce segmentation surrogates or mask prompts, which are then refined into targeted object masks. Initial prompts typically lack coherence and consistency as the complexity of the image-text increases, resulting in suboptimal mask fragments. To tackle this issue, we further introduce a novel linguistic-guided visual prompt regularization that binds and clusters visual prompts based on sentence dependency and syntactic structural information, enabling the extraction of robust, noise-tolerant mask prompts, and significant improvements in segmentation accuracy. The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf, which is the most complex open-set grounded segmentation task in the field.
APA
Liu, Z., Saseendran, A., Tong, L., He, X., Yousefi, F., Burlutskiy, N., Oglic, D., Diethe, T., Teare, P.A., Zhou, H. & Jin, C. (2025). Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:39420-39454. Available from https://proceedings.mlr.press/v267/liu25bj.html.
