Prompting vision-language fusion for Zero-Shot Composed Image Retrieval

Peng Wang, Zining Chen, Zhicheng Zhao, Fei Su
Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:671-686, 2025.

Abstract

Composed image retrieval (CIR) aims to retrieve a target image given a query that combines a reference image with a textual description. Recently, benefiting from vision-language pretrained (VLP) models and large language models (LLMs), textual inversion and large-scale dataset generation have become popular approaches to the zero-shot CIR task (ZS-CIR). However, existing ZS-CIR models overlook the case where the textual description is too brief or inherently inaccurate, making it difficult to effectively integrate the reference image into the query for retrieving the target image. To address this problem, we propose a simple yet effective method, prompting vision-language fusion (PVLF), which adapts representations in VLP models to dynamically fuse the vision and language (V&L) representation spaces. In addition, by injecting learnable context prompt tokens into the Transformer fusion encoder, PVLF promotes comprehensive coupling between the V&L modalities, enriching the semantic representation of the query. We evaluate the effectiveness and robustness of our method on various VLP backbones, and the experimental results show that the proposed PVLF outperforms previous methods and achieves state-of-the-art performance on two public ZS-CIR benchmarks (CIRR and FashionIQ).
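
For intuition, below is a minimal, illustrative PyTorch sketch of the fusion idea described in the abstract: learnable context prompt tokens are concatenated with image and text token features from a frozen VLP backbone and passed through a small Transformer fusion encoder to produce a composed query embedding. All names, dimensions, and design choices here (PromptedFusionEncoder, num_prompts, depth, mean pooling over the prompt positions) are assumptions for illustration, not the authors' actual PVLF implementation.

import torch
import torch.nn as nn

class PromptedFusionEncoder(nn.Module):
    """Illustrative sketch: fuse image and text token features with learnable
    context prompt tokens via a small Transformer encoder. Hyper-parameters
    are placeholders, not the paper's settings."""

    def __init__(self, dim=512, num_prompts=8, depth=2, num_heads=8):
        super().__init__()
        # Learnable context prompt tokens injected into the fusion encoder.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, Ni, dim) from a frozen VLP image encoder
        # txt_tokens: (B, Nt, dim) from a frozen VLP text encoder
        b = img_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompts, img_tokens, txt_tokens], dim=1)
        x = self.fusion(x)
        # Pool the prompt positions into a single composed query embedding.
        query = x[:, : prompts.size(1)].mean(dim=1)
        return nn.functional.normalize(query, dim=-1)

# Usage: retrieval would then score cosine similarity between the composed
# query and target-image embeddings from the same (frozen) VLP image encoder.
fuser = PromptedFusionEncoder()
img_feats = torch.randn(4, 50, 512)   # e.g. projected patch tokens
txt_feats = torch.randn(4, 77, 512)   # e.g. projected text tokens
query = fuser(img_feats, txt_feats)   # (4, 512) composed query embeddings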

Cite this Paper


BibTeX
@InProceedings{pmlr-v260-wang25d,
  title     = {Prompting vision-language fusion for Zero-Shot Composed Image Retrieval},
  author    = {Wang, Peng and Chen, Zining and Zhao, Zhicheng and Su, Fei},
  booktitle = {Proceedings of the 16th Asian Conference on Machine Learning},
  pages     = {671--686},
  year      = {2025},
  editor    = {Nguyen, Vu and Lin, Hsuan-Tien},
  volume    = {260},
  series    = {Proceedings of Machine Learning Research},
  month     = {05--08 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v260/main/assets/wang25d/wang25d.pdf},
  url       = {https://proceedings.mlr.press/v260/wang25d.html},
  abstract  = {The composed image retrieval (CIR) aims to retrieve target image given the combination of an image and a textual description as a query. Recently, benefiting from vision-language pretrained (VLP) models and large language models (LLM), the use of textual inversion or generating large-scale datasets has become a novel approach for zero-shot CIR task (ZS-CIR). However, the existing ZS-CIR models overlook one case where the textual description is often too brief or inherently inaccurate, making it challenging to effectively integrate the reference image into the query for retrieving the target image. To address this problem, we propose a simple yet effective method—prompting vision-language fusion (PVLF), which adapts representations in VLP models to dynamically fuse the vision and language (V&L) representation spaces. In addition, by injecting the context learnable prompt tokens in Transformer fusion encoder, the PVLF promotes the comprehensive coupling between V&L modalities, enriching the semantic representation of the query. We evaluate the effectiveness and robustness of our method on various VLP backbones, and the experimental results show that the proposed PVLF outperforms previous methods and achieves the state-of-the-art on two public ZS-CIR benchmarks (CIRR and FashionIQ).}
}
Endnote
%0 Conference Paper
%T Prompting vision-language fusion for Zero-Shot Composed Image Retrieval
%A Peng Wang
%A Zining Chen
%A Zhicheng Zhao
%A Fei Su
%B Proceedings of the 16th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Vu Nguyen
%E Hsuan-Tien Lin
%F pmlr-v260-wang25d
%I PMLR
%P 671--686
%U https://proceedings.mlr.press/v260/wang25d.html
%V 260
%X The composed image retrieval (CIR) aims to retrieve target image given the combination of an image and a textual description as a query. Recently, benefiting from vision-language pretrained (VLP) models and large language models (LLM), the use of textual inversion or generating large-scale datasets has become a novel approach for zero-shot CIR task (ZS-CIR). However, the existing ZS-CIR models overlook one case where the textual description is often too brief or inherently inaccurate, making it challenging to effectively integrate the reference image into the query for retrieving the target image. To address this problem, we propose a simple yet effective method—prompting vision-language fusion (PVLF), which adapts representations in VLP models to dynamically fuse the vision and language (V&L) representation spaces. In addition, by injecting the context learnable prompt tokens in Transformer fusion encoder, the PVLF promotes the comprehensive coupling between V&L modalities, enriching the semantic representation of the query. We evaluate the effectiveness and robustness of our method on various VLP backbones, and the experimental results show that the proposed PVLF outperforms previous methods and achieves the state-of-the-art on two public ZS-CIR benchmarks (CIRR and FashionIQ).
APA
Wang, P., Chen, Z., Zhao, Z. & Su, F. (2025). Prompting vision-language fusion for Zero-Shot Composed Image Retrieval. Proceedings of the 16th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 260:671-686. Available from https://proceedings.mlr.press/v260/wang25d.html.
