Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval

Guofeng Ding, Yiding Lu, Peng Hu, Mouxing Yang, Yijie Lin, Xi Peng
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:13825-13844, 2025.

Abstract

Text-to-visual retrieval often struggles with semantic redundancy and granularity mismatches between textual queries and visual content. Unlike existing methods that address these challenges during training, we propose VISual Abstraction (VISA), a test-time approach that enhances retrieval by transforming visual content into textual descriptions using off-the-shelf large models. The generated text descriptions, with their dense semantics, naturally filter out low-level redundant visual information. To further address granularity issues, VISA incorporates a question-answering process, enhancing the text description with the specific granularity information requested by the user. Extensive experiments demonstrate that VISA brings substantial improvements in text-to-image and text-to-video retrieval for both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems.
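
As an illustration only, the pipeline described in the abstract can be sketched in Python. This is a toy under stated assumptions, not the authors' implementation. The functions `describe`, `answer`, `bow_cosine`, and `rerank` are hypothetical names, the lookup tables stand in for the off-the-shelf large models, and bag-of-words cosine similarity stands in for whatever text matcher a real system would use.

```python
import math
from collections import Counter

# Hypothetical stand-ins for the off-the-shelf large models: a real system
# would call a captioning model and a question-answering model here.
CAPTIONS = {
    "img_a": "a dog running on a beach",
    "img_b": "a dog sleeping on a sofa",
}
ANSWERS = {
    ("img_a", "what color is the dog"): "the dog is brown",
    ("img_b", "what color is the dog"): "the dog is white",
}


def describe(item_id):
    """Return a general text description of a visual item (assumed lookup)."""
    return CAPTIONS[item_id]


def answer(item_id, question):
    """Return an answer about a visual item, or an empty string if unknown."""
    return ANSWERS.get((item_id, question), "")


def bow_cosine(a, b):
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def rerank(query, candidates, questions):
    """Order candidates by text similarity between the query and each
    candidate's description enriched with question-answering output."""
    scored = []
    for item_id in candidates:
        extra = " ".join(answer(item_id, q) for q in questions)
        text = f"{describe(item_id)} {extra}".strip()
        scored.append((bow_cosine(query, text), item_id))
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in ranked]


ranking = rerank(
    "a white dog sleeping on a sofa",
    ["img_a", "img_b"],
    ["what color is the dog"],
)
```

Here the question about color adds the granularity that the general captions lack, so the enriched description of `img_b` matches the query more closely than that of `img_a`.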

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-ding25b,
  title = {Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval},
  author = {Ding, Guofeng and Lu, Yiding and Hu, Peng and Yang, Mouxing and Lin, Yijie and Peng, Xi},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {13825--13844},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/ding25b/ding25b.pdf},
  url = {https://proceedings.mlr.press/v267/ding25b.html},
  abstract = {Text-to-visual retrieval often struggles with semantic redundancy and granularity mismatches between textual queries and visual content. Unlike existing methods that address these challenges during training, we propose VISual Abstraction (VISA), a test-time approach that enhances retrieval by transforming visual content into textual descriptions using off-the-shelf large models. The generated text descriptions, with their dense semantics, naturally filter out low-level redundant visual information. To further address granularity issues, VISA incorporates a question-answering process, enhancing the text description with the specific granularity information requested by the user. Extensive experiments demonstrate that VISA brings substantial improvements in text-to-image and text-to-video retrieval for both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems.}
}
Endnote
%0 Conference Paper
%T Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval
%A Guofeng Ding
%A Yiding Lu
%A Peng Hu
%A Mouxing Yang
%A Yijie Lin
%A Xi Peng
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-ding25b
%I PMLR
%P 13825--13844
%U https://proceedings.mlr.press/v267/ding25b.html
%V 267
%X Text-to-visual retrieval often struggles with semantic redundancy and granularity mismatches between textual queries and visual content. Unlike existing methods that address these challenges during training, we propose VISual Abstraction (VISA), a test-time approach that enhances retrieval by transforming visual content into textual descriptions using off-the-shelf large models. The generated text descriptions, with their dense semantics, naturally filter out low-level redundant visual information. To further address granularity issues, VISA incorporates a question-answering process, enhancing the text description with the specific granularity information requested by the user. Extensive experiments demonstrate that VISA brings substantial improvements in text-to-image and text-to-video retrieval for both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems.
APA
Ding, G., Lu, Y., Hu, P., Yang, M., Lin, Y., & Peng, X. (2025). Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:13825-13844. Available from https://proceedings.mlr.press/v267/ding25b.html.
