Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval

Guofeng Ding; Yiding Lu; Peng Hu; Mouxing Yang; Yijie Lin; Xi Peng

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:13825-13844, 2025.

Abstract

Text-to-visual retrieval often struggles with semantic redundancy and granularity mismatches between textual queries and visual content. Unlike existing methods that address these challenges during training, we propose VISual Abstraction (VISA), a test-time approach that enhances retrieval by transforming visual content into textual descriptions using off-the-shelf large models. The generated text descriptions, with their dense semantics, naturally filter out low-level redundant visual information. To further address granularity issues, VISA incorporates a question-answering process, enhancing the text description with the specific granularity information requested by the user. Extensive experiments demonstrate that VISA brings substantial improvements in text-to-image and text-to-video retrieval for both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems.
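The idea described in the abstract (replace each visual item with a text description, optionally enrich it with question-answering output, then match query text against that text) can be illustrated with a toy sketch. This is not the authors' implementation: the captions and QA answers below are hard-coded, hypothetical stand-ins for what off-the-shelf large models would generate, the bag-of-words cosine similarity is a placeholder for a real text retriever, and all names (`CAPTIONS`, `QA_ANSWERS`, `abstract_item`, `rank`) are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Bag-of-words cosine similarity (placeholder for a real text encoder)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical captions; in practice a captioning model would produce these.
CAPTIONS = {
    "img_a": "a dog running on a beach",
    "img_b": "a dog sleeping on a sofa",
    "img_c": "a city street at night",
}

# Hypothetical QA answers; in practice the questions would be derived from
# the user's query and answered by a visual question-answering model.
QA_ANSWERS = {
    "img_a": "the dog is brown and wears a red collar",
    "img_b": "the dog is white and wears no collar",
    "img_c": "there is no dog",
}


def abstract_item(item_id, use_qa=True):
    """Return the text that stands in for a visual item."""
    text = CAPTIONS[item_id]
    if use_qa:
        text += ". " + QA_ANSWERS[item_id]
    return text


def rank(query, candidates, use_qa=True):
    """Order candidates by text-to-text similarity with the query."""
    return sorted(
        candidates,
        key=lambda item_id: cosine(query, abstract_item(item_id, use_qa)),
        reverse=True,
    )


query = "a brown dog with a red collar running on a beach"
ranking = rank(query, ["img_a", "img_b", "img_c"])
print(ranking)
```

In this sketch the candidate list would come from an existing retrieval system, which is one plausible reading of "plug-and-play"; how VISA actually combines its scores with the base retriever is not stated in the abstract.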

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-ding25b,
  title = 	 {Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval},
  author =       {Ding, Guofeng and Lu, Yiding and Hu, Peng and Yang, Mouxing and Lin, Yijie and Peng, Xi},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {13825--13844},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/ding25b/ding25b.pdf},
  url = 	 {https://proceedings.mlr.press/v267/ding25b.html},
  abstract = 	 {Text-to-visual retrieval often struggles with semantic redundancy and granularity mismatches between textual queries and visual content. Unlike existing methods that address these challenges during training, we propose VISual Abstraction (VISA), a test-time approach that enhances retrieval by transforming visual content into textual descriptions using off-the-shelf large models. The generated text descriptions, with their dense semantics, naturally filter out low-level redundant visual information. To further address granularity issues, VISA incorporates a question-answering process, enhancing the text description with the specific granularity information requested by the user. Extensive experiments demonstrate that VISA brings substantial improvements in text-to-image and text-to-video retrieval for both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems.}
}

Endnote

%0 Conference Paper
%T Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval
%A Guofeng Ding
%A Yiding Lu
%A Peng Hu
%A Mouxing Yang
%A Yijie Lin
%A Xi Peng
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-ding25b
%I PMLR
%P 13825--13844
%U https://proceedings.mlr.press/v267/ding25b.html
%V 267
%X Text-to-visual retrieval often struggles with semantic redundancy and granularity mismatches between textual queries and visual content. Unlike existing methods that address these challenges during training, we propose VISual Abstraction (VISA), a test-time approach that enhances retrieval by transforming visual content into textual descriptions using off-the-shelf large models. The generated text descriptions, with their dense semantics, naturally filter out low-level redundant visual information. To further address granularity issues, VISA incorporates a question-answering process, enhancing the text description with the specific granularity information requested by the user. Extensive experiments demonstrate that VISA brings substantial improvements in text-to-image and text-to-video retrieval for both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems.

APA

Ding, G., Lu, Y., Hu, P., Yang, M., Lin, Y., & Peng, X. (2025). Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:13825-13844. Available from https://proceedings.mlr.press/v267/ding25b.html.
