Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation

Shu Zhao; Tianyi Shen; Nilesh Ahuja; Omesh Tickoo; Vijaykrishnan Narayanan

Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:168-182, 2026.

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method for generating factual and up-to-date responses from Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modalities to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module that decides on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs’ ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves both efficiency and generation quality by 17.70% while reducing retrieval times by 8.95%.

Cite this Paper

BibTeX

@InProceedings{pmlr-v322-zhao26a,
  title = 	 {Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation},
  author =       {Zhao, Shu and Shen, Tianyi and Ahuja, Nilesh and Tickoo, Omesh and Narayanan, Vijaykrishnan},
  booktitle = 	 {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models},
  pages = 	 {168--182},
  year = 	 {2026},
  editor = 	 {Fumero, Marco and Domine, Clementine and L{\"a}hner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex},
  volume = 	 {322},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {06 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v322/main/assets/zhao26a/zhao26a.pdf},
  url = 	 {https://proceedings.mlr.press/v322/zhao26a.html},
  abstract = 	 {Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modalities to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs’ ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves both efficiency and generation quality by 17.70% while reducing 8.95% retrieval times.}
}

Endnote

%0 Conference Paper
%T Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation
%A Shu Zhao
%A Tianyi Shen
%A Nilesh Ahuja
%A Omesh Tickoo
%A Vijaykrishnan Narayanan
%B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2026
%E Marco Fumero
%E Clementine Domine
%E Zorah Lähner
%E Irene Cannistraci
%E Bo Zhao
%E Alex Williams
%F pmlr-v322-zhao26a
%I PMLR
%P 168--182
%U https://proceedings.mlr.press/v322/zhao26a.html
%V 322
%X Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modalities to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs’ ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves both efficiency and generation quality by 17.70% while reducing 8.95% retrieval times.

APA

Zhao, S., Shen, T., Ahuja, N., Tickoo, O., & Narayanan, V. (2026). Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:168-182. Available from https://proceedings.mlr.press/v322/zhao26a.html.
