Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation

Shu Zhao, Tianyi Shen, Nilesh Ahuja, Omesh Tickoo, Vijaykrishnan Narayanan
Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:168-182, 2026.

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modalities to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs’ ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves both efficiency and generation quality by 17.70% while reducing 8.95% retrieval times.

Cite this Paper


BibTeX
@InProceedings{pmlr-v322-zhao26a, title = {Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation}, author = {Zhao, Shu and Shen, Tianyi and Ahuja, Nilesh and Tickoo, Omesh and Narayanan, Vijaykrishnan}, booktitle = {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models}, pages = {168--182}, year = {2026}, editor = {Fumero, Marco and Domine, Clementine and L"ahner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex}, volume = {322}, series = {Proceedings of Machine Learning Research}, month = {06 Dec}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v322/main/assets/zhao26a/zhao26a.pdf}, url = {https://proceedings.mlr.press/v322/zhao26a.html}, abstract = {Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modalities to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs’ ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves both efficiency and generation quality by 17.70% while reducing 8.95% retrieval times.} }
Endnote
%0 Conference Paper %T Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation %A Shu Zhao %A Tianyi Shen %A Nilesh Ahuja %A Omesh Tickoo %A Vijaykrishnan Narayanan %B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models %C Proceedings of Machine Learning Research %D 2026 %E Marco Fumero %E Clementine Domine %E Zorah L"ahner %E Irene Cannistraci %E Bo Zhao %E Alex Williams %F pmlr-v322-zhao26a %I PMLR %P 168--182 %U https://proceedings.mlr.press/v322/zhao26a.html %V 322 %X Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modalities to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs’ ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves both efficiency and generation quality by 17.70% while reducing 8.95% retrieval times.
APA
Zhao, S., Shen, T., Ahuja, N., Tickoo, O. & Narayanan, V.. (2026). Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:168-182 Available from https://proceedings.mlr.press/v322/zhao26a.html.

Related Material