Modularized Self-Reflected Video Reasoner for Multimodal LLM with Application to Video Question Answering
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:56389-56413, 2025.
Abstract
Multimodal Large Language Models (Multimodal LLMs) have shown their strength in Video Question Answering (VideoQA). However, due to the black-box nature of end-to-end training strategies, existing approaches based on Multimodal LLMs suffer from a lack of interpretability for VideoQA: they can neither present reasoning paths nor indicate where in the video the answers are derived from. To address this issue, we propose MSR-ViR (Modularized Self-Reflected Video Reasoner), which for the first time integrates modular networks into Multimodal LLMs, providing VideoQA with explicit reasoning paths for greater interpretability. Specifically, a MoST-Grounding (Modularized Spatial-Temporal Grounding) network is proposed to decompose complex questions via tree-structured policies, localizing relevant temporal and spatial segments within videos through step-by-step reasoning. The MoST-Grounding network supplies Multimodal LLMs with explicitly grounded visual information and clear reasoning paths, thus making the predicted answers more interpretable. To further improve reasoning quality, we design an Alternate Self-reflection Training Strategy that jointly optimizes policy generation and the Multimodal LLM. Experiments on real-world datasets demonstrate the superiority of the proposed MSR-ViR framework in video understanding, reasoning transparency, and providing explicit localization evidence for answers.
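
The abstract only describes the pipeline at a high level. As a rough Python sketch of the idea (a question decomposed into a tree-structured policy, grounding modules executed step by step, and the resulting evidence passed to a Multimodal LLM for answering), one might write something like the following. Every class and function name here is a hypothetical placeholder for illustration, not the paper's actual implementation or API.

```python
from dataclasses import dataclass, field
from typing import List

# All names below are hypothetical placeholders used only to illustrate the
# pipeline described in the abstract; they are not the paper's code.

@dataclass
class PolicyNode:
    """One step in a tree-structured decomposition of a question."""
    module: str                      # e.g. "temporal_ground" or "spatial_ground"
    argument: str                    # phrase or sub-question this step handles
    children: List["PolicyNode"] = field(default_factory=list)

class StubMultimodalLLM:
    """Minimal stand-in for a Multimodal LLM (an assumption, not the paper's model)."""

    def generate_policy(self, question: str) -> PolicyNode:
        # A real system would parse the question into a policy tree; here we
        # hard-code a temporal-then-spatial decomposition for illustration.
        return PolicyNode(
            module="temporal_ground",
            argument=question,
            children=[PolicyNode(module="spatial_ground", argument=question)],
        )

    def generate(self, prompt: str) -> str:
        return f"(answer conditioned on grounded evidence) {prompt[:60]}..."

def execute_policy(node: PolicyNode, video_frames: list) -> dict:
    """Recursively run grounding modules, collecting evidence bottom-up."""
    child_evidence = [execute_policy(c, video_frames) for c in node.children]
    if node.module == "temporal_ground":
        evidence = {"segment": video_frames[10:20]}          # placeholder clip
    elif node.module == "spatial_ground":
        evidence = {"boxes": [(0.1, 0.2, 0.5, 0.6)]}         # placeholder region
    else:
        evidence = {}
    evidence["children"] = child_evidence
    return evidence

def answer(question: str, video_frames: list, llm: StubMultimodalLLM) -> str:
    """Decompose the question, ground it in the video, then query the LLM."""
    policy = llm.generate_policy(question)                   # tree-structured policy
    evidence = execute_policy(policy, video_frames)
    prompt = f"Question: {question}\nGrounded evidence: {evidence}\nAnswer:"
    return llm.generate(prompt)

if __name__ == "__main__":
    frames = list(range(64))                                  # dummy frame indices
    print(answer("What does the person pick up after opening the fridge?",
                 frames, StubMultimodalLLM()))
```

The point of the sketch is the interpretability argument: because grounding is performed by explicit modules along the policy tree, the evidence dictionary records which segments and regions the final answer was conditioned on, rather than leaving the reasoning implicit in end-to-end weights.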