MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition

Yuhuan Yang, Chaofan Ma, Zhenjie Mao, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:70914-70926, 2025.

Abstract

Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba’s selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency. Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost. Codes will be released upon publication.
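The abstract describes the approach only at a high level, so the sketch below is an illustrative, hedged reading of what a modulation-style adapter around a frozen image foundation model could look like. It is not the authors' SeqMod or Divide-and-Modulate implementation: the temporal sequence model here is a plain GRU standing in for Mamba's selective state space model, and the FiLM-style scale/shift modulation, tensor shapes, and module names are assumptions made for illustration only.

# Illustrative sketch only -- NOT the paper's SeqMod / Divide-and-Modulate code.
# A frozen image-foundation-model (IFM) block is "modulated" by scale/shift
# signals produced from a lightweight temporal sequence model over frames.
import torch
import torch.nn as nn

class SeqModAdapter(nn.Module):
    """Hypothetical adapter: modulates frozen per-frame features with a
    temporal sequence model, FiLM-style (scale and shift)."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        # Stand-in for Mamba's selective SSM: any causal sequence model fits here.
        self.seq_model = nn.GRU(dim, hidden, batch_first=True)
        self.to_scale_shift = nn.Linear(hidden, 2 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -- batch, frames, tokens per frame, channel dim
        B, T, N, D = x.shape
        frame_feat = x.mean(dim=2)                     # pool tokens per frame -> (B, T, D)
        temporal, _ = self.seq_model(frame_feat)       # model the frame sequence -> (B, T, hidden)
        scale, shift = self.to_scale_shift(temporal).chunk(2, dim=-1)  # each (B, T, D)
        # Modulate the frozen features without overwriting them (residual-style).
        return x * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)

class FrozenBlockWithAdapter(nn.Module):
    """Wrap a frozen IFM transformer block with the adapter; only the adapter trains."""
    def __init__(self, ifm_block: nn.Module, dim: int):
        super().__init__()
        self.ifm_block = ifm_block
        for p in self.ifm_block.parameters():
            p.requires_grad = False                    # keep pre-trained weights intact
        self.adapter = SeqModAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, D = x.shape
        # Spatial pass: run each frame independently through the frozen block.
        spatial = self.ifm_block(x.reshape(B * T, N, D)).reshape(B, T, N, D)
        # Temporal pass: inject spatial-temporal information via modulation.
        return self.adapter(spatial)

if __name__ == "__main__":
    dim = 768
    ifm_block = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
    block = FrozenBlockWithAdapter(ifm_block, dim)
    video_tokens = torch.randn(2, 8, 196, dim)         # 2 clips, 8 frames, 14x14 patches
    out = block(video_tokens)
    print(out.shape)                                   # torch.Size([2, 8, 196, 768])

In this sketch, modulating (scaling and shifting) the frozen features, rather than replacing them, is what keeps the pre-trained representation intact while still injecting temporal context; the paper's actual SeqMod operation and Mamba-based design should be consulted for the real mechanism.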

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-yang25t,
  title     = {{M}o{M}a: Modulating Mamba for Adapting Image Foundation Models to Video Recognition},
  author    = {Yang, Yuhuan and Ma, Chaofan and Mao, Zhenjie and Yao, Jiangchao and Zhang, Ya and Wang, Yanfeng},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {70914--70926},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/yang25t/yang25t.pdf},
  url       = {https://proceedings.mlr.press/v267/yang25t.html},
  abstract  = {Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba’s selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency. Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost. Codes will be released upon publication.}
}
Endnote
%0 Conference Paper
%T MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition
%A Yuhuan Yang
%A Chaofan Ma
%A Zhenjie Mao
%A Jiangchao Yao
%A Ya Zhang
%A Yanfeng Wang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-yang25t
%I PMLR
%P 70914--70926
%U https://proceedings.mlr.press/v267/yang25t.html
%V 267
%X Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba’s selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency. Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost. Codes will be released upon publication.
APA
Yang, Y., Ma, C., Mao, Z., Yao, J., Zhang, Y. & Wang, Y. (2025). MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:70914-70926. Available from https://proceedings.mlr.press/v267/yang25t.html.