VL-Mamba: Exploring State Space Models for Multimodal Learning

Yanyuan Qiao, Zheng Yu, Zijia Zhao, Sihan Chen, Mingzhen Sun, Longteng Guo, Qi Wu, Jing Liu
Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, PMLR 262:102-113, 2024.

Abstract

Multimodal large language models (MLLMs) have gained considerable attention due to their ability to integrate visual and textual information, enhancing understanding and providing context for complex tasks. While Transformer-based architectures have been the dominant framework for MLLMs, recent studies suggest that state space models (SSMs) like Mamba can achieve competitive or even superior performance. However, no prior research has investigated the potential of SSMs to replace Transformers in multimodal tasks, which are inherently more challenging due to the heterogeneity of visual and language data and the complexities of aligning these modalities. In this paper, we introduce VL-Mamba, the first study to explore the application of state space models in multimodal learning tasks. VL-Mamba leverages a pretrained Mamba language model as its core, and we propose a novel MultiModal Connector (MMC) that incorporates a Vision Selective Scan (VSS) module to improve visual sequence modeling. We empirically explore how to effectively apply the 2D vision selective scan mechanism for multimodal learning and the combinations of different vision encoders and variants of pretrained Mamba language models. Our experiments across multiple multimodal benchmarks demonstrate that VL-Mamba achieves competitive performance against small MLLMs of similar size, and in some cases, surpasses larger models such as the 7B and 13B versions of LLaVA-1.5. These results suggest that state space models have the potential to serve as an alternative to Transformers in multimodal learning tasks.
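For readers who want a concrete picture of the pipeline the abstract describes (vision encoder features passed through a MultiModal Connector with a Vision Selective Scan, then fed to a pretrained Mamba language model), below is a minimal, hypothetical sketch of the data flow in PyTorch. All module names, dimensions, and the particular four-direction traversal are illustrative assumptions inferred from the abstract, not the authors' implementation; the pretrained vision encoder and Mamba LM are omitted and represented only by their input/output shapes.

```python
# Hypothetical sketch of the VL-Mamba data flow described in the abstract.
# Module names (VisionSelectiveScan, MultiModalConnector) and all layer
# choices are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn


class VisionSelectiveScan(nn.Module):
    """Illustrates the 2D selective-scan idea: traverse the patch grid in four
    orders (row-major, column-major, and their reverses) and merge the results."""

    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.Linear(dim, dim)  # stand-in for a per-direction SSM block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) grid of patch features from a vision encoder
        b, h, w, c = x.shape
        scans = [
            x.reshape(b, h * w, c),                          # row-major
            x.transpose(1, 2).reshape(b, h * w, c),          # column-major
            x.reshape(b, h * w, c).flip(1),                  # reversed row-major
            x.transpose(1, 2).reshape(b, h * w, c).flip(1),  # reversed column-major
        ]
        # Process each traversal and average; a real VSS would run an SSM per scan.
        return torch.stack([self.mix(s) for s in scans], dim=0).mean(0)


class MultiModalConnector(nn.Module):
    """VSS followed by a projection into the language model's embedding space."""

    def __init__(self, vis_dim: int, lm_dim: int):
        super().__init__()
        self.vss = VisionSelectiveScan(vis_dim)
        self.proj = nn.Sequential(nn.Linear(vis_dim, lm_dim), nn.GELU(),
                                  nn.Linear(lm_dim, lm_dim))

    def forward(self, patch_grid: torch.Tensor) -> torch.Tensor:
        # Returns visual tokens to be prepended to the text embeddings
        # before they enter the pretrained Mamba language model.
        return self.proj(self.vss(patch_grid))  # (B, H*W, lm_dim)


if __name__ == "__main__":
    # Toy shapes: a 24x24 patch grid with 1024-dim features, 2048-dim LM embeddings.
    patches = torch.randn(2, 24, 24, 1024)
    connector = MultiModalConnector(vis_dim=1024, lm_dim=2048)
    visual_tokens = connector(patches)
    print(visual_tokens.shape)  # torch.Size([2, 576, 2048])
```

The multi-directional traversal above is only one way to give a 1D sequence model a 2D notion of the patch grid; the paper's empirical study of how to apply the 2D vision selective scan, and of different vision encoder and Mamba LM combinations, is what determines the actual design.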

Cite this Paper

BibTeX
@InProceedings{pmlr-v262-qiao24a,
  title     = {{VL-Mamba}: Exploring State Space Models for Multimodal Learning},
  author    = {Qiao, Yanyuan and Yu, Zheng and Zhao, Zijia and Chen, Sihan and Sun, Mingzhen and Guo, Longteng and Wu, Qi and Liu, Jing},
  booktitle = {Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop},
  pages     = {102--113},
  year      = {2024},
  editor    = {Rezagholizadeh, Mehdi and Passban, Peyman and Samiee, Soheila and Partovi Nia, Vahid and Cheng, Yu and Deng, Yue and Liu, Qun and Chen, Boxing},
  volume    = {262},
  series    = {Proceedings of Machine Learning Research},
  month     = {14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v262/main/assets/qiao24a/qiao24a.pdf},
  url       = {https://proceedings.mlr.press/v262/qiao24a.html}
}
Endnote
%0 Conference Paper
%T VL-Mamba: Exploring State Space Models for Multimodal Learning
%A Yanyuan Qiao
%A Zheng Yu
%A Zijia Zhao
%A Sihan Chen
%A Mingzhen Sun
%A Longteng Guo
%A Qi Wu
%A Jing Liu
%B Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop
%C Proceedings of Machine Learning Research
%D 2024
%E Mehdi Rezagholizadeh
%E Peyman Passban
%E Soheila Samiee
%E Vahid Partovi Nia
%E Yu Cheng
%E Yue Deng
%E Qun Liu
%E Boxing Chen
%F pmlr-v262-qiao24a
%I PMLR
%P 102--113
%U https://proceedings.mlr.press/v262/qiao24a.html
%V 262
APA
Qiao, Y., Yu, Z., Zhao, Z., Chen, S., Sun, M., Guo, L., Wu, Q. & Liu, J. (2024). VL-Mamba: Exploring State Space Models for Multimodal Learning. Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, in Proceedings of Machine Learning Research 262:102-113. Available from https://proceedings.mlr.press/v262/qiao24a.html.