FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Zehan Wang, Ziang Zhang, Xize Cheng, Rongjie Huang, Luping Liu, Zhenhui Ye, Haifeng Huang, Yang Zhao, Tao Jin, Peng Gao, Zhou Zhao
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:52233-52246, 2024.

Abstract

Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, the billions of model parameters and the catastrophic forgetting problem make it challenging to further enhance pre-trained unified spaces. In this work, we propose FreeBind, an idea that treats multimodal representation spaces as basic units and freely augments a pre-trained unified space by integrating knowledge from extra expert spaces via "space bonds". Specifically, we introduce two kinds of basic space bonds: 1) Space Displacement Bond and 2) Space Combination Bond. Based on these basic bonds, we design Complex Sequential & Parallel Bonds to effectively integrate multiple spaces simultaneously. Benefiting from the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we bind ImageBind with extra image-text and audio-text expert spaces, resulting in three main variants: ImageBind++, InternVL_IB, and InternVL_IB++. The resulting spaces outperform ImageBind on 5 audio-image-text downstream tasks across 9 datasets. Moreover, via customized inference, they even surpass the advanced audio-text and image-text expert spaces. Our code and checkpoints are released at https://github.com/zehanwang01/FreeBind.
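The abstract describes "space bonds" only at a conceptual level, so the sketch below is purely illustrative: a minimal, hypothetical Python example of how embeddings of the same inputs from a pre-trained unified space and an expert space could be fused by a weighted average of unit-normalized vectors. The function names, the fixed fusion weight, and the assumption that both spaces have already been projected to a shared dimensionality are ours; the paper's actual Space Displacement and Space Combination Bonds are defined in the full text and may work differently.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale embeddings to unit length so cosine similarity reduces to a dot product."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def combine_spaces(emb_unified, emb_expert, weight=0.5):
    """Hypothetical 'space combination'-style fusion (not the paper's exact procedure):
    average the unit-normalized embeddings of the same inputs from a unified space
    and an expert space, then re-normalize. Assumes both spaces already share the
    same embedding dimensionality."""
    fused = weight * l2_normalize(emb_unified) + (1.0 - weight) * l2_normalize(emb_expert)
    return l2_normalize(fused)

# Toy usage with random stand-ins for real model outputs.
rng = np.random.default_rng(0)
unified = rng.normal(size=(4, 512))  # e.g., embeddings from a unified space such as ImageBind
expert = rng.normal(size=(4, 512))   # e.g., embeddings from an image-text expert space
print(combine_spaces(unified, expert).shape)  # (4, 512)
```

Normalizing before and after fusion keeps the fused vectors directly usable for cosine-similarity retrieval, which is how unified multimodal spaces are typically evaluated.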

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-wang24co,
  title     = {{F}ree{B}ind: Free Lunch in Unified Multimodal Space via Knowledge Fusion},
  author    = {Wang, Zehan and Zhang, Ziang and Cheng, Xize and Huang, Rongjie and Liu, Luping and Ye, Zhenhui and Huang, Haifeng and Zhao, Yang and Jin, Tao and Gao, Peng and Zhao, Zhou},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {52233--52246},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/wang24co/wang24co.pdf},
  url       = {https://proceedings.mlr.press/v235/wang24co.html},
  abstract  = {Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, the billions of model parameters and the catastrophic forgetting problem make it challenging to further enhance pre-trained unified spaces. In this work, we propose FreeBind, an idea that treats multimodal representation spaces as basic units and freely augments a pre-trained unified space by integrating knowledge from extra expert spaces via "space bonds". Specifically, we introduce two kinds of basic space bonds: 1) Space Displacement Bond and 2) Space Combination Bond. Based on these basic bonds, we design Complex Sequential & Parallel Bonds to effectively integrate multiple spaces simultaneously. Benefiting from the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we bind ImageBind with extra image-text and audio-text expert spaces, resulting in three main variants: ImageBind++, InternVL_IB, and InternVL_IB++. The resulting spaces outperform ImageBind on 5 audio-image-text downstream tasks across 9 datasets. Moreover, via customized inference, they even surpass the advanced audio-text and image-text expert spaces. Our code and checkpoints are released at https://github.com/zehanwang01/FreeBind.}
}
Endnote
%0 Conference Paper
%T FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
%A Zehan Wang
%A Ziang Zhang
%A Xize Cheng
%A Rongjie Huang
%A Luping Liu
%A Zhenhui Ye
%A Haifeng Huang
%A Yang Zhao
%A Tao Jin
%A Peng Gao
%A Zhou Zhao
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-wang24co
%I PMLR
%P 52233--52246
%U https://proceedings.mlr.press/v235/wang24co.html
%V 235
%X Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, the billions of model parameters and the catastrophic forgetting problem make it challenging to further enhance pre-trained unified spaces. In this work, we propose FreeBind, an idea that treats multimodal representation spaces as basic units and freely augments a pre-trained unified space by integrating knowledge from extra expert spaces via "space bonds". Specifically, we introduce two kinds of basic space bonds: 1) Space Displacement Bond and 2) Space Combination Bond. Based on these basic bonds, we design Complex Sequential & Parallel Bonds to effectively integrate multiple spaces simultaneously. Benefiting from the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we bind ImageBind with extra image-text and audio-text expert spaces, resulting in three main variants: ImageBind++, InternVL_IB, and InternVL_IB++. The resulting spaces outperform ImageBind on 5 audio-image-text downstream tasks across 9 datasets. Moreover, via customized inference, they even surpass the advanced audio-text and image-text expert spaces. Our code and checkpoints are released at https://github.com/zehanwang01/FreeBind.
APA
Wang, Z., Zhang, Z., Cheng, X., Huang, R., Liu, L., Ye, Z., Huang, H., Zhao, Y., Jin, T., Gao, P., & Zhao, Z. (2024). FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:52233-52246. Available from https://proceedings.mlr.press/v235/wang24co.html.
