On Uni-Modal Feature Learning in Supervised Multi-Modal Learning

Chenzhuang Du, Jiaye Teng, Tingle Li, Yichen Liu, Tianyuan Yuan, Yue Wang, Yang Yuan, Hang Zhao
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:8632-8656, 2023.

Abstract

We abstract the features (i.e., learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. Multi-modal models are expected to benefit from cross-modal interactions on the basis of ensured uni-modal feature learning. However, recent supervised multi-modal late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality. We prove that this phenomenon hurts the model’s generalization ability. To this end, we propose choosing a targeted late-fusion learning method for a given supervised multi-modal task, either Uni-Modal Ensemble (UME) or the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, with this simple guiding strategy, we achieve results comparable to other, more complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.
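
To make the two candidate methods concrete, below is a minimal late-fusion sketch with both options: Uni-Modal Ensemble (UME), which combines independently trained uni-modal models, and Uni-Modal Teacher (UMT), which trains the fused model while distilling each modality's features from a uni-modal teacher. This is only an illustration under stated assumptions; the encoder architectures, the logit averaging in UME, and the MSE feature-distillation term in UMT are choices made for this sketch, not the paper's exact implementation.

```python
# Minimal sketch of late fusion with UME and UMT (illustrative assumptions,
# not the authors' exact architectures, losses, or training schedules).
import torch
import torch.nn as nn
import torch.nn.functional as F


class UniModalNet(nn.Module):
    """One modality: an encoder followed by a linear classifier head."""
    def __init__(self, in_dim, feat_dim, num_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feat = self.encoder(x)
        return feat, self.head(feat)


class LateFusionNet(nn.Module):
    """Late fusion: per-modality encoders, features concatenated before one head."""
    def __init__(self, dims, feat_dim, num_classes):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, feat_dim), nn.ReLU()) for d in dims
        )
        self.head = nn.Linear(feat_dim * len(dims), num_classes)

    def forward(self, xs):
        feats = [enc(x) for enc, x in zip(self.encoders, xs)]
        return feats, self.head(torch.cat(feats, dim=-1))


def ume_logits(uni_models, xs):
    """Uni-Modal Ensemble (UME): ensemble independently trained uni-modal
    models at inference time (logit averaging is an assumption of this sketch)."""
    return torch.stack([m(x)[1] for m, x in zip(uni_models, xs)]).mean(dim=0)


def umt_loss(fusion_model, uni_teachers, xs, y, distill_weight=1.0):
    """Uni-Modal Teacher (UMT): task loss plus a per-modality distillation term
    pulling each fusion encoder's features toward the uni-modal teacher's
    features (computed without gradients)."""
    feats, logits = fusion_model(xs)
    loss = F.cross_entropy(logits, y)
    for teacher, feat, x in zip(uni_teachers, feats, xs):
        with torch.no_grad():
            t_feat, _ = teacher(x)
        loss = loss + distill_weight * F.mse_loss(feat, t_feat)
    return loss


if __name__ == "__main__":
    dims, feat_dim, num_classes = [32, 48], 64, 10
    xs = [torch.randn(8, d) for d in dims]
    y = torch.randint(0, num_classes, (8,))

    uni_models = [UniModalNet(d, feat_dim, num_classes) for d in dims]
    print(ume_logits(uni_models, xs).shape)            # UME prediction: (8, 10)

    fusion = LateFusionNet(dims, feat_dim, num_classes)
    print(umt_loss(fusion, uni_models, xs, y).item())  # UMT training loss
```

In practice the encoders would be modality-specific backbones (e.g., audio and video networks), and the choice between UME and UMT would follow the abstract's guiding strategy, i.e., how much the task relies on paired features versus uni-modal features.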

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-du23e,
  title     = {On Uni-Modal Feature Learning in Supervised Multi-Modal Learning},
  author    = {Du, Chenzhuang and Teng, Jiaye and Li, Tingle and Liu, Yichen and Yuan, Tianyuan and Wang, Yue and Yuan, Yang and Zhao, Hang},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {8632--8656},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/du23e/du23e.pdf},
  url       = {https://proceedings.mlr.press/v202/du23e.html}
}
Endnote
%0 Conference Paper
%T On Uni-Modal Feature Learning in Supervised Multi-Modal Learning
%A Chenzhuang Du
%A Jiaye Teng
%A Tingle Li
%A Yichen Liu
%A Tianyuan Yuan
%A Yue Wang
%A Yang Yuan
%A Hang Zhao
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-du23e
%I PMLR
%P 8632--8656
%U https://proceedings.mlr.press/v202/du23e.html
%V 202
APA
Du, C., Teng, J., Li, T., Liu, Y., Yuan, T., Wang, Y., Yuan, Y. & Zhao, H. (2023). On Uni-Modal Feature Learning in Supervised Multi-Modal Learning. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:8632-8656. Available from https://proceedings.mlr.press/v202/du23e.html.
