On Uni-Modal Feature Learning in Supervised Multi-Modal Learning

Chenzhuang Du; Jiaye Teng; Tingle Li; Yichen Liu; Tianyuan Yuan; Yue Wang; Yang Yuan; Hang Zhao

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:8632-8656, 2023.

Abstract

We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. Multi-modal models are expected to benefit from cross-modal interactions on the basis of ensuring uni-modal feature learning. However, recent supervised multi-modal late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality. We prove that this phenomenon does hurt the model’s generalization ability. To this end, we propose to choose a targeted late-fusion learning method for the given supervised multi-modal task from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-du23e,
  title =     {On Uni-Modal Feature Learning in Supervised Multi-Modal Learning},
  author =    {Du, Chenzhuang and Teng, Jiaye and Li, Tingle and Liu, Yichen and Yuan, Tianyuan and Wang, Yue and Yuan, Yang and Zhao, Hang},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages =     {8632--8656},
  year =      {2023},
  editor =    {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume =    {202},
  series =    {Proceedings of Machine Learning Research},
  month =     {23--29 Jul},
  publisher = {PMLR},
  pdf =       {https://proceedings.mlr.press/v202/du23e/du23e.pdf},
  url =       {https://proceedings.mlr.press/v202/du23e.html},
  abstract = 	 {We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. Multi-modal models are expected to benefit from cross-modal interactions on the basis of ensuring uni-modal feature learning. However, recent supervised multi-modal late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality. We prove that this phenomenon does hurt the model’s generalization ability. To this end, we propose to choose a targeted late-fusion learning method for the given supervised multi-modal task from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.}
}

Endnote

%0 Conference Paper
%T On Uni-Modal Feature Learning in Supervised Multi-Modal Learning
%A Chenzhuang Du
%A Jiaye Teng
%A Tingle Li
%A Yichen Liu
%A Tianyuan Yuan
%A Yue Wang
%A Yang Yuan
%A Hang Zhao
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-du23e
%I PMLR
%P 8632--8656
%U https://proceedings.mlr.press/v202/du23e.html
%V 202
%X We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. Multi-modal models are expected to benefit from cross-modal interactions on the basis of ensuring uni-modal feature learning. However, recent supervised multi-modal late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality. We prove that this phenomenon does hurt the model’s generalization ability. To this end, we propose to choose a targeted late-fusion learning method for the given supervised multi-modal task from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.

APA


Du, C., Teng, J., Li, T., Liu, Y., Yuan, T., Wang, Y., Yuan, Y. & Zhao, H. (2023). On Uni-Modal Feature Learning in Supervised Multi-Modal Learning. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:8632-8656. Available from https://proceedings.mlr.press/v202/du23e.html.
