Circulant-interactive Transformer with Dimension-aware Fusion for Multimodal Sentiment Analysis

Peizhu Gong, Jin Liu, Xiliang Zhang, Xingye Li, Zijun Yu
Proceedings of The 14th Asian Conference on Machine Learning, PMLR 189:391-406, 2023.

Abstract

Multimodal sentiment analysis (MSA) is gaining traction as a critical tool for understanding human behavior and enabling a wide range of applications. Since data of different modalities may lie in completely distinct spaces, it is very challenging to perform effective fusion and analysis of asynchronous multimodal streams. Most previous works focused on aligned fusion, which is impractical in real-world scenarios. The recent Multimodal Transformer (MulT) approach models the correlations between elements from different modalities in an unaligned manner. However, it collects temporal information with a self-attention Transformer, which is a sequence model, so interactions across distinct time steps are insufficient. In this paper, we propose the Circulant-interactive Transformer Network with dimension-aware fusion (CITN-DAF), which enables parallel computation over different modalities at different time steps and alleviates inter-modal temporal sensitivity while preserving intra-modal semantic order. By incorporating circulant matrices into the cross-modal attention mechanism, CITN-DAF aims to examine all conceivable interactions between vectors of different modalities. In addition, a dimension-aware fusion method is presented, which projects feature representations into different subspaces for in-depth fusion. We evaluate CITN-DAF on three commonly used sentiment analysis benchmarks: CMU-MOSEI, CMU-MOSI, and IEMOCAP. Extensive experimental results show that CITN-DAF is superior in cross-modal semantic interactions and outperforms state-of-the-art multimodal methods.
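The abstract names two mechanisms: circulant matrices inside cross-modal attention (so every dimension of a query vector can interact with every dimension of a key vector, not only the position-aligned ones) and a dimension-aware fusion that projects features into several subspaces. The sketch below is a minimal NumPy illustration of those two ideas, not the authors' implementation: the function names, the max-over-shifts score reduction, and the random per-subspace projections are all illustrative assumptions.

```python
import numpy as np

def circulant(v):
    """Circulant matrix of v: row i holds v cyclically shifted by i."""
    d = len(v)
    idx = (np.arange(d)[:, None] + np.arange(d)[None, :]) % d
    return v[idx]

def circulant_cross_attention(queries, keys, values):
    """Cross-modal attention where each query/key score is computed through
    the query's circulant expansion, so all pairwise dimension interactions
    are examined rather than only the aligned ones of a plain dot product."""
    scores = np.empty((queries.shape[0], keys.shape[0]))
    for i, q in enumerate(queries):
        # (d, d) matrix of all cyclic pairings, reduced by keeping the
        # strongest cyclic alignment per key (an illustrative choice).
        scores[i] = (circulant(q) @ keys.T).max(axis=0)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ values  # (T_query, d_value)

def dimension_aware_fusion(summaries, n_subspaces=4, sub_dim=8, seed=0):
    """Project each modality's summary vector into several subspaces and
    concatenate the per-subspace sums (an illustrative reading of
    'projects feature representations into different subspaces')."""
    rng = np.random.default_rng(seed)
    d = summaries[0].shape[0]
    projections = [rng.standard_normal((sub_dim, d)) / np.sqrt(d)
                   for _ in range(n_subspaces)]
    return np.concatenate([sum(P @ s for s in summaries) for P in projections])

# Toy usage: text queries attend over unaligned audio features.
text  = np.random.default_rng(1).standard_normal((5, 16))   # 5 time steps
audio = np.random.default_rng(2).standard_normal((9, 16))   # 9 steps, unaligned
attended = circulant_cross_attention(text, audio, audio)
fused = dimension_aware_fusion([attended.mean(axis=0), text.mean(axis=0)])
print(attended.shape, fused.shape)  # (5, 16) (32,)
```

Taking the max over cyclic shifts is used here because averaging over all shifts collapses to a product of coordinate sums and would make every score degenerate; the paper's actual reduction may differ.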

Cite this Paper


BibTeX
@InProceedings{pmlr-v189-gong23a,
  title     = {Circulant-interactive Transformer with Dimension-aware Fusion for Multimodal Sentiment Analysis},
  author    = {Gong, Peizhu and Liu, Jin and Zhang, Xiliang and Li, Xingye and Yu, Zijun},
  booktitle = {Proceedings of The 14th Asian Conference on Machine Learning},
  pages     = {391--406},
  year      = {2023},
  editor    = {Khan, Emtiyaz and Gonen, Mehmet},
  volume    = {189},
  series    = {Proceedings of Machine Learning Research},
  month     = {12--14 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v189/gong23a/gong23a.pdf},
  url       = {https://proceedings.mlr.press/v189/gong23a.html},
  abstract  = {Multimodal sentiment analysis (MSA) is gaining traction as a critical tool for understanding human behavior and enabling a wide range of applications. Since data of different modalities may lie in completely distinct spaces, it is very challenging to perform effective fusion and analysis of asynchronous multimodal streams. Most previous works focused on aligned fusion, which is impractical in real-world scenarios. The recent Multimodal Transformer (MulT) approach models the correlations between elements from different modalities in an unaligned manner. However, it collects temporal information with a self-attention Transformer, which is a sequence model, so interactions across distinct time steps are insufficient. In this paper, we propose the Circulant-interactive Transformer Network with dimension-aware fusion (CITN-DAF), which enables parallel computation over different modalities at different time steps and alleviates inter-modal temporal sensitivity while preserving intra-modal semantic order. By incorporating circulant matrices into the cross-modal attention mechanism, CITN-DAF aims to examine all conceivable interactions between vectors of different modalities. In addition, a dimension-aware fusion method is presented, which projects feature representations into different subspaces for in-depth fusion. We evaluate CITN-DAF on three commonly used sentiment analysis benchmarks: CMU-MOSEI, CMU-MOSI, and IEMOCAP. Extensive experimental results show that CITN-DAF is superior in cross-modal semantic interactions and outperforms state-of-the-art multimodal methods.}
}
Endnote
%0 Conference Paper %T Circulant-interactive Transformer with Dimension-aware Fusion for Multimodal Sentiment Analysis %A Peizhu Gong %A Jin Liu %A Xiliang Zhang %A Xingye Li %A Zijun Yu %B Proceedings of The 14th Asian Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2023 %E Emtiyaz Khan %E Mehmet Gonen %F pmlr-v189-gong23a %I PMLR %P 391--406 %U https://proceedings.mlr.press/v189/gong23a.html %V 189 %X Multimodal sentiment analysis (MSA) is gaining traction as a critical tool for understanding human behavior and enabling a wide range of applications. Since data of different modalities may lie in completely distinct spaces, it is very challenging to perform effective fusion and analysis of asynchronous multimodal streams. Most previous works focused on aligned fusion, which is impractical in real-world scenarios. The recent Multimodal Transformer (MulT) approach models the correlations between elements from different modalities in an unaligned manner. However, it collects temporal information with a self-attention Transformer, which is a sequence model, so interactions across distinct time steps are insufficient. In this paper, we propose the Circulant-interactive Transformer Network with dimension-aware fusion (CITN-DAF), which enables parallel computation over different modalities at different time steps and alleviates inter-modal temporal sensitivity while preserving intra-modal semantic order. By incorporating circulant matrices into the cross-modal attention mechanism, CITN-DAF aims to examine all conceivable interactions between vectors of different modalities. In addition, a dimension-aware fusion method is presented, which projects feature representations into different subspaces for in-depth fusion. We evaluate CITN-DAF on three commonly used sentiment analysis benchmarks: CMU-MOSEI, CMU-MOSI, and IEMOCAP. Extensive experimental results show that CITN-DAF is superior in cross-modal semantic interactions and outperforms state-of-the-art multimodal methods.
APA
Gong, P., Liu, J., Zhang, X., Li, X. & Yu, Z. (2023). Circulant-interactive Transformer with Dimension-aware Fusion for Multimodal Sentiment Analysis. Proceedings of The 14th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 189:391-406. Available from https://proceedings.mlr.press/v189/gong23a.html.
