Circulant-interactive Transformer with Dimension-aware Fusion for Multimodal Sentiment Analysis
Proceedings of the 14th Asian Conference on Machine Learning, PMLR 189:391-406, 2023.
Abstract
Multimodal sentiment analysis (MSA) is gaining
traction as a critical tool for understanding human
behavior and enabling a wide range of
applications. Since data of different modalities
might lie in completely distinct spaces, it is very
challenging to perform effective fusion and analysis
from asynchronous multimodal streams. Most previous work has focused on aligned fusion, which is impractical in real-world scenarios. The recent
Multimodal Transformer (MulT) approach attempts to model the correlations between elements from different modalities in an unaligned manner. However, it collects temporal information with a self-attention Transformer, which is a sequence model, so interactions across distinct time steps are insufficient. In this paper, we propose the Circulant-interactive Transformer
Network with dimension-aware fusion (CITN-DAF),
which enables parallel computation of different
modalities among different time steps and alleviates
inter-modal temporal sensitivity while preserving
intra-modal semantic order. By incorporating
circulant matrices into the cross-modal attention
mechanism, CITN-DAF aims to examine all
conceivable interactions between vectors of
different modalities. In addition, a dimension-aware
fusion method is presented, which projects feature
representations into different subspaces for an
in-depth fusion. We evaluate CITN-DAF on three
commonly used sentiment analysis benchmarks
including CMU-MOSEI, CMU-MOSI and IEMOCAP. Extensive
experimental results reveal that CITN-DAF is
superior in cross-modal semantic interactions and
outperforms state-of-the-art multimodal methods.
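To make the circulant-interaction idea concrete, the sketch below shows one plausible reading of it, under our own assumptions: a circulant matrix is built from cyclic shifts of a vector, and a cross-modal attention step augments the source modality's keys and values with all cyclic time shifts, so every query step can score every relative alignment of the other modality in parallel. The function names (`circulant`, `circulant_cross_attention`) and the exact combination rule are illustrative, not the paper's formulation.

```python
import numpy as np

def circulant(first_row):
    """Circulant matrix: row k is first_row cyclically shifted by k."""
    n = len(first_row)
    return np.stack([np.roll(first_row, k) for k in range(n)])

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def circulant_cross_attention(q, k, v):
    """Hypothetical circulant cross-modal attention sketch.

    q: (Tq, d) features of the target modality.
    k, v: (Tk, d) features of the source modality.
    Keys/values are expanded with every cyclic shift along the time
    axis, so each query step attends over all relative alignments of
    the source sequence rather than a single fixed time order.
    """
    Tk, d = k.shape
    # Stack all Tk cyclic time shifts of the keys and values: (Tk*Tk, d).
    k_shifts = np.concatenate([np.roll(k, s, axis=0) for s in range(Tk)], axis=0)
    v_shifts = np.concatenate([np.roll(v, s, axis=0) for s in range(Tk)], axis=0)
    # Standard scaled dot-product attention over the expanded set.
    scores = q @ k_shifts.T / np.sqrt(d)
    return softmax(scores) @ v_shifts
```

Because every cyclic shift of the source sequence is scored, the output for a given query step no longer depends on the absolute temporal offset between the two streams, which is one way to read the abstract's claim of alleviating inter-modal temporal sensitivity while the shifts themselves preserve each modality's internal order.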