IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation

Kai Li; Runxuan Yang; Fuchun Sun; Xiaolin Hu

IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation

Kai Li, Runxuan Yang, Fuchun Sun, Xiaolin Hu

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:29181-29200, 2024.

Abstract

Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, they predominantly focus on multi-modal fusion at a single temporal scale of auditory and visual features without employing selective attention mechanisms, which is in sharp contrast with the brain. To address this, We propose a novel model called intra- and inter-attention network (IIANet), which leverages the attention mechanism for efficient audio-visual feature fusion. IIANet consists of two types of attention blocks: intra-attention (IntraA) and inter-attention (InterA) blocks, where the InterA blocks are distributed at the top, middle and bottom of IIANet. Heavily inspired by the way how human brain selectively focuses on relevant content at various temporal scales, these blocks maintain the ability to learn modality-specific features and enable the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of IIANet, outperforming previous state-of-the-art methods while maintaining comparable inference time. In particular, the fast version of IIANet (IIANet-fast) has only 7% of CTCNet’s MACs and is 40% faster than CTCNet on CPUs while achieving better separation quality, showing the great potential of attention mechanism for efficient and effective multimodal fusion.

Cite this Paper

BibTeX

@InProceedings{pmlr-v235-li24cf,
  title = 	 {{IIAN}et: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation},
  author =       {Li, Kai and Yang, Runxuan and Sun, Fuchun and Hu, Xiaolin},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {29181--29200},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/li24cf/li24cf.pdf},
  url = 	 {https://proceedings.mlr.press/v235/li24cf.html},
  abstract = 	 {Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, they predominantly focus on multi-modal fusion at a single temporal scale of auditory and visual features without employing selective attention mechanisms, which is in sharp contrast with the brain. To address this, We propose a novel model called intra- and inter-attention network (IIANet), which leverages the attention mechanism for efficient audio-visual feature fusion. IIANet consists of two types of attention blocks: intra-attention (IntraA) and inter-attention (InterA) blocks, where the InterA blocks are distributed at the top, middle and bottom of IIANet. Heavily inspired by the way how human brain selectively focuses on relevant content at various temporal scales, these blocks maintain the ability to learn modality-specific features and enable the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of IIANet, outperforming previous state-of-the-art methods while maintaining comparable inference time. In particular, the fast version of IIANet (IIANet-fast) has only 7% of CTCNet’s MACs and is 40% faster than CTCNet on CPUs while achieving better separation quality, showing the great potential of attention mechanism for efficient and effective multimodal fusion.}
}

Endnote

%0 Conference Paper
%T IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
%A Kai Li
%A Runxuan Yang
%A Fuchun Sun
%A Xiaolin Hu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp	
%F pmlr-v235-li24cf
%I PMLR
%P 29181--29200
%U https://proceedings.mlr.press/v235/li24cf.html
%V 235
%X Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, they predominantly focus on multi-modal fusion at a single temporal scale of auditory and visual features without employing selective attention mechanisms, which is in sharp contrast with the brain. To address this, We propose a novel model called intra- and inter-attention network (IIANet), which leverages the attention mechanism for efficient audio-visual feature fusion. IIANet consists of two types of attention blocks: intra-attention (IntraA) and inter-attention (InterA) blocks, where the InterA blocks are distributed at the top, middle and bottom of IIANet. Heavily inspired by the way how human brain selectively focuses on relevant content at various temporal scales, these blocks maintain the ability to learn modality-specific features and enable the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of IIANet, outperforming previous state-of-the-art methods while maintaining comparable inference time. In particular, the fast version of IIANet (IIANet-fast) has only 7% of CTCNet’s MACs and is 40% faster than CTCNet on CPUs while achieving better separation quality, showing the great potential of attention mechanism for efficient and effective multimodal fusion.

APA

Li, K., Yang, R., Sun, F. & Hu, X.. (2024). IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:29181-29200 Available from https://proceedings.mlr.press/v235/li24cf.html.

IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation

Abstract

Cite this Paper

Related Material