Less is Enough: Adapting Pre-trained Vision Transformers for Audio-Visual Speaker Verification

Gnana Praveen Rajasekhar, Jahangir Alam
Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, PMLR 262:554-563, 2024.

Abstract

Speaker verification has achieved significant performance improvements using sophisticated deep learning architectures specialized for speech signals, together with robust loss functions. Recently, the fusion of faces and voices has received considerable attention, as the two modalities provide complementary information and therefore have the potential to outperform systems based on speech signals alone. Following the massive success of Vision Transformers (ViTs) in computer vision, ViTs have also been explored for multimodal learning. In this work, we investigate the potential of ViTs pre-trained on visual data for audio-visual speaker verification. To cope with the challenges of large-scale training, we introduce Latent Audio-Visual Vision Transformer (LAVViT) adapters, which exploit existing models pre-trained on visual data by training only the adapter parameters, without fine-tuning the original parameters of the pre-trained model. The LAVViT adapters are injected into every layer of the ViT architecture to fuse the audio and visual modalities effectively through a small set of latent tokens, thereby avoiding the quadratic computational cost of cross-attention across modalities. The proposed approach is evaluated on the VoxCeleb1 dataset and shows promising performance using only a small number of trainable parameters.
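
The abstract does not spell out the adapter internals, but the latent-token fusion idea can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical Perceiver-style bottleneck: a handful of learnable latent tokens attend over the concatenated audio and visual token sequences, and each modality then attends back to the fused latents, keeping attention cost linear in the number of modality tokens rather than quadratic. The class name LatentFusionAdapter, the dimensions, and the single shared read-back attention are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class LatentFusionAdapter(nn.Module):
    """Hypothetical latent-token fusion adapter (illustrative, not the authors' code).
    A small set of learnable latents attends over the concatenated audio and visual
    tokens, and each modality attends back to the fused latents, so attention cost is
    O(L * (Na + Nv)) with L << Na + Nv, instead of O(Na * Nv) full cross-attention."""

    def __init__(self, dim=768, num_latents=8, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.collect = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens: (B, Na, dim), visual_tokens: (B, Nv, dim)
        batch = audio_tokens.size(0)
        fused = torch.cat([audio_tokens, visual_tokens], dim=1)      # (B, Na+Nv, dim)
        lat = self.latents.unsqueeze(0).expand(batch, -1, -1)        # (B, L, dim)
        lat, _ = self.collect(lat, fused, fused)                     # latents gather cross-modal context
        lat = self.norm(lat)
        a_out, _ = self.distribute(audio_tokens, lat, lat)           # each modality reads the fused latents back
        v_out, _ = self.distribute(visual_tokens, lat, lat)
        return audio_tokens + a_out, visual_tokens + v_out           # residual adapter outputs

# Only the adapter parameters would be trained; the pre-trained ViT backbone stays frozen, e.g.:
# for p in vit_backbone.parameters():
#     p.requires_grad = False

In a full model, one such adapter would sit alongside each frozen transformer block, so only the adapters and the verification head contribute trainable parameters.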

Cite this Paper


BibTeX
@InProceedings{pmlr-v262-praveen-rajasekhar24a,
  title     = {Less is Enough: Adapting Pre-trained Vision Transformers for Audio-Visual Speaker Verification},
  author    = {Praveen Rajasekhar, Gnana and Alam, Jahangir},
  booktitle = {Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop},
  pages     = {554--563},
  year      = {2024},
  editor    = {Rezagholizadeh, Mehdi and Passban, Peyman and Samiee, Soheila and Partovi Nia, Vahid and Cheng, Yu and Deng, Yue and Liu, Qun and Chen, Boxing},
  volume    = {262},
  series    = {Proceedings of Machine Learning Research},
  month     = {14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v262/main/assets/praveen-rajasekhar24a/praveen-rajasekhar24a.pdf},
  url       = {https://proceedings.mlr.press/v262/praveen-rajasekhar24a.html},
  abstract  = {Speaker Verification has achieved significant improvement in performance using sophisticated deep learning architectures, specialized for speech signals as well as robust loss functions. Recently, the fusion of faces and voices received a lot of attention as they offer complementary relationship with each other, which has the potential to outperform systems with only speech signals. Inspired by the massive success of Vision Transformers (ViTs) in computer vision, ViTs have also been explored for multimodal learning. In this work, we have investigated the potential of ViTs, pre-trained on visual data, for audio-visual speaker verification. To cope with the challenges of large-scale training, we introduce the Latent Audio-Visual Vision Transformer (LAVViT) adapters, where we exploit the existing pre-trained models on visual data by training only the parameters of LAVViT adapters, without fine-tuning the original parameters of the pre-trained models. The LAVViT adapters are injected into every layer of the ViT architecture to effectively fuse the audio and visual modalities using a small set of latent tokens, thereby avoiding the quadratic computational cost of cross-attention across the modalities. The proposed approach has been evaluated on the Voxceleb1 dataset and shows promising performance using only a few trainable parameters.}
}
Endnote
%0 Conference Paper
%T Less is Enough: Adapting Pre-trained Vision Transformers for Audio-Visual Speaker Verification
%A Gnana Praveen Rajasekhar
%A Jahangir Alam
%B Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop
%C Proceedings of Machine Learning Research
%D 2024
%E Mehdi Rezagholizadeh
%E Peyman Passban
%E Soheila Samiee
%E Vahid Partovi Nia
%E Yu Cheng
%E Yue Deng
%E Qun Liu
%E Boxing Chen
%F pmlr-v262-praveen-rajasekhar24a
%I PMLR
%P 554--563
%U https://proceedings.mlr.press/v262/praveen-rajasekhar24a.html
%V 262
%X Speaker Verification has achieved significant improvement in performance using sophisticated deep learning architectures, specialized for speech signals as well as robust loss functions. Recently, the fusion of faces and voices received a lot of attention as they offer complementary relationship with each other, which has the potential to outperform systems with only speech signals. Inspired by the massive success of Vision Transformers (ViTs) in computer vision, ViTs have also been explored for multimodal learning. In this work, we have investigated the potential of ViTs, pre-trained on visual data, for audio-visual speaker verification. To cope with the challenges of large-scale training, we introduce the Latent Audio-Visual Vision Transformer (LAVViT) adapters, where we exploit the existing pre-trained models on visual data by training only the parameters of LAVViT adapters, without fine-tuning the original parameters of the pre-trained models. The LAVViT adapters are injected into every layer of the ViT architecture to effectively fuse the audio and visual modalities using a small set of latent tokens, thereby avoiding the quadratic computational cost of cross-attention across the modalities. The proposed approach has been evaluated on the Voxceleb1 dataset and shows promising performance using only a few trainable parameters.
APA
Praveen Rajasekhar, G. & Alam, J. (2024). Less is Enough: Adapting Pre-trained Vision Transformers for Audio-Visual Speaker Verification. Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, in Proceedings of Machine Learning Research 262:554-563. Available from https://proceedings.mlr.press/v262/praveen-rajasekhar24a.html.
