AMG-AVSR: Adaptive Modality Guidance for Audio-Visual Speech Recognition via Progressive Feature Enhancement

Zhishuo Zhao, Dongyue Guo, Wenjie Ou, Hong Liu, Yi Lin
Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:952-967, 2025.

Abstract

Audio-Visual Speech Recognition (AVSR) identifies spoken words by analyzing both lip movements and auditory signals. Compared to Automatic Speech Recognition (ASR), AVSR is more robust in noisy environments because it can draw on two modalities. However, the inherent differences between these modalities pose a challenge: how to account for their disparities while exploiting their complementary information. To address this, we propose AMG-AVSR, a model that uses a two-stage curriculum learning strategy and incorporates a feature compression and recovery mechanism. By letting the modalities guide each other according to their characteristics in different scenarios, the model extracts refined features from the audio-visual data, improving recognition performance in both clean and noisy environments. Compared to the baseline AV-HuBERT, AMG-AVSR performs better on the LRS2 dataset under both clean and noisy conditions, achieving a word error rate (WER) of 2.9% on clean speech and significantly reducing WER relative to previous methods under various noise conditions.
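The abstract describes the feature compression and recovery mechanism only at a high level. As a rough illustration of the general idea (a minimal sketch, not the paper's actual architecture), the toy PyTorch module below compresses concatenated audio-visual features into a low-dimensional bottleneck and then recovers them; the class name, dimensions, residual connection, and random inputs are all assumptions made for illustration only.

# Hypothetical sketch of a compression-and-recovery bottleneck over fused
# audio-visual features; not the AMG-AVSR implementation.
import torch
import torch.nn as nn

class CompressRecoverFusion(nn.Module):
    def __init__(self, audio_dim=768, video_dim=768, bottleneck_dim=256):
        super().__init__()
        fused_dim = audio_dim + video_dim
        # Compression: project concatenated audio-visual features into a
        # low-dimensional bottleneck.
        self.compress = nn.Linear(fused_dim, bottleneck_dim)
        # Recovery: expand back to the original fused dimensionality.
        self.recover = nn.Linear(bottleneck_dim, fused_dim)
        self.act = nn.ReLU()

    def forward(self, audio_feats, video_feats):
        # audio_feats, video_feats: (batch, time, dim), assumed frame-aligned.
        fused = torch.cat([audio_feats, video_feats], dim=-1)
        compressed = self.act(self.compress(fused))
        recovered = self.recover(compressed)
        # Residual connection keeps the original information while the
        # bottleneck encourages a compact joint representation.
        return fused + recovered

# Toy usage with random tensors standing in for frame-level AV features.
audio = torch.randn(2, 50, 768)
video = torch.randn(2, 50, 768)
out = CompressRecoverFusion()(audio, video)
print(out.shape)  # torch.Size([2, 50, 1536])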

Cite this Paper


BibTeX
@InProceedings{pmlr-v260-zhao25a,
  title     = {{AMG-AVSR}: {A}daptive Modality Guidance for Audio-Visual Speech Recognition via Progressive Feature Enhancement},
  author    = {Zhao, Zhishuo and Guo, Dongyue and Ou, Wenjie and Liu, Hong and Lin, Yi},
  booktitle = {Proceedings of the 16th Asian Conference on Machine Learning},
  pages     = {952--967},
  year      = {2025},
  editor    = {Nguyen, Vu and Lin, Hsuan-Tien},
  volume    = {260},
  series    = {Proceedings of Machine Learning Research},
  month     = {05--08 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v260/main/assets/zhao25a/zhao25a.pdf},
  url       = {https://proceedings.mlr.press/v260/zhao25a.html},
  abstract  = {Audio-Visual Speech Recognition (AVSR) is a task that identifies spoken words by analyzing both lip movements and auditory signals. Compared to Automatic Speech Recognition (ASR), AVSR demonstrates greater robustness in noisy environments due to the support of dual modalities. However, the inherent differences between these modalities present a challenge: effectively accounting for their disparities and leveraging their complementary information to extract useful information for AVSR. To address this, we propose the AMG-AVSR model, which utilizes a two-stage curriculum learning strategy and incorporates a feature compression and recovery mechanism. By leveraging the characteristics of different modalities in various scenarios to guide each other, the model extracts refined features from audio-visual data, thereby enhancing recognition performance in both clean and noisy environments. Compared to the baseline model AV-HuBERT, AMG-AVSR demonstrates superior performance on the LRS2 dataset in both noisy and clean environments. AMG-AVSR achieves a word error rate (WER) of 2.9% under clean speech conditions. In various noisy conditions, AMG-AVSR shows a significant reduction in WER compared to previous methods.}
}
Endnote
%0 Conference Paper
%T AMG-AVSR: Adaptive Modality Guidance for Audio-Visual Speech Recognition via Progressive Feature Enhancement
%A Zhishuo Zhao
%A Dongyue Guo
%A Wenjie Ou
%A Hong Liu
%A Yi Lin
%B Proceedings of the 16th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Vu Nguyen
%E Hsuan-Tien Lin
%F pmlr-v260-zhao25a
%I PMLR
%P 952--967
%U https://proceedings.mlr.press/v260/zhao25a.html
%V 260
%X Audio-Visual Speech Recognition (AVSR) is a task that identifies spoken words by analyzing both lip movements and auditory signals. Compared to Automatic Speech Recognition (ASR), AVSR demonstrates greater robustness in noisy environments due to the support of dual modalities. However, the inherent differences between these modalities present a challenge: effectively accounting for their disparities and leveraging their complementary information to extract useful information for AVSR. To address this, we propose the AMG-AVSR model, which utilizes a two-stage curriculum learning strategy and incorporates a feature compression and recovery mechanism. By leveraging the characteristics of different modalities in various scenarios to guide each other, the model extracts refined features from audio-visual data, thereby enhancing recognition performance in both clean and noisy environments. Compared to the baseline model AV-HuBERT, AMG-AVSR demonstrates superior performance on the LRS2 dataset in both noisy and clean environments. AMG-AVSR achieves a word error rate (WER) of 2.9% under clean speech conditions. In various noisy conditions, AMG-AVSR shows a significant reduction in WER compared to previous methods.
APA
Zhao, Z., Guo, D., Ou, W., Liu, H. & Lin, Y. (2025). AMG-AVSR: Adaptive Modality Guidance for Audio-Visual Speech Recognition via Progressive Feature Enhancement. Proceedings of the 16th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 260:952-967. Available from https://proceedings.mlr.press/v260/zhao25a.html.

Related Material

Download PDF: https://raw.githubusercontent.com/mlresearch/v260/main/assets/zhao25a/zhao25a.pdf