On Learning Frequency-Instance Correlations by Model-Agnostic Training for Synthetic Speech Detection

Zining Wang, Lijian Gao, Jialin Zhang, Qirong Mao
Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:1192-1207, 2025.

Abstract

The goal of Synthetic Speech Detection (SSD) is to detect spoofed speech synthesized by text-to-speech and voice conversion systems. Most existing SSD methods focus only on mining frequency-wise dependency by customizing frequency-aggregation modules within SSD models. However, instance-wise dependency, which is critical for identifying synthetic speech from a global view, is usually under-explored. In this paper, we propose a novel model-agnostic training strategy for SSD that exploits both local (frequency-wise) and global (instance-wise) contexts; it does not rely on a customized architecture and can be flexibly integrated into previous SSD models. Specifically, we propose an inter-frequency correlation module that captures the local context by reconstructing masked frequency information from the unmasked frequency context. Meanwhile, an inter-instance correlation module explores the global context among different instances by promoting intra-class compactness and inter-class dispersion in the latent space. These two complementary modules operate from distinct contextual perspectives, leading to improvements in SSD performance. Extensive experiments show that our method significantly improves the performance of two state-of-the-art models on the ASVspoof 2019 and ASVspoof 2021 datasets.
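The sketch below is a minimal, hypothetical illustration (in PyTorch) of the two auxiliary objectives described in the abstract; tensor shapes, function names, the masking ratio, and the exact loss formulations are assumptions for illustration, not the authors' implementation. The first loss masks random frequency bins and reconstructs them from the unmasked context (inter-frequency correlation); the second pulls same-class utterance embeddings together and pushes different-class embeddings apart (inter-instance correlation). In a model-agnostic setup, both terms would simply be added, with weighting coefficients, to the base SSD model's classification loss.

# Hypothetical sketch of the two auxiliary objectives; not the authors' code.
import torch
import torch.nn.functional as F


def masked_frequency_reconstruction_loss(spec, encoder, decoder, mask_ratio=0.3):
    """Inter-frequency correlation (local context): mask a random subset of
    frequency bins and reconstruct them from the unmasked bins.
    spec: (batch, freq_bins, time) spectrogram-like features;
    encoder/decoder: assumed shape-preserving modules."""
    batch, freq_bins, _ = spec.shape
    num_mask = max(1, int(mask_ratio * freq_bins))
    mask = torch.zeros(batch, freq_bins, 1, dtype=torch.bool, device=spec.device)
    for b in range(batch):  # mask different bins for each utterance
        idx = torch.randperm(freq_bins, device=spec.device)[:num_mask]
        mask[b, idx, 0] = True
    masked_spec = spec.masked_fill(mask, 0.0)
    recon = decoder(encoder(masked_spec))            # same shape as spec
    full_mask = mask.expand_as(spec)
    # Only the masked frequency bins contribute to the reconstruction loss.
    return F.mse_loss(recon[full_mask], spec[full_mask])


def inter_instance_correlation_loss(emb, labels, temperature=0.1):
    """Inter-instance correlation (global context): encourage intra-class
    compactness and inter-class dispersion with a supervised
    contrastive-style loss over utterance embeddings.
    emb: (batch, dim) embeddings; labels: (batch,) bonafide/spoof labels."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t() / temperature                # pairwise cosine similarity
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    logits = sim.masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos.sum(dim=1)
    valid = pos_counts > 0                           # rows with at least one positive
    loss = -(log_prob * pos).sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()


# Model-agnostic usage (hypothetical): add both terms to the base SSD loss.
# total_loss = ce_loss + lambda_freq * masked_frequency_reconstruction_loss(...) \
#            + lambda_inst * inter_instance_correlation_loss(...)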

Cite this Paper


BibTeX
@InProceedings{pmlr-v260-wang25g,
  title     = {On Learning Frequency-Instance Correlations by Model-Agnostic Training for Synthetic Speech Detection},
  author    = {Wang, Zining and Gao, Lijian and Zhang, Jialin and Mao, Qirong},
  booktitle = {Proceedings of the 16th Asian Conference on Machine Learning},
  pages     = {1192--1207},
  year      = {2025},
  editor    = {Nguyen, Vu and Lin, Hsuan-Tien},
  volume    = {260},
  series    = {Proceedings of Machine Learning Research},
  month     = {05--08 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v260/main/assets/wang25g/wang25g.pdf},
  url       = {https://proceedings.mlr.press/v260/wang25g.html},
  abstract  = {The goal of Synthetic Speech Detection (SSD) is to detect spoofing speech synthesized by text-to-speech and voice conversion. Most existing SSD methods focus only on mining frequency-wise dependency by customizing frequency-aggregation modules in SSD models. However, the instance-wise dependency is usually under-explored, which is critical for identifying the synthetic speech in a global view. In this paper, we propose a novel model-agnostic training strategy for SSD that exploits both local (frequency-wise) and global (instance-wise) contexts, which do not rely on a customized architecture and can be flexibly integrated into previous SSD models. Specifically, we propose an inter-frequency correlation module to capture the local context by reconstructing the masked frequency information from the unmasked frequency context. Meanwhile, an inter-instance correlation module is performed to explore the global context among different instances by promoting intra-class compactness and inter-class dispersion in the latent space. These two complementary modules operate from distinct contextual perspectives, leading to improvements in SSD performance. Extensive experiments show that our method significantly improves the performance of two state-of-the-art models on the 2019 dataset and 2021 dataset of ASVspoof.}
}
Endnote
%0 Conference Paper
%T On Learning Frequency-Instance Correlations by Model-Agnostic Training for Synthetic Speech Detection
%A Zining Wang
%A Lijian Gao
%A Jialin Zhang
%A Qirong Mao
%B Proceedings of the 16th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Vu Nguyen
%E Hsuan-Tien Lin
%F pmlr-v260-wang25g
%I PMLR
%P 1192--1207
%U https://proceedings.mlr.press/v260/wang25g.html
%V 260
%X The goal of Synthetic Speech Detection (SSD) is to detect spoofing speech synthesized by text-to-speech and voice conversion. Most existing SSD methods focus only on mining frequency-wise dependency by customizing frequency-aggregation modules in SSD models. However, the instance-wise dependency is usually under-explored, which is critical for identifying the synthetic speech in a global view. In this paper, we propose a novel model-agnostic training strategy for SSD that exploits both local (frequency-wise) and global (instance-wise) contexts, which do not rely on a customized architecture and can be flexibly integrated into previous SSD models. Specifically, we propose an inter-frequency correlation module to capture the local context by reconstructing the masked frequency information from the unmasked frequency context. Meanwhile, an inter-instance correlation module is performed to explore the global context among different instances by promoting intra-class compactness and inter-class dispersion in the latent space. These two complementary modules operate from distinct contextual perspectives, leading to improvements in SSD performance. Extensive experiments show that our method significantly improves the performance of two state-of-the-art models on the 2019 dataset and 2021 dataset of ASVspoof.
APA
Wang, Z., Gao, L., Zhang, J., & Mao, Q. (2025). On Learning Frequency-Instance Correlations by Model-Agnostic Training for Synthetic Speech Detection. Proceedings of the 16th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 260:1192-1207. Available from https://proceedings.mlr.press/v260/wang25g.html.
