SpcNet: Speaker Validation Model Based on Self-Calibrated Convolution

Zhang Xia, Wu Guobo, Liu Qian
Proceedings of 2024 International Conference on Machine Learning and Intelligent Computing, PMLR 245:16-24, 2024.

Abstract

Convolutional speaker representation networks have demonstrated outstanding performance on the speaker verification (SV) task and have become one of the most widely adopted architectures in this field. However, convolution-based networks have limitations, notably the fixed-size kernel of the standard convolution operation, which makes it difficult to capture long-range time-frequency and channel dependencies in speech features and limits the network’s ability to extract speaker representations. To overcome this issue, we explore several alternative approaches. First, we propose an enhanced self-calibrating convolutional kernel that adaptively constructs long-range time-frequency and channel dependencies around each time-frequency position, integrating richer information and significantly enhancing the network’s capacity to learn representations. Second, we adjust the network structure to improve the extraction of speaker feature representations. We refer to the resulting model as SpcNet. We evaluate SpcNet on the VoxCeleb1 and VoxCeleb2 datasets; comprehensive experiments show that it significantly improves the Equal Error Rate (EER).
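The page gives no implementation details, but the mechanism the abstract describes — modulating each time-frequency position with context gathered over a much larger region — can be sketched roughly. Below is a minimal single-channel NumPy toy, assuming the common self-calibrated-convolution scheme from the broader literature (a downsampled context branch whose output gates the full-resolution response through a sigmoid); all function names and the pooling factor `r` are illustrative, not taken from the paper.

```python
import numpy as np

def avg_pool(x, r):
    # Downsample an (H, W) time-frequency map by factor r via average pooling.
    H, W = x.shape
    return x[:H - H % r, :W - W % r].reshape(H // r, r, W // r, r).mean(axis=(1, 3))

def upsample(x, r, shape):
    # Nearest-neighbor upsampling, edge-padded back to the original shape.
    y = np.repeat(np.repeat(x, r, axis=0), r, axis=1)
    pad_h, pad_w = shape[0] - y.shape[0], shape[1] - y.shape[1]
    return np.pad(y, ((0, pad_h), (0, pad_w)), mode='edge')

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def self_calibrated_response(x, r=4):
    """Toy self-calibration gate on a single-channel time-frequency map.

    Low-resolution context (average pooling by r) is projected back to full
    resolution and used as a sigmoid gate, so each position is modulated by
    a receptive field roughly r times larger than the raw kernel's.
    """
    context = upsample(avg_pool(x, r), r, x.shape)
    return x * sigmoid(x + context)
```

In the full scheme a learned convolution would follow both the pooling and the gating; this sketch keeps only the calibration step that enlarges the effective receptive field.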

Cite this Paper


BibTeX
@InProceedings{pmlr-v245-xia24a,
  title     = {SpcNet: Speaker Validation Model Based on Self-Calibrated Convolution},
  author    = {Zhang, Xia and Wu, Guobo and Liu, Qian},
  booktitle = {Proceedings of 2024 International Conference on Machine Learning and Intelligent Computing},
  pages     = {16--24},
  year      = {2024},
  editor    = {Zeng, Nianyin and Pachori, Ram Bilas},
  volume    = {245},
  series    = {Proceedings of Machine Learning Research},
  month     = {26--28 Apr},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v245/main/assets/xia24a/xia24a.pdf},
  url       = {https://proceedings.mlr.press/v245/xia24a.html},
  abstract  = {Convolutional speaker representation networks have demonstrated outstanding performance on the speaker verification (SV) task and have become one of the most widely adopted architectures in this field. However, convolution-based networks have limitations, notably the fixed-size kernel of the standard convolution operation, which makes it difficult to capture long-range time-frequency and channel dependencies in speech features and limits the network’s ability to extract speaker representations. To overcome this issue, we explore several alternative approaches. First, we propose an enhanced self-calibrating convolutional kernel that adaptively constructs long-range time-frequency and channel dependencies around each time-frequency position, integrating richer information and significantly enhancing the network’s capacity to learn representations. Second, we adjust the network structure to improve the extraction of speaker feature representations. We refer to the resulting model as SpcNet. We evaluate SpcNet on the VoxCeleb1 and VoxCeleb2 datasets; comprehensive experiments show that it significantly improves the Equal Error Rate (EER).}
}
Endnote
%0 Conference Paper
%T SpcNet: Speaker Validation Model Based on Self-Calibrated Convolution
%A Zhang Xia
%A Wu Guobo
%A Liu Qian
%B Proceedings of 2024 International Conference on Machine Learning and Intelligent Computing
%C Proceedings of Machine Learning Research
%D 2024
%E Zeng Nianyin
%E Ram Bilas Pachori
%F pmlr-v245-xia24a
%I PMLR
%P 16--24
%U https://proceedings.mlr.press/v245/xia24a.html
%V 245
%X Convolutional speaker representation networks have demonstrated outstanding performance on the speaker verification (SV) task and have become one of the most widely adopted architectures in this field. However, convolution-based networks have limitations, notably the fixed-size kernel of the standard convolution operation, which makes it difficult to capture long-range time-frequency and channel dependencies in speech features and limits the network’s ability to extract speaker representations. To overcome this issue, we explore several alternative approaches. First, we propose an enhanced self-calibrating convolutional kernel that adaptively constructs long-range time-frequency and channel dependencies around each time-frequency position, integrating richer information and significantly enhancing the network’s capacity to learn representations. Second, we adjust the network structure to improve the extraction of speaker feature representations. We refer to the resulting model as SpcNet. We evaluate SpcNet on the VoxCeleb1 and VoxCeleb2 datasets; comprehensive experiments show that it significantly improves the Equal Error Rate (EER).
APA
Zhang, X., Wu, G., & Liu, Q. (2024). SpcNet: Speaker Validation Model Based on Self-Calibrated Convolution. Proceedings of 2024 International Conference on Machine Learning and Intelligent Computing, in Proceedings of Machine Learning Research 245:16-24. Available from https://proceedings.mlr.press/v245/xia24a.html.

Related Material