SpcNet: Speaker Verification Model Based on Self-Calibrated Convolution
Proceedings of 2024 International Conference on Machine Learning and Intelligent Computing, PMLR 245:16-24, 2024.
Abstract
Speaker representation networks built on convolutional modules have demonstrated outstanding performance in the speaker verification (SV) task and have become one of the most widely adopted architectures in this field. However, convolution-based structures have inherent limitations, notably the fixed-size kernel of the standard convolution operation. This makes it difficult to capture long-range time-frequency and channel dependencies in speech features, limiting the network's ability to extract speaker representations. To overcome this issue, we explore several approaches. First, we propose an enhanced self-calibrated convolution kernel that adaptively constructs long-range time-frequency and channel dependencies around each time-frequency position, integrating richer contextual information and significantly enhancing the network's capacity to learn representations. Second, we adjust the network structure to improve the extraction of speaker feature representations. We refer to the resulting model as SpcNet. We evaluate SpcNet on the VoxCeleb1 and VoxCeleb2 datasets, and comprehensive experiments show that it significantly reduces the Equal Error Rate (EER).
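The abstract does not give the exact formulation of SpcNet's module, but the self-calibrated convolution it builds on (SCConv, Liu et al., CVPR 2020) computes a gate from a downsampled view of the input so that each position is modulated by longer-range context than the kernel's receptive field. A minimal NumPy sketch of that calibration branch, with illustrative (not the authors') shapes and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    """Naive 'same'-padded 2-D convolution. x: (Cin, H, W), w: (Cout, Cin, k, k)."""
    cout, cin, k, _ = w.shape
    pad = k // 2
    H, W = x.shape[1:]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((cout, H, W))
    for o in range(cout):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o])
    return out

def avg_pool(x, r):
    """Average-pool each channel by factor r (H, W assumed divisible by r)."""
    C, H, W = x.shape
    return x.reshape(C, H // r, r, W // r, r).mean(axis=(2, 4))

def upsample(x, r):
    """Nearest-neighbour upsampling by factor r."""
    return x.repeat(r, axis=1).repeat(r, axis=2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def self_calibrated_conv(x, w2, w3, w4, r=2):
    """Self-calibration branch of SCConv:
    a low-resolution context map gates the full-resolution response,
    so each time-frequency position is weighted by surrounding context."""
    t = upsample(conv2d(avg_pool(x, r), w2), r)   # context from downsampled view
    gate = sigmoid(x + t)                         # per-position calibration weights
    y = conv2d(x, w3) * gate                      # gated full-resolution response
    return conv2d(y, w4)                          # final mixing convolution

# Toy input: 4 channels over an 8x8 time-frequency patch (illustrative sizes).
C, H, W, k = 4, 8, 8, 3
x = rng.standard_normal((C, H, W))
w2, w3, w4 = (rng.standard_normal((C, C, k, k)) * 0.1 for _ in range(3))
y = self_calibrated_conv(x, w2, w3, w4)
print(y.shape)  # output keeps the input's spatial size: (4, 8, 8)
```

Because the gate is derived from an r-times-downsampled view, each output position is influenced by a neighbourhood roughly r times larger than the bare kernel, which is the mechanism the abstract appeals to for long-range time-frequency dependencies.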