Latent-RQ: Enhancing Speech Pre-training with Latent Representations and Random Quantization
Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), PMLR 312:78-93, 2026.
Abstract
Random quantization is a simple yet effective strategy for speech self-supervised pre-training, producing strong encoder representations for a range of downstream tasks. However, existing methods such as BEST-RQ rely on Mel spectrograms, which are low-level acoustic features, as quantizer inputs; this may hinder convergence and limit target quality. We propose Latent-RQ, an extension that replaces direct Mel inputs with richer latent representations extracted from a pre-trained encoder. To further improve target quality, the target encoder is periodically updated during training. Latent-RQ achieves consistent improvements on the SUPERB benchmark, particularly for speech recognition and speaker identification, and reaches performance comparable to BEST-RQ with fewer optimization steps under a fixed training budget. We also analyze how the choice of target layer influences downstream performance and layer-wise information encoding. t-SNE visualizations of phoneme and speaker embeddings reveal clearer clustering and improved target discriminability. Overall, Latent-RQ offers a scalable and effective enhancement to random-quantization-based SSL frameworks for speech representation learning.
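The random-quantization targets described above can be sketched as a BEST-RQ-style frozen random projection followed by a nearest-neighbor codebook lookup; in Latent-RQ, the quantizer input is a latent frame from a pre-trained encoder rather than a Mel spectrogram frame. The dimensions and names below are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration.
latent_dim, proj_dim, codebook_size = 256, 16, 8192

# BEST-RQ-style quantizer: a frozen random projection and a frozen
# random codebook; neither is ever trained.
projection = rng.standard_normal((latent_dim, proj_dim))
codebook = rng.standard_normal((codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(latents: np.ndarray) -> np.ndarray:
    """Map frames of shape (T, latent_dim) to discrete target ids (T,)."""
    # Project each frame into the codebook space, l2-normalize,
    # then pick the nearest codebook entry by cosine similarity.
    z = latents @ projection
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return np.argmax(z @ codebook.T, axis=1)

# In Latent-RQ, `latents` would come from the (periodically updated)
# target encoder; random frames stand in for them here.
targets = quantize(rng.standard_normal((100, latent_dim)))
```

These discrete ids serve as prediction targets for the masked frames during pre-training; because the projection and codebook stay frozen, target quality depends entirely on the quantizer's input features, which motivates feeding it latent representations instead of Mel features.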