Latent-RQ: Enhancing Speech Pre-training with Latent Representations and Random Quantization

Muhammad Huzaifah, Hardik Sailor, Jeremy Wong, Nancy F. Chen, Ai Ti Aw
Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), PMLR 312:78-93, 2026.

Abstract

Random quantization is a simple yet effective strategy for speech self-supervised pre-training, producing strong encoder representations for a range of downstream tasks. However, existing methods such as BEST-RQ rely on Mel spectrograms – low-level acoustic features – as quantizer inputs, which may hinder convergence and limit target quality. We propose Latent-RQ, an extension that replaces direct Mel inputs with richer latent representations extracted from a pre-trained encoder. To further enhance target quality, the target encoder is periodically updated during training. Latent-RQ achieves consistent improvements on the SUPERB benchmark, particularly for speech recognition and speaker identification, while reaching comparable performance to BEST-RQ with fewer optimization steps under a fixed training budget. We also analyze how target layer selection influences downstream performance and layer-wise information encoding. t-SNE visualizations of phoneme and speaker embeddings reveal clearer clustering and improved target discriminability. Overall, Latent-RQ offers a scalable and effective enhancement to random quantization-based SSL frameworks for speech representation learning.
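The abstract contrasts quantizing Mel-spectrogram frames (as in BEST-RQ) with quantizing latents from a pre-trained target encoder. As a rough illustration of the random-quantization target mechanism being discussed, the sketch below computes BEST-RQ-style discrete targets with a frozen random projection and random codebook; all dimensions, names, and the cosine-similarity lookup are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen random quantizer (BEST-RQ style): a fixed random projection followed
# by nearest-neighbour lookup in a fixed random codebook. Dimensions below
# are illustrative placeholders, not the paper's settings.
FEAT_DIM, PROJ_DIM, CODEBOOK_SIZE = 80, 16, 8192
projection = rng.standard_normal((FEAT_DIM, PROJ_DIM))     # frozen at init
codebook = rng.standard_normal((CODEBOOK_SIZE, PROJ_DIM))  # frozen at init
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(features: np.ndarray) -> np.ndarray:
    """Map per-frame features (T, FEAT_DIM) to discrete target ids (T,)."""
    z = features @ projection                      # (T, PROJ_DIM)
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize frames
    # Nearest codebook entry (by cosine similarity) is the pre-training
    # target label the masked-prediction encoder is trained to recover.
    return np.argmax(z @ codebook.T, axis=1)

# BEST-RQ would feed Mel-spectrogram frames here; per the abstract, Latent-RQ
# instead feeds latent representations from a pre-trained target encoder.
frames = rng.standard_normal((100, FEAT_DIM))
targets = quantize(frames)
print(targets.shape)  # (100,)
```

Because both the projection and the codebook are random and never trained, the quality of the discrete targets depends entirely on the input features, which is the lever the abstract describes Latent-RQ pulling.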

Cite this Paper


BibTeX
@InProceedings{pmlr-v312-huzaifah26a,
  title     = {Latent-{RQ}: Enhancing Speech Pre-training with Latent Representations and Random Quantization},
  author    = {Huzaifah, Muhammad and Sailor, Hardik and Wong, Jeremy and Chen, Nancy F. and Aw, Ai Ti},
  booktitle = {Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI)},
  pages     = {78--93},
  year      = {2026},
  editor    = {Komatsu, Tatsuya and Imoto, Keisuke and Gao, Xiaoxue and Ono, Nobutaka and Chen, Nancy F.},
  volume    = {312},
  series    = {Proceedings of Machine Learning Research},
  month     = {26 Jan},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v312/main/assets/huzaifah26a/huzaifah26a.pdf},
  url       = {https://proceedings.mlr.press/v312/huzaifah26a.html},
  abstract  = {Random quantization is a simple yet effective strategy for speech self-supervised pre-training, producing strong encoder representations for a range of downstream tasks. However, existing methods such as BEST-RQ rely on Mel spectrograms – low-level acoustic features – as quantizer inputs, which may hinder convergence and limit target quality. We propose Latent-RQ, an extension that replaces direct Mel inputs with richer latent representations extracted from a pre-trained encoder. To further enhance target quality, the target encoder is periodically updated during training. Latent-RQ achieves consistent improvements on the SUPERB benchmark, particularly for speech recognition and speaker identification, while reaching comparable performance to BEST-RQ with fewer optimization steps under a fixed training budget. We also analyze how target layer selection influences downstream performance and layer-wise information encoding. t-SNE visualizations of phoneme and speaker embeddings reveal clearer clustering and improved target discriminability. Overall, Latent-RQ offers a scalable and effective enhancement to random quantization-based SSL frameworks for speech representation learning.}
}
Endnote
%0 Conference Paper
%T Latent-RQ: Enhancing Speech Pre-training with Latent Representations and Random Quantization
%A Muhammad Huzaifah
%A Hardik Sailor
%A Jeremy Wong
%A Nancy F. Chen
%A Ai Ti Aw
%B Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI)
%C Proceedings of Machine Learning Research
%D 2026
%E Tatsuya Komatsu
%E Keisuke Imoto
%E Xiaoxue Gao
%E Nobutaka Ono
%E Nancy F. Chen
%F pmlr-v312-huzaifah26a
%I PMLR
%P 78--93
%U https://proceedings.mlr.press/v312/huzaifah26a.html
%V 312
%X Random quantization is a simple yet effective strategy for speech self-supervised pre-training, producing strong encoder representations for a range of downstream tasks. However, existing methods such as BEST-RQ rely on Mel spectrograms – low-level acoustic features – as quantizer inputs, which may hinder convergence and limit target quality. We propose Latent-RQ, an extension that replaces direct Mel inputs with richer latent representations extracted from a pre-trained encoder. To further enhance target quality, the target encoder is periodically updated during training. Latent-RQ achieves consistent improvements on the SUPERB benchmark, particularly for speech recognition and speaker identification, while reaching comparable performance to BEST-RQ with fewer optimization steps under a fixed training budget. We also analyze how target layer selection influences downstream performance and layer-wise information encoding. t-SNE visualizations of phoneme and speaker embeddings reveal clearer clustering and improved target discriminability. Overall, Latent-RQ offers a scalable and effective enhancement to random quantization-based SSL frameworks for speech representation learning.
APA
Huzaifah, M., Sailor, H., Wong, J., Chen, N.F. & Aw, A.T. (2026). Latent-RQ: Enhancing Speech Pre-training with Latent Representations and Random Quantization. Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), in Proceedings of Machine Learning Research 312:78-93. Available from https://proceedings.mlr.press/v312/huzaifah26a.html.