Latent-RQ: Enhancing Speech Pre-training with Latent Representations and Random Quantization
Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), PMLR 312:78-93, 2026.
Abstract
Random quantization is a simple yet effective strategy for speech self-supervised pre-training, producing strong encoder representations for a range of downstream tasks. However, existing methods such as BEST-RQ rely on Mel spectrograms, which are low-level acoustic features, as quantizer inputs; this may hinder convergence and limit target quality. We propose Latent-RQ, an extension that replaces direct Mel inputs with richer latent representations extracted from a pre-trained encoder. To further improve target quality, the target encoder is periodically updated during training. Latent-RQ achieves consistent improvements on the SUPERB benchmark, particularly for speech recognition and speaker identification, and reaches performance comparable to BEST-RQ with fewer optimization steps under a fixed training budget. We also analyze how the choice of target layer influences downstream performance and layer-wise information encoding. t-SNE visualizations of phoneme and speaker embeddings reveal clearer clustering and improved target discriminability. Overall, Latent-RQ offers a scalable and effective enhancement to random-quantization-based SSL frameworks for speech representation learning.
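The random-quantization targets described above can be sketched as a BEST-RQ-style frozen random projection followed by a nearest-neighbor codebook lookup; in Latent-RQ, the quantizer input is a latent frame from a pre-trained encoder rather than a Mel spectrogram frame. The dimensions and names below are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration.
latent_dim, proj_dim, codebook_size = 256, 16, 8192

# BEST-RQ-style quantizer: a frozen random projection and a frozen
# random codebook; neither is ever trained.
projection = rng.standard_normal((latent_dim, proj_dim))
codebook = rng.standard_normal((codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(latents: np.ndarray) -> np.ndarray:
    """Map frames of shape (T, latent_dim) to discrete target ids (T,)."""
    # Project each frame into the codebook space, l2-normalize,
    # then pick the nearest codebook entry by cosine similarity.
    z = latents @ projection
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return np.argmax(z @ codebook.T, axis=1)

# In Latent-RQ, `latents` would come from the (periodically updated)
# target encoder; random frames stand in for them here.
targets = quantize(rng.standard_normal((100, latent_dim)))
```

These discrete ids serve as prediction targets for the masked frames during pre-training; because the projection and codebook stay frozen, target quality depends entirely on the quantizer's input features, which motivates feeding it latent representations instead of Mel features.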