Investigating Timbre Representations in CLAP Across Modalities via Perturbations

Devyani Hebbar, Brian McFee
Proceedings of Machine Learning Research, PMLR 303:1-12, 2026.

Abstract

The transition from feature-based language-audio representations to higher-dimensional representations from pre-trained foundation models has made it possible to map audio content to a significantly broader vocabulary of natural language. However, some interpretability of the alignment between the embedding spaces of the two modalities, and of their relation to psychoacoustic features, is lost as a byproduct. In this study, we investigate timbre representations in CLAP in both the text embedding space and the audio embedding space. We identify directions for different timbral qualities in each embedding space and use them as perturbation vectors in uni-modal and cross-modal Text-to-Music (TTM) Retrieval and Generation downstream tasks. We find that although both audio and text embeddings move monotonically along their respective timbre directions, timbral variation is more linearly distributed, and therefore more easily exploitable, in the audio embedding space. Cross-modal perturbation experiments further reveal that the audio and text embedding spaces form a geometrically aligned subspace with respect to timbre. Additionally, our analysis identifies cases where CLAP’s timbre representations closely align with perceptually grounded spectral features, and cases where such alignment is limited.
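
Since the abstract describes the perturbation procedure only at a high level, a minimal sketch may help make the idea concrete. The snippet below is an illustration, not the authors' actual method: it assumes the Hugging Face transformers CLAP implementation, a hypothetical checkpoint name, hypothetical prompt pairs, and a simple mean-difference estimator for the timbre direction, any of which may differ from what the paper uses.

    # A minimal sketch of deriving a "bright" timbre direction in CLAP's
    # text embedding space and using it as a perturbation vector for TTM
    # retrieval. Checkpoint, prompts, and estimator are assumptions.
    import torch
    import torch.nn.functional as F
    from transformers import ClapModel, ClapProcessor

    model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
    processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
    model.eval()

    def embed_text(prompts):
        # Project prompts into CLAP's shared embedding space, L2-normalized.
        inputs = processor(text=prompts, return_tensors="pt", padding=True)
        with torch.no_grad():
            return F.normalize(model.get_text_features(**inputs), dim=-1)

    # Hypothetical contrastive prompt pairs for one timbral quality.
    plain = ["a recording of a piano", "a recording of a trumpet"]
    bright = ["a recording of a bright piano", "a recording of a bright trumpet"]

    # One simple direction estimate: mean difference of paired embeddings.
    direction = (embed_text(bright) - embed_text(plain)).mean(dim=0)
    direction = F.normalize(direction, dim=0)

    def perturb(z, alpha):
        # Shift an embedding along the timbre direction, then renormalize
        # so it stays on the unit hypersphere used for cosine retrieval.
        return F.normalize(z + alpha * direction, dim=-1)

    # Rank a candidate audio-embedding matrix (n_tracks x dim), obtained
    # via model.get_audio_features, by cosine similarity to the query.
    query = perturb(embed_text(["a recording of a piano"]), alpha=0.5)
    # scores = query @ audio_embeddings.T

Scaling alpha sweeps the query along the timbre direction; the paper's monotonicity finding suggests retrieval results should shift correspondingly, with the audio-space analogue of this perturbation behaving more linearly than the text-space one.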

Cite this Paper

BibTeX
@InProceedings{pmlr-v303-hebbar26a,
  title     = {Investigating Timbre Representations in CLAP Across Modalities via Perturbations},
  author    = {Hebbar, Devyani and McFee, Brian},
  booktitle = {Proceedings of Machine Learning Research},
  pages     = {1--12},
  year      = {2026},
  editor    = {Herremans, Dorien and Bhandari, Keshav and Roy, Abhinaba and Colton, Simon and Barthet, Mathieu},
  volume    = {303},
  series    = {Proceedings of Machine Learning Research},
  month     = {26 Jan},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v303/main/assets/hebbar26a/hebbar26a.pdf},
  url       = {https://proceedings.mlr.press/v303/hebbar26a.html}
}
Endnote
%0 Conference Paper
%T Investigating Timbre Representations in CLAP Across Modalities via Perturbations
%A Devyani Hebbar
%A Brian McFee
%B Proceedings of Machine Learning Research
%C Proceedings of Machine Learning Research
%D 2026
%E Dorien Herremans
%E Keshav Bhandari
%E Abhinaba Roy
%E Simon Colton
%E Mathieu Barthet
%F pmlr-v303-hebbar26a
%I PMLR
%P 1--12
%U https://proceedings.mlr.press/v303/hebbar26a.html
%V 303
APA
Hebbar, D. & McFee, B. (2026). Investigating Timbre Representations in CLAP Across Modalities via Perturbations. In Proceedings of Machine Learning Research 303:1-12. Available from https://proceedings.mlr.press/v303/hebbar26a.html.