Conditional Vocal Timbral Technique Conversion via Embedding-Guided Dual Attribute Modulation

Ting-Chao Hsu, Yi-Hsuan Yang
Proceedings of Machine Learning Research, PMLR 303:1-14, 2026.

Abstract

Vocal timbral techniques such as whisper, falsetto, and vocal fry scream uniquely shape the spectral properties of the human voice, presenting a complex challenge for converting between them while preserving the original speaker’s identity. Traditional voice conversion methods, while effective at altering speaker identity or broad timbral qualities, often struggle to transform specialized timbral techniques without compromising speaker-specific traits. Similarly, existing style-transfer models, which are designed to capture broad categories like emotional expressiveness or singing styles, lack the necessary granularity to handle technique-specific variations. To address this, we propose FABYOL, a novel framework for timbral technique conversion built upon FACodec. FABYOL leverages supervised contrastive learning to generate embeddings that encode specific timbral techniques. These embeddings are then used to modulate timbre and prosody, enabling authentic technique conversion while preserving speaker identity. Experimental evaluation, using both tailored objective metrics and a user study, demonstrates that FABYOL achieves promising performance and offers significant improvements in fidelity and flexibility compared to state-of-the-art models. To support this task, we also introduce the EMO dataset, a high-quality, paired corpus developed with a specific focus on vocal fry scream. Audio samples, source code, pre-trained checkpoints, and the EMO dataset are available at https://alberthsu0509.github.io/FABYOL/.
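
For readers unfamiliar with the two mechanisms the abstract names, the sketch below illustrates them in isolation: a supervised contrastive (SupCon) loss that pulls same-technique embeddings together, and a FiLM-style layer through which such an embedding modulates an attribute stream. This is a minimal sketch assuming PyTorch; the function and class names, tensor shapes, and the choice of FiLM conditioning are our illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def supervised_contrastive_loss(z, labels, temperature=0.07):
        """SupCon loss (Khosla et al., 2020): pull embeddings with the same
        technique label together, push different labels apart."""
        z = F.normalize(z, dim=1)                        # (N, D) unit-norm embeddings
        sim = z @ z.t() / temperature                    # (N, N) scaled cosine similarity
        self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
        pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
        sim = sim.masked_fill(self_mask, float('-inf'))  # exclude self-pairs
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        # mean log-likelihood of the positive pairs for each anchor
        pos_counts = pos_mask.sum(1).clamp(min=1)
        loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_counts
        return loss[pos_mask.any(1)].mean()              # skip anchors with no positive

    class TechniqueModulation(nn.Module):
        """Hypothetical FiLM-style conditioning: a technique embedding predicts
        a per-channel scale and shift applied to an attribute stream (e.g. the
        timbre or prosody representations of a codec such as FACodec)."""
        def __init__(self, emb_dim, feat_dim):
            super().__init__()
            self.to_scale_shift = nn.Linear(emb_dim, 2 * feat_dim)

        def forward(self, feats, tech_emb):
            # feats: (B, T, C) attribute features; tech_emb: (B, emb_dim)
            scale, shift = self.to_scale_shift(tech_emb).chunk(2, dim=-1)
            return feats * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

In a pipeline of this shape, the embedding network would be trained with the contrastive loss on technique-labeled clips, and the resulting embeddings would then drive modulation layers like the one above on both the timbre and prosody branches; how FABYOL realizes this exactly is detailed in the paper itself.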

Cite this Paper

BibTeX
@InProceedings{pmlr-v303-hsu26a,
  title     = {Conditional Vocal Timbral Technique Conversion via Embedding-Guided Dual Attribute Modulation},
  author    = {Hsu, Ting-Chao and Yang, Yi-Hsuan},
  booktitle = {Proceedings of Machine Learning Research},
  pages     = {1--14},
  year      = {2026},
  editor    = {Herremans, Dorien and Bhandari, Keshav and Roy, Abhinaba and Colton, Simon and Barthet, Mathieu},
  volume    = {303},
  series    = {Proceedings of Machine Learning Research},
  month     = {26 Jan},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v303/main/assets/hsu26a/hsu26a.pdf},
  url       = {https://proceedings.mlr.press/v303/hsu26a.html}
}
Endnote
%0 Conference Paper
%T Conditional Vocal Timbral Technique Conversion via Embedding-Guided Dual Attribute Modulation
%A Ting-Chao Hsu
%A Yi-Hsuan Yang
%B Proceedings of Machine Learning Research
%C Proceedings of Machine Learning Research
%D 2026
%E Dorien Herremans
%E Keshav Bhandari
%E Abhinaba Roy
%E Simon Colton
%E Mathieu Barthet
%F pmlr-v303-hsu26a
%I PMLR
%P 1--14
%U https://proceedings.mlr.press/v303/hsu26a.html
%V 303
APA
Hsu, T. & Yang, Y. (2026). Conditional Vocal Timbral Technique Conversion via Embedding-Guided Dual Attribute Modulation. Proceedings of Machine Learning Research, in Proceedings of Machine Learning Research 303:1-14. Available from https://proceedings.mlr.press/v303/hsu26a.html.
