Expressive Neural Voice Cloning

Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley
Proceedings of The 13th Asian Conference on Machine Learning, PMLR 157:252-267, 2021.

Abstract

Voice cloning is the task of learning to synthesize the voice of an unseen speaker from a few samples. While current voice cloning methods achieve promising results in Text-to-Speech (TTS) synthesis for a new voice, these approaches lack the ability to control the expressiveness of synthesized audio. In this work, we propose a controllable voice cloning method that allows fine-grained control over various style aspects of the synthesized speech for an unseen speaker. We achieve this by explicitly conditioning the speech synthesis model on a speaker encoding, pitch contour and latent style tokens during training. Through both quantitative and qualitative evaluations, we show that our framework can be used for various expressive voice cloning tasks using only a few transcribed or untranscribed speech samples for a new speaker. These cloning tasks include style transfer from a reference speech, synthesizing speech directly from text, and fine-grained style control by manipulating the style conditioning variables during inference.

Cite this Paper


BibTeX
@InProceedings{pmlr-v157-neekhara21a, title = {Expressive Neural Voice Cloning}, author = {Neekhara, Paarth and Hussain, Shehzeen and Dubnov, Shlomo and Koushanfar, Farinaz and McAuley, Julian}, booktitle = {Proceedings of The 13th Asian Conference on Machine Learning}, pages = {252--267}, year = {2021}, editor = {Balasubramanian, Vineeth N. and Tsang, Ivor}, volume = {157}, series = {Proceedings of Machine Learning Research}, month = {17--19 Nov}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v157/neekhara21a/neekhara21a.pdf}, url = {https://proceedings.mlr.press/v157/neekhara21a.html}, abstract = {Voice cloning is the task of learning to synthesize the voice of an unseen speaker from a few samples. While current voice cloning methods achieve promising results in Text-to-Speech (TTS) synthesis for a new voice, these approaches lack the ability to control the expressiveness of synthesized audio. In this work, we propose a controllable voice cloning method that allows fine-grained control over various style aspects of the synthesized speech for an unseen speaker. We achieve this by explicitly conditioning the speech synthesis model on a speaker encoding, pitch contour and latent style tokens during training. Through both quantitative and qualitative evaluations, we show that our framework can be used for various expressive voice cloning tasks using only a few transcribed or untranscribed speech samples for a new speaker. These cloning tasks include style transfer from a reference speech, synthesizing speech directly from text, and fine-grained style control by manipulating the style conditioning variables during inference.} }
Endnote
%0 Conference Paper %T Expressive Neural Voice Cloning %A Paarth Neekhara %A Shehzeen Hussain %A Shlomo Dubnov %A Farinaz Koushanfar %A Julian McAuley %B Proceedings of The 13th Asian Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2021 %E Vineeth N. Balasubramanian %E Ivor Tsang %F pmlr-v157-neekhara21a %I PMLR %P 252--267 %U https://proceedings.mlr.press/v157/neekhara21a.html %V 157 %X Voice cloning is the task of learning to synthesize the voice of an unseen speaker from a few samples. While current voice cloning methods achieve promising results in Text-to-Speech (TTS) synthesis for a new voice, these approaches lack the ability to control the expressiveness of synthesized audio. In this work, we propose a controllable voice cloning method that allows fine-grained control over various style aspects of the synthesized speech for an unseen speaker. We achieve this by explicitly conditioning the speech synthesis model on a speaker encoding, pitch contour and latent style tokens during training. Through both quantitative and qualitative evaluations, we show that our framework can be used for various expressive voice cloning tasks using only a few transcribed or untranscribed speech samples for a new speaker. These cloning tasks include style transfer from a reference speech, synthesizing speech directly from text, and fine-grained style control by manipulating the style conditioning variables during inference.
APA
Neekhara, P., Hussain, S., Dubnov, S., Koushanfar, F. & McAuley, J.. (2021). Expressive Neural Voice Cloning. Proceedings of The 13th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 157:252-267 Available from https://proceedings.mlr.press/v157/neekhara21a.html.

Related Material