FastStitch: Speech editing by hitch-hiking a pre-trained FastSpeech2 model

Antonios Alexos, Pierre Baldi
Proceedings of the 5th Northern Lights Deep Learning Conference ({NLDL}), PMLR 233:1-6, 2024.

Abstract

We present an approach to speech editing that avoids the time-consuming process of training acoustic models from scratch. Our method fine-tunes the upper layers of a pre-trained FastSpeech2 model and, during inference, fuses it with information from a reference mel-spectrogram via a convolution-based or attention-based blending network. Comparative evaluations against baseline and state-of-the-art methods on single-speaker (LJSpeech) and multi-speaker (VCTK) datasets, employing both subjective and objective measures, demonstrate the superior quality of our approach, yielding significantly more natural-sounding speech edits.
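
The blending step described in the abstract can be pictured with a short sketch. The module below is a hypothetical illustration of the attention-based variant, not the authors' released code: decoder hidden states from a partially fine-tuned FastSpeech2 act as queries over frames of a reference mel-spectrogram, and the attended context is mixed back in residually. All class names, dimensions, and hyperparameters are illustrative assumptions, written here in PyTorch.

# Hypothetical sketch of an attention-based blending network, assuming a
# FastSpeech2-style decoder with hidden size 256 and 80-bin mel-spectrograms.
# These names and shapes are illustrative, not from the paper's code.
import torch
import torch.nn as nn

class AttentionBlender(nn.Module):
    def __init__(self, hidden_dim: int = 256, n_mels: int = 80, n_heads: int = 4):
        super().__init__()
        # Project reference mel frames into the decoder's hidden space.
        self.mel_proj = nn.Linear(n_mels, hidden_dim)
        # Cross-attention: queries come from FastSpeech2 hidden states,
        # keys/values from the projected reference mel-spectrogram.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, hidden: torch.Tensor, ref_mel: torch.Tensor) -> torch.Tensor:
        # hidden:  (batch, T_dec, hidden_dim) decoder states for the edited text
        # ref_mel: (batch, T_ref, n_mels)     mel frames of the reference audio
        ref = self.mel_proj(ref_mel)
        context, _ = self.cross_attn(query=hidden, key=ref, value=ref)
        # Residual blend keeps the synthesized content while borrowing
        # speaker/prosody information from the reference recording.
        return self.norm(hidden + context)

# Minimal usage example with random tensors.
blender = AttentionBlender()
hidden = torch.randn(2, 120, 256)   # decoder hidden states
ref_mel = torch.randn(2, 300, 80)   # reference mel-spectrogram
out = blender(hidden, ref_mel)      # shape: (2, 120, 256)

A convolution-based variant, as the abstract suggests, would presumably replace the cross-attention with 1D convolutions over time-aligned features; the residual-fusion idea stays the same.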

Cite this Paper


BibTeX
@InProceedings{pmlr-v233-alexos24a,
  title     = {FastStitch: Speech editing by hitch-hiking a pre-trained FastSpeech2 model},
  author    = {Alexos, Antonios and Baldi, Pierre},
  booktitle = {Proceedings of the 5th Northern Lights Deep Learning Conference ({NLDL})},
  pages     = {1--6},
  year      = {2024},
  editor    = {Lutchyn, Tetiana and Ramírez Rivera, Adín and Ricaud, Benjamin},
  volume    = {233},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--11 Jan},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v233/alexos24a/alexos24a.pdf},
  url       = {https://proceedings.mlr.press/v233/alexos24a.html},
  abstract  = {We present an innovative approach to speech editing, mitigating the time-consuming process of training acoustic models from scratch. Our methodology involves fine-tuning the upper layers of a pre-trained FastSpeech2 model and fusing it with information from a reference mel-spectrogram during inference via a convolution-based, or an attention-based, blending network. Comparative evaluations against baseline methods and against state-of-the-art techniques on single-speaker (LJSpeech) as well as multi-speaker (VCTK) datasets, employing both subjective and objective measures, demonstrate the superior quality of our approach, yielding significantly more natural-sounding speech edits.}
}
Endnote
%0 Conference Paper
%T FastStitch: Speech editing by hitch-hiking a pre-trained FastSpeech2 model
%A Antonios Alexos
%A Pierre Baldi
%B Proceedings of the 5th Northern Lights Deep Learning Conference ({NLDL})
%C Proceedings of Machine Learning Research
%D 2024
%E Tetiana Lutchyn
%E Adín Ramírez Rivera
%E Benjamin Ricaud
%F pmlr-v233-alexos24a
%I PMLR
%P 1--6
%U https://proceedings.mlr.press/v233/alexos24a.html
%V 233
%X We present an innovative approach to speech editing, mitigating the time-consuming process of training acoustic models from scratch. Our methodology involves fine-tuning the upper layers of a pre-trained FastSpeech2 model and fusing it with information from a reference mel-spectrogram during inference via a convolution-based, or an attention-based, blending network. Comparative evaluations against baseline methods and against state-of-the-art techniques on single-speaker (LJSpeech) as well as multi-speaker (VCTK) datasets, employing both subjective and objective measures, demonstrate the superior quality of our approach, yielding significantly more natural-sounding speech edits.
APA
Alexos, A. & Baldi, P. (2024). FastStitch: Speech editing by hitch-hiking a pre-trained FastSpeech2 model. Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL), in Proceedings of Machine Learning Research 233:1-6. Available from https://proceedings.mlr.press/v233/alexos24a.html.