FastStitch: Speech editing by hitch-hiking a pre-trained FastSpeech2 model
Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL), PMLR 233:1-6, 2024.
Abstract
We present an approach to speech editing that avoids the time-consuming process of training acoustic models from scratch. Our method fine-tunes the upper layers of a pre-trained FastSpeech2 model and, at inference time, fuses their output with information from a reference mel-spectrogram through a convolution-based or attention-based blending network. Comparative evaluations against baseline and state-of-the-art methods on a single-speaker (LJSpeech) and a multi-speaker (VCTK) dataset, using both subjective and objective measures, demonstrate the superior quality of our approach, yielding significantly more natural-sounding speech edits.
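The abstract does not specify the blending network's internals; as a rough illustration of the attention-based variant, the sketch below fuses decoder states from a fine-tuned FastSpeech2 with features of a reference mel-spectrogram via cross-attention. All class names, dimensions, and shapes here are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class AttentionBlending(nn.Module):
    """Hypothetical attention-based blending network: decoder hidden
    states attend to frames of a reference mel-spectrogram, mixing
    acoustic detail from the original recording into the edit."""

    def __init__(self, d_model: int = 256, n_mels: int = 80, n_heads: int = 4):
        super().__init__()
        # Project reference mel frames into the model dimension.
        self.ref_proj = nn.Linear(n_mels, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, decoder_states, ref_mel):
        # decoder_states: (batch, T_dec, d_model) from the fine-tuned upper layers
        # ref_mel:        (batch, T_ref, n_mels) reference mel-spectrogram
        ref = self.ref_proj(ref_mel)
        fused, _ = self.cross_attn(query=decoder_states, key=ref, value=ref)
        # Residual connection preserves the synthesized content; the
        # attention output blends in information from the reference.
        return self.norm(decoder_states + fused)

# Usage sketch: blend states for the edited region with the surrounding
# recording's mel-spectrogram at inference time (shapes are assumptions).
blend = AttentionBlending()
states = torch.randn(1, 120, 256)   # decoder output for the edited text
ref = torch.randn(1, 300, 80)       # mel frames of the original recording
out = blend(states, ref)            # (1, 120, 256)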