InstructSpeech: Following Speech Editing Instructions via Large Language Models

Rongjie Huang, Ruofan Hu, Yongqi Wang, Zehan Wang, Xize Cheng, Ziyue Jiang, Zhenhui Ye, Dongchao Yang, Luping Liu, Peng Gao, Zhou Zhao
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:19886-19903, 2024.

Abstract

Instruction-guided speech editing aims to follow the user’s natural language instruction to manipulate the semantic and acoustic attributes of speech. In this work, we construct triplet paired data (instruction, input speech, output speech) to alleviate data scarcity and train a multi-task large language model named InstructSpeech. To mitigate the challenges of accurately executing the user’s instructions, we 1) introduce learned task embeddings with a fine-tuned Flan-T5-XL to guide the generation process towards the correct generative task; 2) include an extensive and diverse set of speech editing and processing tasks to enhance model capabilities; 3) investigate chain-of-thought reasoning for free-form semantic content editing; and 4) propose a hierarchical adapter that effectively updates a small portion of parameters for generalization to new tasks. To assess instruction speech editing in greater depth, we introduce a benchmark evaluation with contrastive instruction-speech pre-training (CISP) to test speech quality and instruction-speech alignment faithfulness. Experimental results demonstrate that InstructSpeech achieves state-of-the-art results in eleven tasks, for the first time unlocking the ability to edit speech’s acoustic and semantic attributes following a user’s instruction. Audio samples are available at https://InstructSpeech.github.io

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-huang24k,
  title     = {{I}nstruct{S}peech: Following Speech Editing Instructions via Large Language Models},
  author    = {Huang, Rongjie and Hu, Ruofan and Wang, Yongqi and Wang, Zehan and Cheng, Xize and Jiang, Ziyue and Ye, Zhenhui and Yang, Dongchao and Liu, Luping and Gao, Peng and Zhao, Zhou},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {19886--19903},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/huang24k/huang24k.pdf},
  url       = {https://proceedings.mlr.press/v235/huang24k.html},
  abstract  = {Instruction-guided speech editing aims to follow the user’s natural language instruction to manipulate the semantic and acoustic attributes of a speech. In this work, we construct triplet paired data (instruction, input speech, output speech) to alleviate data scarcity and train a multi-task large language model named InstructSpeech. To mitigate the challenges of accurately executing user’s instructions, we 1) introduce the learned task embeddings with a fine-tuned Flan-T5-XL to guide the generation process towards the correct generative task; 2) include an extensive and diverse set of speech editing and processing tasks to enhance model capabilities; 3) investigate chain-of-thought reasoning for free-form semantic content editing; and 4) propose a hierarchical adapter that effectively updates a small portion of parameters for generalization to new tasks. To assess instruction speech editing in greater depth, we introduce a benchmark evaluation with contrastive instruction-speech pre-training (CISP) to test the speech quality and instruction-speech alignment faithfulness. Experimental results demonstrate that InstructSpeech achieves state-of-the-art results in eleven tasks, for the first time unlocking the ability to edit speech’s acoustic and semantic attributes following a user’s instruction. Audio samples are available at https://InstructSpeech.github.io}
}
Endnote
%0 Conference Paper
%T InstructSpeech: Following Speech Editing Instructions via Large Language Models
%A Rongjie Huang
%A Ruofan Hu
%A Yongqi Wang
%A Zehan Wang
%A Xize Cheng
%A Ziyue Jiang
%A Zhenhui Ye
%A Dongchao Yang
%A Luping Liu
%A Peng Gao
%A Zhou Zhao
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-huang24k
%I PMLR
%P 19886--19903
%U https://proceedings.mlr.press/v235/huang24k.html
%V 235
%X Instruction-guided speech editing aims to follow the user’s natural language instruction to manipulate the semantic and acoustic attributes of a speech. In this work, we construct triplet paired data (instruction, input speech, output speech) to alleviate data scarcity and train a multi-task large language model named InstructSpeech. To mitigate the challenges of accurately executing user’s instructions, we 1) introduce the learned task embeddings with a fine-tuned Flan-T5-XL to guide the generation process towards the correct generative task; 2) include an extensive and diverse set of speech editing and processing tasks to enhance model capabilities; 3) investigate chain-of-thought reasoning for free-form semantic content editing; and 4) propose a hierarchical adapter that effectively updates a small portion of parameters for generalization to new tasks. To assess instruction speech editing in greater depth, we introduce a benchmark evaluation with contrastive instruction-speech pre-training (CISP) to test the speech quality and instruction-speech alignment faithfulness. Experimental results demonstrate that InstructSpeech achieves state-of-the-art results in eleven tasks, for the first time unlocking the ability to edit speech’s acoustic and semantic attributes following a user’s instruction. Audio samples are available at https://InstructSpeech.github.io
APA
Huang, R., Hu, R., Wang, Y., Wang, Z., Cheng, X., Jiang, Z., Ye, Z., Yang, D., Liu, L., Gao, P., & Zhao, Z. (2024). InstructSpeech: Following Speech Editing Instructions via Large Language Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:19886-19903. Available from https://proceedings.mlr.press/v235/huang24k.html.
