High-Fidelity Simultaneous Speech-To-Speech Translation

Tom Labiausse, Laurent Mazaré, Edouard Grave, Alexandre Défossez, Neil Zeghidour
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:32116-32129, 2025.

Abstract

We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation, which, unlike its consecutive counterpart (where one waits for the end of the source utterance before translating), adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples on huggingface.co/spaces/kyutai/hibiki-samples as well as models and inference code at github.com/kyutai-labs/hibiki.
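The per-word delay alignment described above can be illustrated with a toy sketch: for each target word, find the smallest source prefix after which a translation model assigns it high enough probability (low perplexity). Everything here is a hypothetical stand-in, not the paper's implementation: `word_logprob` is a toy scorer (a real system would query an off-the-shelf MT model), and the aligned-word dictionary and threshold are invented for illustration.

```python
import math

def word_logprob(source_prefix, target_word):
    # Hypothetical scorer standing in for a real MT model's conditional
    # log-probability: the target word becomes likely once its (toy)
    # aligned source word has been heard.
    aligned = {"hello": "bonjour", "world": "monde"}
    src = aligned.get(target_word)
    return math.log(0.9) if src in source_prefix else math.log(0.01)

def optimal_delays(source_words, target_words, threshold=0.5):
    """For each target word, return the number of source words that must be
    consumed before the scorer's probability for it exceeds the threshold."""
    delays = []
    for tgt in target_words:
        delay = len(source_words)  # fall back to waiting for the full source
        for k in range(1, len(source_words) + 1):
            if math.exp(word_logprob(source_words[:k], tgt)) >= threshold:
                delay = k
                break
        delays.append(delay)
    return delays

print(optimal_delays(["bonjour", "le", "monde"], ["hello", "world"]))
# -> [1, 3]: "hello" can be emitted after one source word, "world" after three
```

In this sketch the delays directly yield word-aligned synthetic training pairs: each target word is timestamped to the end of its minimal source prefix.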

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-labiausse25a,
  title     = {High-Fidelity Simultaneous Speech-To-Speech Translation},
  author    = {Labiausse, Tom and Mazar\'{e}, Laurent and Grave, Edouard and D\'{e}fossez, Alexandre and Zeghidour, Neil},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {32116--32129},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/labiausse25a/labiausse25a.pdf},
  url       = {https://proceedings.mlr.press/v267/labiausse25a.html},
  abstract  = {We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation, which unlike its consecutive counterpart (where one waits for the end of the source utterance to start translating) adapts its flow to accumulate just enough context to produce a correct translation in real-time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples on huggingface.co/spaces/kyutai/hibiki-samples as well as models and inference code at github.com/kyutai-labs/hibiki.}
}
Endnote
%0 Conference Paper
%T High-Fidelity Simultaneous Speech-To-Speech Translation
%A Tom Labiausse
%A Laurent Mazaré
%A Edouard Grave
%A Alexandre Défossez
%A Neil Zeghidour
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-labiausse25a
%I PMLR
%P 32116--32129
%U https://proceedings.mlr.press/v267/labiausse25a.html
%V 267
%X We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation, which unlike its consecutive counterpart (where one waits for the end of the source utterance to start translating) adapts its flow to accumulate just enough context to produce a correct translation in real-time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples on huggingface.co/spaces/kyutai/hibiki-samples as well as models and inference code at github.com/kyutai-labs/hibiki.
APA
Labiausse, T., Mazaré, L., Grave, E., Défossez, A. & Zeghidour, N. (2025). High-Fidelity Simultaneous Speech-To-Speech Translation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:32116-32129. Available from https://proceedings.mlr.press/v267/labiausse25a.html.