ArrayDPS: Unsupervised Blind Speech Separation with a Diffusion Prior

Zhongweiyang Xu, Xulin Fan, Zhong-Qiu Wang, Xilin Jiang, Romit Roy Choudhury
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:69160-69188, 2025.

Abstract

Blind Speech Separation (BSS) aims to separate multiple speech sources from audio mixtures recorded by a microphone array. The problem is challenging because it is a blind inverse problem, i.e., the microphone array geometry, the room impulse response (RIR), and the speech sources are all unknown. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. The core idea builds on diffusion posterior sampling (DPS), but unlike DPS, where the likelihood is tractable, ArrayDPS must approximate the likelihood by formulating a separate optimization problem. The solution to this optimization problem approximates the room acoustics and the relative transfer functions between microphones. These approximations, along with the diffusion priors, iterate through the ArrayDPS sampling process and ultimately yield separated voice sources. We only need a simple single-speaker speech diffusion model as a prior, along with the mixtures recorded at the microphones; no microphone array information is necessary. Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos and code are provided at: https://arraydps.github.io/ArrayDPSDemo/ and https://github.com/ArrayDPS/ArrayDPS.
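
The abstract reports results in terms of SDR (signal-to-distortion ratio). As a rough illustration of what that number measures, here is a minimal sketch of the plain energy-ratio definition, 10·log10(‖s‖² / ‖s − ŝ‖²). This is an assumption for illustration only: the abstract does not say which SDR variant the evaluation uses, and other variants (for example, scale-invariant ones) are computed differently. The function name `sdr_db` and the toy signals are hypothetical and are not taken from the ArrayDPS code.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-12):
    """Plain signal-to-distortion ratio in dB (illustrative definition).

    Computes 10 * log10(||s||^2 / ||s - s_hat||^2), where s is the reference
    source and s_hat is the separated estimate. `eps` guards against division
    by zero and log of zero; it is an assumption of this sketch.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    distortion_power = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_power + eps) / (distortion_power + eps))


# Toy usage: a random "source" and an estimate corrupted by weak noise.
rng = np.random.default_rng(0)
source = rng.standard_normal(16000)
estimate = source + 0.1 * rng.standard_normal(16000)
print(f"SDR: {sdr_db(source, estimate):.1f} dB")
```

A higher value means the estimate is closer to the reference; halving the error amplitude raises this quantity by about 6 dB.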

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-xu25f,
  title     = {{A}rray{DPS}: Unsupervised Blind Speech Separation with a Diffusion Prior},
  author    = {Xu, Zhongweiyang and Fan, Xulin and Wang, Zhong-Qiu and Jiang, Xilin and Roy Choudhury, Romit},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {69160--69188},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/xu25f/xu25f.pdf},
  url       = {https://proceedings.mlr.press/v267/xu25f.html},
  abstract  = {Blind Speech Separation (BSS) aims to separate multiple speech sources from audio mixtures recorded by a microphone array. The problem is challenging because it is a blind inverse problem, i.e., the microphone array geometry, the room impulse response (RIR), and the speech sources, are all unknown. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. The core idea builds on diffusion posterior sampling (DPS), but unlike DPS where the likelihood is tractable, ArrayDPS must approximate the likelihood by formulating a separate optimization problem. The solution to the optimization approximates room acoustics and the relative transfer functions between microphones. These approximations, along with the diffusion priors, iterate through the ArrayDPS sampling process and ultimately yield separated voice sources. We only need a simple single-speaker speech diffusion model as a prior, along with the mixtures recorded at the microphones; no microphone array information is necessary. Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos and codes are provided at: https://arraydps.github.io/ArrayDPSDemo/ and https://github.com/ArrayDPS/ArrayDPS.}
}
Endnote
%0 Conference Paper
%T ArrayDPS: Unsupervised Blind Speech Separation with a Diffusion Prior
%A Zhongweiyang Xu
%A Xulin Fan
%A Zhong-Qiu Wang
%A Xilin Jiang
%A Romit Roy Choudhury
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-xu25f
%I PMLR
%P 69160--69188
%U https://proceedings.mlr.press/v267/xu25f.html
%V 267
%X Blind Speech Separation (BSS) aims to separate multiple speech sources from audio mixtures recorded by a microphone array. The problem is challenging because it is a blind inverse problem, i.e., the microphone array geometry, the room impulse response (RIR), and the speech sources, are all unknown. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. The core idea builds on diffusion posterior sampling (DPS), but unlike DPS where the likelihood is tractable, ArrayDPS must approximate the likelihood by formulating a separate optimization problem. The solution to the optimization approximates room acoustics and the relative transfer functions between microphones. These approximations, along with the diffusion priors, iterate through the ArrayDPS sampling process and ultimately yield separated voice sources. We only need a simple single-speaker speech diffusion model as a prior, along with the mixtures recorded at the microphones; no microphone array information is necessary. Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos and codes are provided at: https://arraydps.github.io/ArrayDPSDemo/ and https://github.com/ArrayDPS/ArrayDPS.
APA
Xu, Z., Fan, X., Wang, Z.-Q., Jiang, X. & Roy Choudhury, R. (2025). ArrayDPS: Unsupervised Blind Speech Separation with a Diffusion Prior. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:69160-69188. Available from https://proceedings.mlr.press/v267/xu25f.html.
