Sounding that Object: Interactive Object-Aware Image to Audio Generation

Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:34774-34794, 2025.

Abstract

Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds.
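
For intuition only, the sketch below shows one generic way a segmentation mask can restrict cross-attention from audio latent tokens to image patch tokens, so that only the user-selected object conditions generation. This is an illustrative PyTorch sketch under assumed shapes and names (MaskedCrossAttention, to_q/to_k/to_v, patch_mask), not the authors' released implementation or exact architecture.

# Illustrative sketch (not the paper's code): object-masked cross-attention
# conditioning for one step of a conditional latent diffusion model.
# Image patch tokens outside the user-selected object are excluded from the
# attention, so only the selected object's regions condition the audio latent.
# All module names and tensor shapes here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedCrossAttention(nn.Module):
    """Cross-attention from audio latent tokens to image patch tokens,
    restricted to the patches covered by a segmentation mask."""

    def __init__(self, audio_dim: int, image_dim: int, attn_dim: int = 256):
        super().__init__()
        self.to_q = nn.Linear(audio_dim, attn_dim)
        self.to_k = nn.Linear(image_dim, attn_dim)
        self.to_v = nn.Linear(image_dim, attn_dim)
        self.to_out = nn.Linear(attn_dim, audio_dim)

    def forward(self, audio_tokens, image_tokens, patch_mask):
        # audio_tokens: (B, T, audio_dim)   noisy audio latent sequence
        # image_tokens: (B, N, image_dim)   image patch embeddings
        # patch_mask:   (B, N) in {0, 1}    1 = patch belongs to selected object
        q = self.to_q(audio_tokens)                              # (B, T, D)
        k = self.to_k(image_tokens)                              # (B, N, D)
        v = self.to_v(image_tokens)                              # (B, N, D)

        scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5      # (B, T, N)
        # Drop patches outside the selected object before the softmax.
        scores = scores.masked_fill(patch_mask.unsqueeze(1) == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)                         # (B, T, N)
        return self.to_out(attn @ v)                             # (B, T, audio_dim)


if __name__ == "__main__":
    B, T, N = 2, 64, 196          # batch, audio latent length, image patches
    layer = MaskedCrossAttention(audio_dim=128, image_dim=768)
    audio = torch.randn(B, T, 128)
    image = torch.randn(B, N, 768)
    mask = torch.zeros(B, N)
    mask[:, :50] = 1.0            # pretend the first 50 patches cover the object
    print(layer(audio, image, mask).shape)   # torch.Size([2, 64, 128])

At test time, swapping in a different mask (e.g. from an off-the-shelf segmenter) changes which object's patches drive the conditioning, which is the interactive, object-level behavior the abstract describes.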

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-li25ai,
  title     = {Sounding that Object: Interactive Object-Aware Image to Audio Generation},
  author    = {Li, Tingle and Huang, Baihe and Zhuang, Xiaobin and Jia, Dongya and Chen, Jiawei and Wang, Yuping and Chen, Zhuo and Anumanchipalli, Gopala and Wang, Yuxuan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {34774--34794},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/li25ai/li25ai.pdf},
  url       = {https://proceedings.mlr.press/v267/li25ai.html}
}
Endnote
%0 Conference Paper
%T Sounding that Object: Interactive Object-Aware Image to Audio Generation
%A Tingle Li
%A Baihe Huang
%A Xiaobin Zhuang
%A Dongya Jia
%A Jiawei Chen
%A Yuping Wang
%A Zhuo Chen
%A Gopala Anumanchipalli
%A Yuxuan Wang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-li25ai
%I PMLR
%P 34774--34794
%U https://proceedings.mlr.press/v267/li25ai.html
%V 267
APA
Li, T., Huang, B., Zhuang, X., Jia, D., Chen, J., Wang, Y., Chen, Z., Anumanchipalli, G. & Wang, Y. (2025). Sounding that Object: Interactive Object-Aware Image to Audio Generation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:34774-34794. Available from https://proceedings.mlr.press/v267/li25ai.html.