FLAM: Frame-Wise Language-Audio Modeling

Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, Justin Salamon
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:67719-67740, 2025.

Abstract

Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack the fine-grained labeling needed to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances, during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions, and simulation. Experimental results and case studies demonstrate that FLAM significantly improves open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
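
The core technique the abstract names is a frame-wise contrastive objective with logit adjustment to counter label imbalance. The snippet below is a minimal sketch of that idea, not the paper's actual implementation: it assumes per-frame audio embeddings, one text (event) embedding per clip, binary frame-level activity labels from simulated mixtures, and a known per-event positive prior. All names, shapes, and the temperature value are illustrative assumptions.

```python
# Minimal sketch of a frame-wise audio-text objective with binary logit
# adjustment (in the spirit of Menon et al., 2021). Illustrative only;
# not FLAM's actual code.
import torch
import torch.nn.functional as F

def frame_wise_loss(audio_frames, text_emb, frame_labels, event_prior, tau=0.07):
    """audio_frames: (B, T, D) per-frame audio embeddings
    text_emb:     (B, D)    one text/event embedding per clip
    frame_labels: (B, T)    1.0 where the described event is active
    event_prior:  (B,)      assumed positive rate of each event, in (0, 1)"""
    a = F.normalize(audio_frames, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # Per-frame similarity logits between each clip's frames and its caption.
    logits = torch.einsum("btd,bd->bt", a, t) / tau
    # Binary logit adjustment: add the prior's log-odds during training so
    # scores stay calibrated under label imbalance; drop this term at inference.
    logits = logits + torch.log(event_prior / (1.0 - event_prior)).unsqueeze(1)
    # Each frame is treated as an independent detection decision.
    return F.binary_cross_entropy_with_logits(logits, frame_labels.float())

# Toy usage with random tensors.
B, T, D = 4, 100, 512
loss = frame_wise_loss(torch.randn(B, T, D), torch.randn(B, D),
                       torch.randint(0, 2, (B, T)).float(),
                       torch.full((B,), 0.1))
```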

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-wu25ab,
  title     = {{FLAM}: Frame-Wise Language-Audio Modeling},
  author    = {Wu, Yusong and Tsirigotis, Christos and Chen, Ke and Huang, Cheng-Zhi Anna and Courville, Aaron and Nieto, Oriol and Seetharaman, Prem and Salamon, Justin},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {67719--67740},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/wu25ab/wu25ab.pdf},
  url       = {https://proceedings.mlr.press/v267/wu25ab.html},
  abstract  = {Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.}
}
Endnote
%0 Conference Paper
%T FLAM: Frame-Wise Language-Audio Modeling
%A Yusong Wu
%A Christos Tsirigotis
%A Ke Chen
%A Cheng-Zhi Anna Huang
%A Aaron Courville
%A Oriol Nieto
%A Prem Seetharaman
%A Justin Salamon
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-wu25ab
%I PMLR
%P 67719--67740
%U https://proceedings.mlr.press/v267/wu25ab.html
%V 267
%X Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
APA
Wu, Y., Tsirigotis, C., Chen, K., Huang, C.A., Courville, A., Nieto, O., Seetharaman, P. & Salamon, J. (2025). FLAM: Frame-Wise Language-Audio Modeling. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:67719-67740. Available from https://proceedings.mlr.press/v267/wu25ab.html.
