Listenable Maps for Audio Classifiers

Francesco Paissan, Mirco Ravanelli, Cem Subakan
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:39009-39021, 2024.

Abstract

Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a posthoc interpretation method that generates faithful and listenable interpretations. L-MAC utilizes a decoder on top of a pretrained classifier to generate binary masks that highlight relevant portions of the input audio. We train the decoder with a loss function that maximizes the confidence of the classifier decision on the masked-in portion of the audio while minimizing the probability of model output for the masked-out portion. Quantitative evaluations on both in-domain and out-of-domain data demonstrate that L-MAC consistently produces more faithful interpretations than several gradient and masking-based methodologies. Furthermore, a user study confirms that, on average, users prefer the interpretations generated by the proposed technique.
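As a rough illustration of the training objective described in the abstract, the sketch below shows one way such a masking loss could be written in PyTorch. It is a minimal sketch under assumptions, not the paper's implementation: the function name lmac_style_masking_loss, the cross-entropy and softmax terms, the weighting factor alpha, and the use of a soft sigmoid mask (binarized when producing the listenable interpretation) are illustrative choices, and the decoder here is applied directly to the input spectrogram for brevity, whereas L-MAC's decoder sits on top of the pretrained classifier's representations.

import torch
import torch.nn.functional as F

def lmac_style_masking_loss(classifier, decoder, spec, alpha=1.0):
    """Hypothetical masking loss in the spirit of L-MAC (details assumed).

    classifier: frozen pretrained model mapping a spectrogram batch to class logits.
    decoder:    network producing a mask of the same shape as `spec`.
    spec:       input spectrograms, shape (batch, freq, time).
    """
    # Decision of the frozen classifier on the full input.
    with torch.no_grad():
        target = classifier(spec).argmax(dim=-1)

    # Soft mask in [0, 1]; a binary mask can be obtained by thresholding at inference.
    mask = torch.sigmoid(decoder(spec))

    logits_in = classifier(mask * spec)            # masked-in portion
    logits_out = classifier((1.0 - mask) * spec)   # masked-out portion

    # Maximize the classifier's confidence in its original decision on the masked-in audio ...
    loss_in = F.cross_entropy(logits_in, target)

    # ... while minimizing the probability it assigns to that decision on the masked-out audio.
    loss_out = F.softmax(logits_out, dim=-1).gather(1, target.unsqueeze(1)).mean()

    return loss_in + alpha * loss_out

Under this reading, only the decoder would receive gradients from the loss, while the pretrained classifier stays fixed, which matches the post-hoc setting described above.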

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-paissan24a,
  title     = {Listenable Maps for Audio Classifiers},
  author    = {Paissan, Francesco and Ravanelli, Mirco and Subakan, Cem},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {39009--39021},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/paissan24a/paissan24a.pdf},
  url       = {https://proceedings.mlr.press/v235/paissan24a.html},
  abstract  = {Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a posthoc interpretation method that generates faithful and listenable interpretations. L-MAC utilizes a decoder on top of a pretrained classifier to generate binary masks that highlight relevant portions of the input audio. We train the decoder with a loss function that maximizes the confidence of the classifier decision on the masked-in portion of the audio while minimizing the probability of model output for the masked-out portion. Quantitative evaluations on both in-domain and out-of-domain data demonstrate that L-MAC consistently produces more faithful interpretations than several gradient and masking-based methodologies. Furthermore, a user study confirms that, on average, users prefer the interpretations generated by the proposed technique.}
}
Endnote
%0 Conference Paper
%T Listenable Maps for Audio Classifiers
%A Francesco Paissan
%A Mirco Ravanelli
%A Cem Subakan
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-paissan24a
%I PMLR
%P 39009--39021
%U https://proceedings.mlr.press/v235/paissan24a.html
%V 235
%X Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a posthoc interpretation method that generates faithful and listenable interpretations. L-MAC utilizes a decoder on top of a pretrained classifier to generate binary masks that highlight relevant portions of the input audio. We train the decoder with a loss function that maximizes the confidence of the classifier decision on the masked-in portion of the audio while minimizing the probability of model output for the masked-out portion. Quantitative evaluations on both in-domain and out-of-domain data demonstrate that L-MAC consistently produces more faithful interpretations than several gradient and masking-based methodologies. Furthermore, a user study confirms that, on average, users prefer the interpretations generated by the proposed technique.
APA
Paissan, F., Ravanelli, M. & Subakan, C. (2024). Listenable Maps for Audio Classifiers. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:39009-39021. Available from https://proceedings.mlr.press/v235/paissan24a.html.

Related Material

Download PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/paissan24a/paissan24a.pdf