AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio Embedding Sequences

Minoru Kishi, Ryosuke Sakai, Shinnosuke Takamichi, Yusuke Kanamori, Yuki Okamoto
Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), PMLR 312:21-37, 2026.

Abstract

We propose a novel objective evaluation metric for synthesized audio in text-to-audio generation (TTA), aiming to improve the performance of TTA models. In TTA, subjective evaluation of synthesized sounds is important; however, conducting it requires significant monetary and time costs. Therefore, objective evaluation metrics such as mel-cepstral distortion are used instead, but their correlation with subjective evaluation values is weak. Our proposed objective evaluation metric, AudioBERTScore, calculates the similarity between the embedding sequences of the synthesized and reference sounds. The method is based not only on the max-norm used in conventional BERTScore but also on the p-norm, to reflect the non-local nature of environmental sounds. Experimental results show that scores obtained by the proposed method have a higher correlation with subjective evaluation values than conventional metrics.
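The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea, assuming frame-level embeddings are already available as lists of floats. The function names (`cosine`, `pool`, `embedding_score`), the clipping of negative similarities, and the choice of a normalized p-norm (power mean) are illustrative assumptions, not the authors' definitions. With `p=None` the pooling is the max used in BERTScore-style matching; with a finite `p`, every frame pair contributes, which is one way to capture non-local similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def pool(sims, p=None):
    """Pool one row/column of similarities.

    p=None -> max (BERTScore-style greedy matching).
    p>0    -> normalized p-norm (power mean) of similarities clipped at 0.
              This normalization is an assumption made for illustration.
    """
    if p is None:
        return max(sims)
    clipped = [max(s, 0.0) for s in sims]
    return (sum(s ** p for s in clipped) / len(clipped)) ** (1.0 / p)


def embedding_score(ref, syn, p=None):
    """Return (precision, recall, f1) between two embedding sequences."""
    sim = [[cosine(r, s) for s in syn] for r in ref]
    # Recall: each reference frame is pooled over all synthesized frames.
    recall = sum(pool(row, p) for row in sim) / len(sim)
    # Precision: each synthesized frame is pooled over all reference frames.
    cols = list(zip(*sim))
    precision = sum(pool(col, p) for col in cols) / len(cols)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy 2-D "embeddings"; real inputs would come from a pretrained audio encoder.
ref = [[1.0, 0.0], [0.0, 1.0]]
syn = [[1.0, 0.0], [0.0, 1.0]]
print(embedding_score(ref, syn))        # max pooling
print(embedding_score(ref, syn, p=2))   # p-norm pooling
```

In this sketch the power mean never exceeds the max and approaches it as `p` grows, so the max-based score can be read as the limiting case of the p-norm variant.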

Cite this Paper


BibTeX
@InProceedings{pmlr-v312-kishi26a,
  title = {AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio Embedding Sequences},
  author = {Kishi, Minoru and Sakai, Ryosuke and Takamichi, Shinnosuke and Kanamori, Yusuke and Okamoto, Yuki},
  booktitle = {Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI)},
  pages = {21--37},
  year = {2026},
  editor = {Komatsu, Tatsuya and Imoto, Keisuke and Gao, Xiaoxue and Ono, Nobutaka and Chen, Nancy F.},
  volume = {312},
  series = {Proceedings of Machine Learning Research},
  month = {26 Jan},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v312/main/assets/kishi26a/kishi26a.pdf},
  url = {https://proceedings.mlr.press/v312/kishi26a.html},
  abstract = {We propose a novel objective evaluation metric for synthesized audio in text-to-audio generation (TTA), aiming to improve the performance of TTA models. In TTA, subjective evaluation of synthesized sounds is important; however, conducting it requires significant monetary and time costs. Therefore, objective evaluation metrics such as mel-cepstral distortion are used instead, but their correlation with subjective evaluation values is weak. Our proposed objective evaluation metric, AudioBERTScore, calculates the similarity between the embedding sequences of the synthesized and reference sounds. The method is based not only on the max-norm used in conventional BERTScore but also on the p-norm, to reflect the non-local nature of environmental sounds. Experimental results show that scores obtained by the proposed method have a higher correlation with subjective evaluation values than conventional metrics.}
}
Endnote
%0 Conference Paper
%T AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio Embedding Sequences
%A Minoru Kishi
%A Ryosuke Sakai
%A Shinnosuke Takamichi
%A Yusuke Kanamori
%A Yuki Okamoto
%B Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI)
%C Proceedings of Machine Learning Research
%D 2026
%E Tatsuya Komatsu
%E Keisuke Imoto
%E Xiaoxue Gao
%E Nobutaka Ono
%E Nancy F. Chen
%F pmlr-v312-kishi26a
%I PMLR
%P 21--37
%U https://proceedings.mlr.press/v312/kishi26a.html
%V 312
%X We propose a novel objective evaluation metric for synthesized audio in text-to-audio generation (TTA), aiming to improve the performance of TTA models. In TTA, subjective evaluation of synthesized sounds is important; however, conducting it requires significant monetary and time costs. Therefore, objective evaluation metrics such as mel-cepstral distortion are used instead, but their correlation with subjective evaluation values is weak. Our proposed objective evaluation metric, AudioBERTScore, calculates the similarity between the embedding sequences of the synthesized and reference sounds. The method is based not only on the max-norm used in conventional BERTScore but also on the p-norm, to reflect the non-local nature of environmental sounds. Experimental results show that scores obtained by the proposed method have a higher correlation with subjective evaluation values than conventional metrics.
APA
Kishi, M., Sakai, R., Takamichi, S., Kanamori, Y., & Okamoto, Y. (2026). AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio Embedding Sequences. Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), in Proceedings of Machine Learning Research 312:21-37. Available from https://proceedings.mlr.press/v312/kishi26a.html.

Related Material