AudioRAG: A Challenging Benchmark for Audio Reasoning and Information Retrieval

Jingru Lin, Chen Zhang, Tianrui Wang, Haizhou Li
Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), PMLR 312:113-125, 2026.

Abstract

Recent advancements in Large Audio-Language Models (LALMs), which demonstrate remarkable performance across a range of sound-, speech-, and music-related tasks, have spurred growing interest in benchmarks to assess these models. Existing benchmarks generally focus only on reasoning with internal knowledge, neglecting real-world scenarios that require external information grounding. To bridge this gap, we introduce AudioRAG, a novel benchmark designed to evaluate audio-based reasoning augmented by information retrieval in realistic web environments. The benchmark comprises both LLM-generated and manually curated question-answer pairs. Our evaluations reveal that even state-of-the-art LALMs struggle to answer these questions. We therefore propose an agentic pipeline that integrates audio reasoning with retrieval-augmented generation, providing a stronger baseline for future research.

Cite this Paper


BibTeX
@InProceedings{pmlr-v312-lin26a,
  title     = {{AudioRAG}: A Challenging Benchmark for Audio Reasoning and Information Retrieval},
  author    = {Lin, Jingru and Zhang, Chen and Wang, Tianrui and Li, Haizhou},
  booktitle = {Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI)},
  pages     = {113--125},
  year      = {2026},
  editor    = {Komatsu, Tatsuya and Imoto, Keisuke and Gao, Xiaoxue and Ono, Nobutaka and Chen, Nancy F.},
  volume    = {312},
  series    = {Proceedings of Machine Learning Research},
  month     = {26 Jan},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v312/main/assets/lin26a/lin26a.pdf},
  url       = {https://proceedings.mlr.press/v312/lin26a.html},
  abstract  = {Recent advancements in Large Audio-Language Models (LALMs), which demonstrate remarkable performance across a range of sound-, speech-, and music-related tasks, have spurred growing interest in benchmarks to assess these models. Existing benchmarks generally focus only on reasoning with internal knowledge, neglecting real-world scenarios that require external information grounding. To bridge this gap, we introduce AudioRAG, a novel benchmark designed to evaluate audio-based reasoning augmented by information retrieval in realistic web environments. The benchmark comprises both LLM-generated and manually curated question-answer pairs. Our evaluations reveal that even state-of-the-art LALMs struggle to answer these questions. We therefore propose an agentic pipeline that integrates audio reasoning with retrieval-augmented generation, providing a stronger baseline for future research.}
}
Endnote
%0 Conference Paper
%T AudioRAG: A Challenging Benchmark for Audio Reasoning and Information Retrieval
%A Jingru Lin
%A Chen Zhang
%A Tianrui Wang
%A Haizhou Li
%B Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI)
%C Proceedings of Machine Learning Research
%D 2026
%E Tatsuya Komatsu
%E Keisuke Imoto
%E Xiaoxue Gao
%E Nobutaka Ono
%E Nancy F. Chen
%F pmlr-v312-lin26a
%I PMLR
%P 113--125
%U https://proceedings.mlr.press/v312/lin26a.html
%V 312
%X Recent advancements in Large Audio-Language Models (LALMs), which demonstrate remarkable performance across a range of sound-, speech-, and music-related tasks, have spurred growing interest in benchmarks to assess these models. Existing benchmarks generally focus only on reasoning with internal knowledge, neglecting real-world scenarios that require external information grounding. To bridge this gap, we introduce AudioRAG, a novel benchmark designed to evaluate audio-based reasoning augmented by information retrieval in realistic web environments. The benchmark comprises both LLM-generated and manually curated question-answer pairs. Our evaluations reveal that even state-of-the-art LALMs struggle to answer these questions. We therefore propose an agentic pipeline that integrates audio reasoning with retrieval-augmented generation, providing a stronger baseline for future research.
APA
Lin, J., Zhang, C., Wang, T. & Li, H. (2026). AudioRAG: A Challenging Benchmark for Audio Reasoning and Information Retrieval. Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), in Proceedings of Machine Learning Research 312:113-125. Available from https://proceedings.mlr.press/v312/lin26a.html.