Measuring and Closing the Retrieval Gap in Financial Question Answering

Amine Kobeissi; Philippe Langlais

Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:1157-1161, 2026.

Abstract

Retrieval-augmented generation (RAG) is increasingly applied to financial question answering over long regulatory documents, yet evaluations typically measure only chunk-level retrieval or end-to-end answer quality, leaving a systematic understanding of where and why pipelines fail out of reach. We introduce an oracle-based evaluation framework that decomposes retrieval performance into document, page, and chunk discovery, providing empirical upper bounds at each granularity and exposing a consistent retrieval gap that persists even when the correct document is found. We systematically evaluate several retrieval strategies on 150 FinanceBench questions, spanning dense, sparse, hybrid, hierarchical, query reformulation, and reranking methods using a shared multi-document index. Our analysis shows that while methods such as Multi-HyDE and cross-encoder reranking improve document recall, page-level retrieval substantially lags behind oracle bounds across all baselines. We further break down performance by question type and document type, revealing that retrieval difficulty varies significantly across these dimensions and that no single strategy closes the gap uniformly. As a targeted intervention, we introduce a domain fine-tuned page scorer that ranks pages before chunk retrieval, achieving strong gains under cross-validation, suggesting that domain-specific and page-level modeling is a promising direction.
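
The document, page, and chunk decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Chunk` record, the function names `hit_at_levels` and `hit_rates`, and the choice of a top-k hit rate as the metric are all assumptions made for illustration, and the toy data is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical record locating a chunk inside a multi-document index."""
    doc_id: str
    page: int
    chunk_id: str


def hit_at_levels(retrieved, gold):
    """Check whether any retrieved chunk matches gold evidence at each granularity."""
    gold_docs = {g.doc_id for g in gold}
    gold_pages = {(g.doc_id, g.page) for g in gold}
    gold_chunks = {g.chunk_id for g in gold}
    return {
        "document": any(c.doc_id in gold_docs for c in retrieved),
        "page": any((c.doc_id, c.page) in gold_pages for c in retrieved),
        "chunk": any(c.chunk_id in gold_chunks for c in retrieved),
    }


def hit_rates(runs, k):
    """Average top-k hit rate per granularity over (retrieved, gold) pairs."""
    if not runs:
        raise ValueError("runs must contain at least one question")
    totals = {"document": 0, "page": 0, "chunk": 0}
    for retrieved, gold in runs:
        # Only the k highest-ranked chunks count toward a hit.
        for level, hit in hit_at_levels(retrieved[:k], gold).items():
            totals[level] += int(hit)
    return {level: count / len(runs) for level, count in totals.items()}


# Toy data (invented): question 1 finds the right document but the wrong page;
# question 2 retrieves the exact gold chunk.
runs = [
    ([Chunk("10K_A", 12, "a-12-0")], [Chunk("10K_A", 47, "a-47-1")]),
    ([Chunk("10K_B", 5, "b-5-2")], [Chunk("10K_B", 5, "b-5-2")]),
]
rates = hit_rates(runs, k=1)
print(rates)
```

In this toy run the document-level rate is higher than the page-level rate, which is the kind of difference the abstract refers to as a retrieval gap that persists even when the correct document is found.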

Cite this Paper

BibTeX

@InProceedings{pmlr-v318-kobeissi26a,
  title = 	 {Measuring and Closing the Retrieval Gap in Financial Question Answering},
  author =       {Kobeissi, Amine and Langlais, Philippe},
  booktitle = 	 {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = 	 {1157--1161},
  year = 	 {2026},
  editor = 	 {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = 	 {318},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--29 May},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v318/main/assets/kobeissi26a/kobeissi26a.pdf},
  url = 	 {https://proceedings.mlr.press/v318/kobeissi26a.html},
  abstract = 	 {Retrieval-augmented generation (RAG) is increasingly applied to financial question answering over long regulatory documents, yet evaluations typically measure only chunk-level retrieval or end-to-end answer quality, leaving a systematic understanding of where and why pipelines fail out of reach. We introduce an oracle-based evaluation framework that decomposes retrieval performance into document, page, and chunk discovery, providing empirical upper bounds at each granularity and exposing a consistent retrieval gap that persists even when the correct document is found. We systematically evaluate several retrieval strategies on 150 FinanceBench questions, spanning dense, sparse, hybrid, hierarchical, query reformulation, and reranking methods using a shared multi-document index. Our analysis shows that while methods such as Multi-HyDE and cross-encoder reranking improve document recall, page-level retrieval substantially lags behind oracle bounds across all baselines. We further break down performance by question type and document type, revealing that retrieval difficulty varies significantly across these dimensions and that no single strategy closes the gap uniformly. As a targeted intervention, we introduce a domain fine-tuned page scorer that ranks pages before chunk retrieval, achieving strong gains under cross-validation, suggesting that domain-specific and page-level modeling is a promising direction.}
}

Endnote

%0 Conference Paper
%T Measuring and Closing the Retrieval Gap in Financial Question Answering
%A Amine Kobeissi
%A Philippe Langlais
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-kobeissi26a
%I PMLR
%P 1157--1161
%U https://proceedings.mlr.press/v318/kobeissi26a.html
%V 318
%X Retrieval-augmented generation (RAG) is increasingly applied to financial question answering over long regulatory documents, yet evaluations typically measure only chunk-level retrieval or end-to-end answer quality, leaving a systematic understanding of where and why pipelines fail out of reach. We introduce an oracle-based evaluation framework that decomposes retrieval performance into document, page, and chunk discovery, providing empirical upper bounds at each granularity and exposing a consistent retrieval gap that persists even when the correct document is found. We systematically evaluate several retrieval strategies on 150 FinanceBench questions, spanning dense, sparse, hybrid, hierarchical, query reformulation, and reranking methods using a shared multi-document index. Our analysis shows that while methods such as Multi-HyDE and cross-encoder reranking improve document recall, page-level retrieval substantially lags behind oracle bounds across all baselines. We further break down performance by question type and document type, revealing that retrieval difficulty varies significantly across these dimensions and that no single strategy closes the gap uniformly. As a targeted intervention, we introduce a domain fine-tuned page scorer that ranks pages before chunk retrieval, achieving strong gains under cross-validation, suggesting that domain-specific and page-level modeling is a promising direction.

APA

Kobeissi, A. & Langlais, P. (2026). Measuring and Closing the Retrieval Gap in Financial Question Answering. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:1157-1161. Available from https://proceedings.mlr.press/v318/kobeissi26a.html.
