Measuring and Closing the Retrieval Gap in Financial Question Answering
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:1157-1161, 2026.
Abstract
Retrieval-augmented generation (RAG) is increasingly applied to financial question answering over long regulatory documents, yet evaluations typically measure only chunk-level retrieval or end-to-end answer quality, leaving a systematic understanding of where and why pipelines fail out of reach. We introduce an oracle-based evaluation framework that decomposes retrieval performance into document, page, and chunk discovery, providing empirical upper bounds at each granularity and exposing a consistent retrieval gap that persists even when the correct document is found. We systematically evaluate several retrieval strategies on 150 FinanceBench questions, spanning dense, sparse, hybrid, hierarchical, query reformulation, and reranking methods using a shared multi-document index. Our analysis shows that while methods such as Multi-HyDE and cross-encoder reranking improve document recall, page-level retrieval substantially lags behind oracle bounds across all baselines. We further break down performance by question type and document type, revealing that retrieval difficulty varies significantly across these dimensions and that no single strategy closes the gap uniformly. As a targeted intervention, we introduce a domain fine-tuned page scorer that ranks pages before chunk retrieval. It achieves strong gains under cross-validation, suggesting that domain-specific, page-level modeling is a promising direction.
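To make the document, page, and chunk decomposition concrete, the following is a minimal sketch of how hit rates at the three granularities might be computed for a ranked retrieval list. It is an illustration only, not the paper's evaluation framework: the `Chunk` record, the `hit_levels` and `aggregate` helpers, the single gold chunk per question, and the cutoff `k` are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical retrieval unit: a chunk located on one page of one document."""
    doc_id: str
    page: int
    chunk_id: str


def hit_levels(retrieved, gold, k=5):
    """Check whether the top-k results reach the gold document, page, and chunk.

    Assumes one gold chunk per question; a real benchmark may have several.
    """
    top = retrieved[:k]
    return {
        "document": any(c.doc_id == gold.doc_id for c in top),
        "page": any((c.doc_id, c.page) == (gold.doc_id, gold.page) for c in top),
        "chunk": any(c == gold for c in top),
    }


def aggregate(per_question):
    """Average the per-question hit indicators into a hit rate per granularity."""
    n = len(per_question)
    return {
        level: sum(q[level] for q in per_question) / n
        for level in ("document", "page", "chunk")
    }


# Toy data: both questions find the right document, only one finds the right page.
gold = Chunk("10K-A", 3, "10K-A-p3-c1")
run_1 = [Chunk("10K-A", 1, "10K-A-p1-c0"), Chunk("10K-B", 3, "10K-B-p3-c0")]
run_2 = [Chunk("10K-A", 3, "10K-A-p3-c0")]

rates = aggregate([hit_levels(run_1, gold), hit_levels(run_2, gold)])
print(rates)
```

In this toy run the document hit rate exceeds the page hit rate, which in turn exceeds the chunk hit rate. A shortfall of this kind at the finer granularities, even when the correct document has been retrieved, is the pattern the abstract calls the retrieval gap.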