SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1446-1466, 2026.
Abstract
Recent advances in Large Language Models (LLMs) have enabled text-to-SQL models that allow clinicians to query structured data stored in Electronic Health Records (EHRs) using natural language. However, deploying these models in EHR question answering (QA) systems for safety-critical clinical environments remains challenging: incorrect SQL queries, whether caused by model errors or problematic user inputs, can undermine clinical decision-making and jeopardize patient care. While prior work has mainly focused on improving SQL generation accuracy or filtering questions before execution, no unified benchmark exists for evaluating independent post-hoc verification mechanisms (i.e., a component that inspects and validates the generated SQL before execution), which is crucial for safe deployment. To fill this gap, we introduce SCARE, a benchmark for evaluating methods that serve as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries. The benchmark comprises 4,200 triples of questions, candidate SQL queries, and expected model outputs, grounded in the MIMIC-III, MIMIC-IV, and eICU databases. It covers a diverse set of questions and corresponding candidate SQL queries generated by seven different text-to-SQL models, ensuring a realistic and challenging evaluation. Using SCARE, we benchmark a range of approaches, from two-stage methods to agentic frameworks. Our experiments reveal a critical trade-off between question answerability classification and SQL error correction, highlighting key challenges and outlining directions for future research.
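To make the joint task concrete, the sketch below shows one way a SCARE-style triple and a post-hoc verification layer could be represented. It is an illustration only: the field names (`question`, `candidate_sql`, `answerability`, `gold_sql`), the label set, and the `post_hoc_verify` interface are assumptions inferred from the abstract, not the released benchmark schema or the authors' implementation.

```python
# Illustrative sketch only: field names, label strings, and the verifier interface
# are assumptions based on the abstract, not the actual SCARE data format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScareExample:
    question: str            # natural-language clinical question
    candidate_sql: str       # SQL produced by an upstream text-to-SQL model
    answerability: str       # "answerable" | "ambiguous" | "unanswerable"
    gold_sql: Optional[str]  # expected SQL output (None when the question is not answerable)

def post_hoc_verify(example: ScareExample) -> tuple[str, Optional[str]]:
    """Placeholder post-hoc safety layer: classify answerability and decide
    whether to pass through, correct, or withhold the candidate SQL.
    A real method would inspect the question and SQL rather than read labels."""
    if example.answerability != "answerable":
        return example.answerability, None       # abstain: never execute SQL for such questions
    return "answerable", example.candidate_sql   # here: pass the query through unchanged

# Toy usage with a MIMIC-style question (values are made up).
ex = ScareExample(
    question="How many patients were prescribed aspirin in 2105?",
    candidate_sql="SELECT COUNT(DISTINCT subject_id) FROM prescriptions WHERE drug = 'Aspirin'",
    answerability="answerable",
    gold_sql=None,
)
print(post_hoc_verify(ex))
```

Benchmarked methods would replace the placeholder logic with an actual classifier and SQL verifier/corrector; the point of the sketch is only the shape of the input triple and the two-part output the task requires.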