SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering

Gyubok Lee, Woosog Chay, Edward Choi
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1446-1466, 2026.

Abstract

Recent advances in Large Language Models (LLMs) have enabled the development of text-to-SQL models that allow clinicians to query structured data stored in Electronic Health Records (EHRs) using natural language. However, deploying these models for EHR question answering (QA) systems in safety-critical clinical environments remains challenging: incorrect SQL queries—whether caused by model errors or problematic user inputs—can undermine clinical decision-making and jeopardize patient care. While prior work has mainly focused on improving SQL generation accuracy or filtering questions before execution, there is a lack of a unified benchmark for evaluating independent post-hoc verification mechanisms (i.e., a component that inspects and validates the generated SQL before execution), which is crucial for safe deployment. To fill this gap, we introduce SCARE, a benchmark for evaluating methods that function as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries. The benchmark comprises 4,200 triples of questions, candidate SQL queries, and expected model outputs, grounded in the MIMIC-III, MIMIC-IV, and eICU databases. It covers a diverse set of questions and corresponding candidate SQL queries generated by seven different text-to-SQL models, ensuring a realistic and challenging evaluation. Using SCARE, we benchmark a range of approaches—from two-stage methods to agentic frameworks. Our experiments reveal a critical trade-off between question classification and SQL error correction, highlighting key challenges and outlining directions for future research.
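The abstract describes each benchmark instance as a triple of a natural-language question, a candidate SQL query, and an expected model output, with questions labeled answerable, ambiguous, or unanswerable. As a rough illustration only (the paper, not this page, defines the released data format), one such triple might be represented as below; the class names, field names, and the example values are assumptions for illustration, not the benchmark's actual schema or data.

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class Answerability(Enum):
        """Answerability classes named in the abstract."""
        ANSWERABLE = "answerable"
        AMBIGUOUS = "ambiguous"
        UNANSWERABLE = "unanswerable"

    @dataclass
    class ScareInstance:
        """Hypothetical shape of one SCARE triple; field names are illustrative,
        not the benchmark's released schema."""
        question: str                       # clinician question in natural language
        candidate_sql: str                  # SQL produced by a text-to-SQL model
        answerability: Answerability        # gold answerability label
        expected_sql: Optional[str] = None  # verified/corrected SQL when answerable

    # Illustrative example; values are made up, not drawn from MIMIC-III/IV or eICU.
    example = ScareInstance(
        question="How many patients were admitted to the ICU in 2019?",
        candidate_sql="SELECT COUNT(*) FROM admissions WHERE year = 2019",
        answerability=Answerability.ANSWERABLE,
        expected_sql="SELECT COUNT(DISTINCT subject_id) FROM icustays "
                     "WHERE strftime('%Y', intime) = '2019'",
    )

Under this reading, a post-hoc safety layer receives the question and candidate SQL, predicts the answerability label, and either verifies or corrects the SQL when the question is answerable.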

Cite this Paper


BibTeX
@InProceedings{pmlr-v297-lee26b,
  title     = {{SCARE}: A Benchmark for {SQL} Correction and Question Answerability Classification for Reliable {EHR} Question Answering},
  author    = {Lee, Gyubok and Chay, Woosog and Choi, Edward},
  booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages     = {1446--1466},
  year      = {2026},
  editor    = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume    = {297},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/lee26b/lee26b.pdf},
  url       = {https://proceedings.mlr.press/v297/lee26b.html},
  abstract  = {Recent advances in Large Language Models ({LLM}s) have enabled the development of text-to-{SQL} models that allow clinicians to query structured data stored in Electronic Health Records ({EHR}s) using natural language. However, deploying these models for {EHR} question answering ({QA}) systems in safety-critical clinical environments remains challenging: incorrect {SQL} queries—whether caused by model errors or problematic user inputs—can undermine clinical decision-making and jeopardize patient care. While prior work has mainly focused on improving {SQL} generation accuracy or filtering questions before execution, there is a lack of a unified benchmark for evaluating independent post-hoc verification mechanisms (i.e., a component that inspects and validates the generated {SQL} before execution), which is crucial for safe deployment. To fill this gap, we introduce {SCARE}, a benchmark for evaluating methods that function as a post-hoc safety layer in {EHR} {QA} systems. {SCARE} evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate {SQL} queries. The benchmark comprises 4,200 triples of questions, candidate {SQL} queries, and expected model outputs, grounded in the {MIMIC-III}, {MIMIC-IV}, and {eICU} databases. It covers a diverse set of questions and corresponding candidate {SQL} queries generated by seven different text-to-{SQL} models, ensuring a realistic and challenging evaluation. Using {SCARE}, we benchmark a range of approaches—from two-stage methods to agentic frameworks. Our experiments reveal a critical trade-off between question classification and {SQL} error correction, highlighting key challenges and outlining directions for future research.}
}
Endnote
%0 Conference Paper
%T SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering
%A Gyubok Lee
%A Woosog Chay
%A Edward Choi
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-lee26b
%I PMLR
%P 1446--1466
%U https://proceedings.mlr.press/v297/lee26b.html
%V 297
%X Recent advances in Large Language Models (LLMs) have enabled the development of text-to-SQL models that allow clinicians to query structured data stored in Electronic Health Records (EHRs) using natural language. However, deploying these models for EHR question answering (QA) systems in safety-critical clinical environments remains challenging: incorrect SQL queries—whether caused by model errors or problematic user inputs—can undermine clinical decision-making and jeopardize patient care. While prior work has mainly focused on improving SQL generation accuracy or filtering questions before execution, there is a lack of a unified benchmark for evaluating independent post-hoc verification mechanisms (i.e., a component that inspects and validates the generated SQL before execution), which is crucial for safe deployment. To fill this gap, we introduce SCARE, a benchmark for evaluating methods that function as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries. The benchmark comprises 4,200 triples of questions, candidate SQL queries, and expected model outputs, grounded in the MIMIC-III, MIMIC-IV, and eICU databases. It covers a diverse set of questions and corresponding candidate SQL queries generated by seven different text-to-SQL models, ensuring a realistic and challenging evaluation. Using SCARE, we benchmark a range of approaches—from two-stage methods to agentic frameworks. Our experiments reveal a critical trade-off between question classification and SQL error correction, highlighting key challenges and outlining directions for future research.
APA
Lee, G., Chay, W. & Choi, E. (2026). SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:1446-1466. Available from https://proceedings.mlr.press/v297/lee26b.html.