DataFrame QA: A Universal LLM Framework on DataFrame Question Answering Without Data Exposure

Junyi Ye, Mengnan Du, Guiling Wang
Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:575-590, 2025.

Abstract

This paper introduces DataFrame question answering (QA), a novel task that utilizes natural language processing (NLP) models to generate Pandas queries for information retrieval and data analysis on dataframes, emphasizing safe and non-revealing data handling. Specifically, our method, which leverages a large language model (LLM) and relies solely on dataframe column names, not only ensures data privacy but also significantly reduces the context window in the prompt, streamlining information processing and addressing major challenges in LLM-based data analysis. We propose DataFrame QA as a comprehensive framework that includes safe Pandas query generation and code execution. Various LLMs are evaluated on the renowned WikiSQL dataset and our newly developed UCI-DataFrameQA, tailored for complex data analysis queries. Our findings indicate that GPT-4 performs well on both datasets, underscoring its capability in securely retrieving and aggregating dataframe values and conducting sophisticated data analyses. This approach, deployable in a zero-shot manner without prior training or adjustments, proves to be highly adaptable and secure for diverse applications. Our code and dataset are available at https://github.com/JunyiYe/dataframe-qa.
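The abstract describes the core loop: prompt an LLM with only the dataframe's column names, receive a Pandas expression, and execute it in a restricted environment. The sketch below illustrates that loop under stated assumptions; the prompt wording, the `build_prompt`/`run_query_safely` helpers, and the namespace restriction are illustrative placeholders, not the authors' implementation (see the linked repository for the actual code).

```python
# Minimal sketch of the DataFrame QA idea: only column names reach the LLM,
# and the generated Pandas expression runs locally. Helper names, prompt text,
# and sandbox policy are assumptions for illustration only.
import pandas as pd

def build_prompt(question: str, columns: list[str]) -> str:
    # Only column names are shared with the model; no cell values leave the
    # local environment, which is the privacy property emphasized above.
    return (
        "You are given a Pandas DataFrame `df` with columns: "
        f"{', '.join(columns)}.\n"
        f"Write a single Pandas expression that answers: {question}\n"
        "Return only the expression, no explanation."
    )

def run_query_safely(df: pd.DataFrame, query: str):
    # Evaluate the generated expression in a namespace that exposes only `df`
    # and pandas -- a simplified stand-in for the framework's safe execution step.
    allowed = {"df": df, "pd": pd, "__builtins__": {}}
    return eval(query, allowed)

if __name__ == "__main__":
    df = pd.DataFrame({"city": ["Oslo", "Lima"], "population": [700_000, 9_700_000]})
    prompt = build_prompt("Which city has the largest population?", list(df.columns))
    # `call_llm(prompt)` stands in for whatever chat-completion client is used;
    # the string below is an example of what a model might return.
    query = "df.loc[df['population'].idxmax(), 'city']"
    print(run_query_safely(df, query))  # -> Lima
```

A zero-shot deployment, as described in the abstract, would simply swap the hard-coded `query` for the model's response, with no fine-tuning or schema-specific adjustments.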

Cite this Paper


BibTeX
@InProceedings{pmlr-v260-ye25a,
  title     = {{DataFrame QA}: {A} Universal LLM Framework on DataFrame Question Answering Without Data Exposure},
  author    = {Ye, Junyi and Du, Mengnan and Wang, Guiling},
  booktitle = {Proceedings of the 16th Asian Conference on Machine Learning},
  pages     = {575--590},
  year      = {2025},
  editor    = {Nguyen, Vu and Lin, Hsuan-Tien},
  volume    = {260},
  series    = {Proceedings of Machine Learning Research},
  month     = {05--08 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v260/main/assets/ye25a/ye25a.pdf},
  url       = {https://proceedings.mlr.press/v260/ye25a.html},
  abstract  = {This paper introduces DataFrame question answering (QA), a novel task that utilizes natural language processing (NLP) models to generate Pandas queries for information retrieval and data analysis on dataframes, emphasizing safe and non-revealing data handling. Specifically, our method, leveraging large language model (LLM), which solely relies on dataframe column names, not only ensures data privacy but also significantly reduces the context window in the prompt, streamlining information processing and addressing major challenges in LLM-based data analysis. We propose DataFrame QA as a comprehensive framework that includes safe Pandas query generation and code execution. Various LLMs are evaluated on the renowned WikiSQL dataset and our newly developed UCI-DataFrameQA, tailored for complex data analysis queries. Our findings indicate that GPT-4 performs well on both datasets, underscoring its capability in securely retrieving and aggregating dataframe values and conducting sophisticated data analyses. This approach, deployable in a zero-shot manner without prior training or adjustments, proves to be highly adaptable and secure for diverse applications. Our code and dataset are available at https://github.com/JunyiYe/dataframe-qa.}
}
Endnote
%0 Conference Paper
%T DataFrame QA: A Universal LLM Framework on DataFrame Question Answering Without Data Exposure
%A Junyi Ye
%A Mengnan Du
%A Guiling Wang
%B Proceedings of the 16th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Vu Nguyen
%E Hsuan-Tien Lin
%F pmlr-v260-ye25a
%I PMLR
%P 575--590
%U https://proceedings.mlr.press/v260/ye25a.html
%V 260
%X This paper introduces DataFrame question answering (QA), a novel task that utilizes natural language processing (NLP) models to generate Pandas queries for information retrieval and data analysis on dataframes, emphasizing safe and non-revealing data handling. Specifically, our method, leveraging large language model (LLM), which solely relies on dataframe column names, not only ensures data privacy but also significantly reduces the context window in the prompt, streamlining information processing and addressing major challenges in LLM-based data analysis. We propose DataFrame QA as a comprehensive framework that includes safe Pandas query generation and code execution. Various LLMs are evaluated on the renowned WikiSQL dataset and our newly developed UCI-DataFrameQA, tailored for complex data analysis queries. Our findings indicate that GPT-4 performs well on both datasets, underscoring its capability in securely retrieving and aggregating dataframe values and conducting sophisticated data analyses. This approach, deployable in a zero-shot manner without prior training or adjustments, proves to be highly adaptable and secure for diverse applications. Our code and dataset are available at https://github.com/JunyiYe/dataframe-qa.
APA
Ye, J., Du, M. & Wang, G. (2025). DataFrame QA: A Universal LLM Framework on DataFrame Question Answering Without Data Exposure. Proceedings of the 16th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 260:575-590. Available from https://proceedings.mlr.press/v260/ye25a.html.
