InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, Fei Wu
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:19544-19572, 2024.

Abstract

In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. Agents must solve these tasks end-to-end by interacting with an execution environment. The benchmark comprises DAEval, a dataset of 603 data analysis questions derived from 124 CSV files, and an agent framework that incorporates LLMs to serve as data analysis agents for both task solving and evaluation. Since data analysis questions are often open-ended and hard to evaluate without human supervision, we adopt a format-prompting technique that converts each question into a closed-form format so that it can be evaluated automatically. Our extensive benchmarking of 34 LLMs uncovers the challenges current models face on data analysis tasks. In addition, building on our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench. Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent.
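As a concrete illustration of the format-prompting idea, the sketch below shows how an open-ended question can be paired with a format requirement and then graded automatically by parsing the agent's reply, removing the need for human or LLM judges. This is a minimal Python sketch based only on the description above: the @answer_name[value] tag format, the regex, and the helper names are assumptions for illustration, not the released InfiAgent toolkit.

import re

# Hypothetical format requirement appended to an open-ended question so the
# answer becomes closed-form and machine-checkable (illustrative only; the
# exact prompt used by the paper may differ).
FORMAT_PROMPT = (
    "Report each requested quantity as @answer_name[value], "
    "e.g. @mean_age[29.70], rounded to two decimal places."
)

# Extract @name[value] tags from the agent's final response.
ANSWER_TAG = re.compile(r"@(\w+)\[([^\]]*)\]")

def extract_answers(response: str) -> dict:
    """Parse all @name[value] tags out of an agent response."""
    return {name: value.strip() for name, value in ANSWER_TAG.findall(response)}

def grade(response: str, gold: dict, tol: float = 1e-6) -> bool:
    """Closed-form check: every gold answer must be present and match,
    numerically within a tolerance when both sides parse as floats."""
    predicted = extract_answers(response)
    for name, gold_value in gold.items():
        if name not in predicted:
            return False
        try:
            if abs(float(predicted[name]) - float(gold_value)) > tol:
                return False
        except ValueError:
            if predicted[name] != gold_value:
                return False
    return True

# Usage: an agent that ran pandas code on the CSV might end its reply with
# "... so the mean age is @mean_age[29.70]."
assert grade("the mean age is @mean_age[29.70]", {"mean_age": "29.7"})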

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-hu24s,
  title     = {{I}nfi{A}gent-{DAB}ench: Evaluating Agents on Data Analysis Tasks},
  author    = {Hu, Xueyu and Zhao, Ziyu and Wei, Shuang and Chai, Ziwei and Ma, Qianli and Wang, Guoyin and Wang, Xuwu and Su, Jing and Xu, Jingjing and Zhu, Ming and Cheng, Yao and Yuan, Jianbo and Li, Jiwei and Kuang, Kun and Yang, Yang and Yang, Hongxia and Wu, Fei},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {19544--19572},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/hu24s/hu24s.pdf},
  url       = {https://proceedings.mlr.press/v235/hu24s.html}
}
Endnote
%0 Conference Paper
%T InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
%A Xueyu Hu
%A Ziyu Zhao
%A Shuang Wei
%A Ziwei Chai
%A Qianli Ma
%A Guoyin Wang
%A Xuwu Wang
%A Jing Su
%A Jingjing Xu
%A Ming Zhu
%A Yao Cheng
%A Jianbo Yuan
%A Jiwei Li
%A Kun Kuang
%A Yang Yang
%A Hongxia Yang
%A Fei Wu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-hu24s
%I PMLR
%P 19544--19572
%U https://proceedings.mlr.press/v235/hu24s.html
%V 235
APA
Hu, X., Zhao, Z., Wei, S., Chai, Z., Ma, Q., Wang, G., Wang, X., Su, J., Xu, J., Zhu, M., Cheng, Y., Yuan, J., Li, J., Kuang, K., Yang, Y., Yang, H. & Wu, F. (2024). InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:19544-19572. Available from https://proceedings.mlr.press/v235/hu24s.html.