AAAR-1.0: Assessing AI’s Potential to Assist Research

Renze Lou, Hanzi Xu, Sijia Wang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Yuxuan Sun, Yusen Zhang, Jihyun Janice Ahn, Hongchao Fang, Zhuoyang Zou, Wenchao Ma, Xi Li, Kai Zhang, Congying Xia, Lifu Huang, Wenpeng Yin
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:40361-40383, 2025.

Abstract

Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance on three fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; and (iii) PaperWeakness, identifying weaknesses in paper submissions. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals both their potential and their limitations in conducting sophisticated research tasks. We will release AAAR-1.0 and keep iterating it into new versions.
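
To make the EquationInference setup concrete, below is a minimal evaluation sketch framing it as a multiple-choice task: given paper context, the model picks the correct equation among candidates and is scored by accuracy. The instance schema ({context, options, label}) and the query_model stub are illustrative assumptions, not the released AAAR-1.0 format or official harness; in practice the stub would be replaced by a real LLM API call over the benchmark's instances.

```python
# Minimal sketch of scoring a model on an EquationInference-style task.
# The schema and prompt below are assumptions for illustration only.
import random

def query_model(prompt: str, num_options: int) -> int:
    """Placeholder for a real LLM call; returns a random option index
    as a trivial baseline so the script runs end to end."""
    return random.randrange(num_options)

def evaluate(instances: list[dict]) -> float:
    """Multiple-choice accuracy over EquationInference-style instances."""
    correct = 0
    for ex in instances:
        prompt = (
            "Given the paper context, choose the correct equation.\n\n"
            f"Context:\n{ex['context']}\n\nCandidates:\n"
            + "\n".join(f"({i}) {eq}" for i, eq in enumerate(ex["options"]))
        )
        pred = query_model(prompt, len(ex["options"]))
        correct += int(pred == ex["label"])
    return correct / len(instances)

if __name__ == "__main__":
    # Two toy instances in the assumed {context, options, label} schema.
    toy = [
        {"context": "We define the loss as the mean squared error ...",
         "options": [r"L = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2",
                     r"L = \sum_i |y_i - \hat{y}_i|"],
         "label": 0},
        {"context": "Attention weights are normalized with a softmax ...",
         "options": [r"\alpha_i = \frac{e^{s_i}}{\sum_j e^{s_j}}",
                     r"\alpha_i = s_i / \max_j s_j"],
         "label": 0},
    ]
    print(f"accuracy: {evaluate(toy):.2f}")
```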

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-lou25c,
  title     = {{AAAR}-1.0: Assessing {AI}’s Potential to Assist Research},
  author    = {Lou, Renze and Xu, Hanzi and Wang, Sijia and Du, Jiangshu and Kamoi, Ryo and Lu, Xiaoxin and Xie, Jian and Sun, Yuxuan and Zhang, Yusen and Ahn, Jihyun Janice and Fang, Hongchao and Zou, Zhuoyang and Ma, Wenchao and Li, Xi and Zhang, Kai and Xia, Congying and Huang, Lifu and Yin, Wenpeng},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {40361--40383},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/lou25c/lou25c.pdf},
  url       = {https://proceedings.mlr.press/v267/lou25c.html},
  abstract  = {Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance on three fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; and (iii) PaperWeakness, identifying weaknesses in paper submissions. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals both their potential and their limitations in conducting sophisticated research tasks. We will release AAAR-1.0 and keep iterating it into new versions.}
}
Endnote
%0 Conference Paper
%T AAAR-1.0: Assessing AI’s Potential to Assist Research
%A Renze Lou
%A Hanzi Xu
%A Sijia Wang
%A Jiangshu Du
%A Ryo Kamoi
%A Xiaoxin Lu
%A Jian Xie
%A Yuxuan Sun
%A Yusen Zhang
%A Jihyun Janice Ahn
%A Hongchao Fang
%A Zhuoyang Zou
%A Wenchao Ma
%A Xi Li
%A Kai Zhang
%A Congying Xia
%A Lifu Huang
%A Wenpeng Yin
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-lou25c
%I PMLR
%P 40361--40383
%U https://proceedings.mlr.press/v267/lou25c.html
%V 267
%X Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance on three fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; and (iii) PaperWeakness, identifying weaknesses in paper submissions. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals both their potential and their limitations in conducting sophisticated research tasks. We will release AAAR-1.0 and keep iterating it into new versions.
APA
Lou, R., Xu, H., Wang, S., Du, J., Kamoi, R., Lu, X., Xie, J., Sun, Y., Zhang, Y., Ahn, J.J., Fang, H., Zou, Z., Ma, W., Li, X., Zhang, K., Xia, C., Huang, L. & Yin, W. (2025). AAAR-1.0: Assessing AI’s Potential to Assist Research. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:40361-40383. Available from https://proceedings.mlr.press/v267/lou25c.html.
