ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks

Saurabh Jha, Rohan R. Arora, Yuji Watanabe, Takumi Yanagawa, Yinfang Chen, Jackson Clark, Bhavya Bhavya, Mudit Verma, Harshit Kumar, Hirokuni Kitahara, Noah Zheutlin, Saki Takano, Divya Pathak, Felix George, Xinbo Wu, Bekir O Turkkan, Gerard Vanloo, Michael Nidd, Ting Dai, Oishik Chatterjee, Pranjal Gupta, Suranjana Samanta, Pooja Aggarwal, Rong Lee, Jae-Wook Ahn, Debanjana Kar, Amit Paradkar, Yu Deng, Pratibha Moogi, Prateeti Mohapatra, Naoki Abe, Chandrasekhar Narayanaswami, Tianyin Xu, Lav R. Varshney, Ruchi Mahindru, Anca Sailer, Laura Shwartz, Daby Sow, Nicholas C. M. Fuller, Ruchir Puri
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:27134-27197, 2025.

Abstract

Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand the effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents on real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 102 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 11.4% of SRE scenarios, 25.2% of CISO scenarios, and 25.8% of FinOps scenarios (excluding anomaly detection). For FinOps-specific anomaly detection (AD) scenarios, AI agents achieve an F1 score of 0.35. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast. ITBench, along with a leaderboard and sample agent implementations, is available at https://github.com/ibm/itbench.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-jha25a,
  title     = {{ITB}ench: Evaluating {AI} Agents across Diverse Real-World {IT} Automation Tasks},
  author    = {Jha, Saurabh and Arora, Rohan R. and Watanabe, Yuji and Yanagawa, Takumi and Chen, Yinfang and Clark, Jackson and Bhavya, Bhavya and Verma, Mudit and Kumar, Harshit and Kitahara, Hirokuni and Zheutlin, Noah and Takano, Saki and Pathak, Divya and George, Felix and Wu, Xinbo and Turkkan, Bekir O and Vanloo, Gerard and Nidd, Michael and Dai, Ting and Chatterjee, Oishik and Gupta, Pranjal and Samanta, Suranjana and Aggarwal, Pooja and Lee, Rong and Ahn, Jae-Wook and Kar, Debanjana and Paradkar, Amit and Deng, Yu and Moogi, Pratibha and Mohapatra, Prateeti and Abe, Naoki and Narayanaswami, Chandrasekhar and Xu, Tianyin and Varshney, Lav R. and Mahindru, Ruchi and Sailer, Anca and Shwartz, Laura and Sow, Daby and Fuller, Nicholas C. M. and Puri, Ruchir},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {27134--27197},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/jha25a/jha25a.pdf},
  url       = {https://proceedings.mlr.press/v267/jha25a.html},
  abstract  = {Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand the effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents on real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 102 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 11.4% of SRE scenarios, 25.2% of CISO scenarios, and 25.8% of FinOps scenarios (excluding anomaly detection). For FinOps-specific anomaly detection (AD) scenarios, AI agents achieve an F1 score of 0.35. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast. ITBench, along with a leaderboard and sample agent implementations, is available at https://github.com/ibm/itbench.}
}
Endnote
%0 Conference Paper
%T ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks
%A Saurabh Jha
%A Rohan R. Arora
%A Yuji Watanabe
%A Takumi Yanagawa
%A Yinfang Chen
%A Jackson Clark
%A Bhavya Bhavya
%A Mudit Verma
%A Harshit Kumar
%A Hirokuni Kitahara
%A Noah Zheutlin
%A Saki Takano
%A Divya Pathak
%A Felix George
%A Xinbo Wu
%A Bekir O Turkkan
%A Gerard Vanloo
%A Michael Nidd
%A Ting Dai
%A Oishik Chatterjee
%A Pranjal Gupta
%A Suranjana Samanta
%A Pooja Aggarwal
%A Rong Lee
%A Jae-Wook Ahn
%A Debanjana Kar
%A Amit Paradkar
%A Yu Deng
%A Pratibha Moogi
%A Prateeti Mohapatra
%A Naoki Abe
%A Chandrasekhar Narayanaswami
%A Tianyin Xu
%A Lav R. Varshney
%A Ruchi Mahindru
%A Anca Sailer
%A Laura Shwartz
%A Daby Sow
%A Nicholas C. M. Fuller
%A Ruchir Puri
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-jha25a
%I PMLR
%P 27134--27197
%U https://proceedings.mlr.press/v267/jha25a.html
%V 267
%X Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand the effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents on real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 102 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 11.4% of SRE scenarios, 25.2% of CISO scenarios, and 25.8% of FinOps scenarios (excluding anomaly detection). For FinOps-specific anomaly detection (AD) scenarios, AI agents achieve an F1 score of 0.35. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast. ITBench, along with a leaderboard and sample agent implementations, is available at https://github.com/ibm/itbench.
APA
Jha, S., Arora, R.R., Watanabe, Y., Yanagawa, T., Chen, Y., Clark, J., Bhavya, B., Verma, M., Kumar, H., Kitahara, H., Zheutlin, N., Takano, S., Pathak, D., George, F., Wu, X., Turkkan, B.O., Vanloo, G., Nidd, M., Dai, T., Chatterjee, O., Gupta, P., Samanta, S., Aggarwal, P., Lee, R., Ahn, J., Kar, D., Paradkar, A., Deng, Y., Moogi, P., Mohapatra, P., Abe, N., Narayanaswami, C., Xu, T., Varshney, L.R., Mahindru, R., Sailer, A., Shwartz, L., Sow, D., Fuller, N.C.M., & Puri, R. (2025). ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:27134-27197. Available from https://proceedings.mlr.press/v267/jha25a.html.
