WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, Alexandre Lacoste
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:11642-11662, 2024.

Abstract

We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents’ ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-drouin24a,
  title = {{W}ork{A}rena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?},
  author = {Drouin, Alexandre and Gasse, Maxime and Caccia, Massimo and Laradji, Issam H. and Del Verme, Manuel and Marty, Tom and Vazquez, David and Chapados, Nicolas and Lacoste, Alexandre},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {11642--11662},
  year = {2024},
  editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  month = {21--27 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/drouin24a/drouin24a.pdf},
  url = {https://proceedings.mlr.press/v235/drouin24a.html},
  abstract = {We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents’ ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.}
}
Endnote
%0 Conference Paper
%T WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?
%A Alexandre Drouin
%A Maxime Gasse
%A Massimo Caccia
%A Issam H. Laradji
%A Manuel Del Verme
%A Tom Marty
%A David Vazquez
%A Nicolas Chapados
%A Alexandre Lacoste
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-drouin24a
%I PMLR
%P 11642--11662
%U https://proceedings.mlr.press/v235/drouin24a.html
%V 235
%X We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents’ ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.
APA
Drouin, A., Gasse, M., Caccia, M., Laradji, I. H., Del Verme, M., Marty, T., Vazquez, D., Chapados, N., & Lacoste, A. (2024). WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks? Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:11642-11662. Available from https://proceedings.mlr.press/v235/drouin24a.html.
