Copilot Arena: A Platform for Code LLM Evaluation in the Wild

Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:10354-10382, 2025.

Abstract

Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no existing solution. We introduce Copilot Arena, a platform to collect user preferences through native integration into a developer’s working environment. Copilot Arena comprises a novel interface for comparing pairs of model outputs, a sampling strategy to reduce experienced latency, and a prompting scheme to enable code completion functionality. Copilot Arena has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements. Our results highlight the importance of model evaluations in integrated settings. We find that model rankings from Copilot Arena differ from those of existing evaluations, which we attribute to the unique distribution of data and tasks contained in Copilot Arena. We also identify novel insights into human preferences on code such as an observed consistency in user preference across programming languages yet significant variation in preference due to task category. We open-source Copilot Arena and release data to enable human-centric evaluations and improve understanding of coding assistants.
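The model rankings mentioned above are derived from the collected pairwise judgements. As a rough illustration of how arena-style evaluations commonly aggregate such votes (a minimal sketch, not necessarily the exact estimator used in this paper), the Python snippet below fits a Bradley-Terry model to a handful of made-up preference votes; the model names and vote data are purely hypothetical.

# Minimal sketch: aggregating pairwise preference votes into a model ranking
# with a Bradley-Terry model (illustration only; vote data is hypothetical).
from collections import defaultdict

# Hypothetical pairwise judgements: (model_a, model_b, winner).
votes = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-z", "model-z"),
    ("model-y", "model-z", "model-y"),
    ("model-x", "model-y", "model-x"),
]

models = sorted({m for a, b, _ in votes for m in (a, b)})
wins = defaultdict(float)  # wins[(i, j)] = number of times model i beat model j
for a, b, winner in votes:
    loser = b if winner == a else a
    wins[(winner, loser)] += 1.0

# Bradley-Terry strengths via the standard minorize-maximize (MM) updates:
#   p_i <- (total wins of i) / sum_{j != i} n_ij / (p_i + p_j)
# where n_ij is the number of head-to-head comparisons between i and j.
p = {m: 1.0 for m in models}
for _ in range(200):
    new_p = {}
    for i in models:
        total_wins = sum(wins[(i, j)] for j in models if j != i)
        denom = sum(
            (wins[(i, j)] + wins[(j, i)]) / (p[i] + p[j])
            for j in models
            if j != i
        )
        new_p[i] = total_wins / denom if denom > 0 else p[i]
    norm = sum(new_p.values())
    p = {m: v / norm for m, v in new_p.items()}

# Higher strength = preferred more often; report models in ranked order.
for rank, m in enumerate(sorted(models, key=p.get, reverse=True), start=1):
    print(f"{rank}. {m}  (Bradley-Terry strength {p[m]:.3f})")

In the actual platform, the rankings would also need to handle ties, per-user effects, and the latency-reducing sampling strategy described above; this sketch covers only the core pairwise-aggregation step.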

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chi25a,
  title     = {Copilot Arena: A Platform for Code {LLM} Evaluation in the Wild},
  author    = {Chi, Wayne and Chen, Valerie and Angelopoulos, Anastasios Nikolas and Chiang, Wei-Lin and Mittal, Aditya and Jain, Naman and Zhang, Tianjun and Stoica, Ion and Donahue, Chris and Talwalkar, Ameet},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {10354--10382},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chi25a/chi25a.pdf},
  url       = {https://proceedings.mlr.press/v267/chi25a.html}
}
Endnote
%0 Conference Paper
%T Copilot Arena: A Platform for Code LLM Evaluation in the Wild
%A Wayne Chi
%A Valerie Chen
%A Anastasios Nikolas Angelopoulos
%A Wei-Lin Chiang
%A Aditya Mittal
%A Naman Jain
%A Tianjun Zhang
%A Ion Stoica
%A Chris Donahue
%A Ameet Talwalkar
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chi25a
%I PMLR
%P 10354--10382
%U https://proceedings.mlr.press/v267/chi25a.html
%V 267
APA
Chi, W., Chen, V., Angelopoulos, A.N., Chiang, W., Mittal, A., Jain, N., Zhang, T., Stoica, I., Donahue, C. & Talwalkar, A. (2025). Copilot Arena: A Platform for Code LLM Evaluation in the Wild. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:10354-10382. Available from https://proceedings.mlr.press/v267/chi25a.html.