Unsupervised Evaluation of Code LLMs with Round-Trip Correctness

Miltiadis Allamanis, Sheena Panthaplackel, Pengcheng Yin
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:1050-1066, 2024.

Abstract

To evaluate code large language models (LLMs), research has relied on a few small, manually curated benchmarks, such as HumanEval and MBPP, which represent only a narrow slice of real-world software domains. In this work, we introduce round-trip correctness (RTC) as an alternative evaluation method. RTC allows code LLM evaluation on a broader spectrum of real-world software domains without the need for costly human curation. RTC rests on the idea that we can ask a model to make a prediction (e.g., describe some code using natural language), feed that prediction back (e.g., synthesize code from the predicted description), and check whether this round trip yields code that is semantically equivalent to the original input. We show how to employ RTC to evaluate code synthesis and editing. We find that RTC correlates strongly with model performance on existing narrow-domain code synthesis benchmarks while allowing us to expand evaluation to a much broader set of domains and tasks, which was not previously possible without costly human annotation.
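The round-trip check described above can be sketched in a few lines of Python. This is an illustrative outline only, not the paper's implementation: describe_code (a forward model mapping code to a description), synthesize_code (a backward model mapping a description back to code), and semantically_equivalent (e.g., whether the regenerated code passes the original snippet's unit tests) are hypothetical stand-ins supplied by the caller.

from typing import Callable, Iterable

def round_trip_correctness(
    snippets: Iterable[str],
    describe_code: Callable[[str], str],                  # forward model: code -> NL description (hypothetical)
    synthesize_code: Callable[[str], str],                # backward model: NL description -> code (hypothetical)
    semantically_equivalent: Callable[[str, str], bool],  # e.g., regenerated code passes the original's unit tests
    n_samples: int = 5,
) -> float:
    """Fraction of round trips that recover code semantically equivalent to the original."""
    successes, total = 0, 0
    for original in snippets:
        for _ in range(n_samples):
            description = describe_code(original)       # forward pass: code -> natural language
            regenerated = synthesize_code(description)  # backward pass: natural language -> code
            successes += semantically_equivalent(original, regenerated)
            total += 1
    return successes / total if total else 0.0

Because the equivalence check can reuse tests that already accompany the original code, no new human-written reference solutions or descriptions are required, which is what makes the evaluation unsupervised.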

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-allamanis24a,
  title     = {Unsupervised Evaluation of Code {LLM}s with Round-Trip Correctness},
  author    = {Allamanis, Miltiadis and Panthaplackel, Sheena and Yin, Pengcheng},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {1050--1066},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/allamanis24a/allamanis24a.pdf},
  url       = {https://proceedings.mlr.press/v235/allamanis24a.html},
  abstract  = {To evaluate code large language models (LLMs), research has relied on a few small manually curated benchmarks, such as HumanEval and MBPP, which represent a narrow part of the real-world software domains. In this work, we introduce round-trip correctness (RTC) as an alternative evaluation method. RTC allows Code LLM evaluation on a broader spectrum of real-world software domains without the need for costly human curation. RTC rests on the idea that we can ask a model to make a prediction (e.g., describe some code using natural language), feed that prediction back (e.g., synthesize code from the predicted description), and check if this round-trip leads to code that is semantically equivalent to the original input. We show how to employ RTC to evaluate code synthesis and editing. We find that RTC strongly correlates with model performance on existing narrow-domain code synthesis benchmarks while allowing us to expand to a much broader set of domains and tasks which was not previously possible without costly human annotations.}
}
Endnote
%0 Conference Paper
%T Unsupervised Evaluation of Code LLMs with Round-Trip Correctness
%A Miltiadis Allamanis
%A Sheena Panthaplackel
%A Pengcheng Yin
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-allamanis24a
%I PMLR
%P 1050--1066
%U https://proceedings.mlr.press/v235/allamanis24a.html
%V 235
%X To evaluate code large language models (LLMs), research has relied on a few small manually curated benchmarks, such as HumanEval and MBPP, which represent a narrow part of the real-world software domains. In this work, we introduce round-trip correctness (RTC) as an alternative evaluation method. RTC allows Code LLM evaluation on a broader spectrum of real-world software domains without the need for costly human curation. RTC rests on the idea that we can ask a model to make a prediction (e.g., describe some code using natural language), feed that prediction back (e.g., synthesize code from the predicted description), and check if this round-trip leads to code that is semantically equivalent to the original input. We show how to employ RTC to evaluate code synthesis and editing. We find that RTC strongly correlates with model performance on existing narrow-domain code synthesis benchmarks while allowing us to expand to a much broader set of domains and tasks which was not previously possible without costly human annotations.
APA
Allamanis, M., Panthaplackel, S., & Yin, P. (2024). Unsupervised Evaluation of Code LLMs with Round-Trip Correctness. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:1050-1066. Available from https://proceedings.mlr.press/v235/allamanis24a.html.