CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Alex Gu, Baptiste Roziere, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida Wang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:16568-16621, 2024.

Abstract

We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a general recipe for generating our execution benchmark by sampling from a model, which can be used to create more challenging versions of the benchmark if needed. Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval show no improvements on our benchmark. Third, we show that simple chain-of-thought (CoT) and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with CoT, achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction. When it comes to reasoning about code, GPT-4 has a huge edge over other models but still fails consistently on some surprisingly simple Python programs.
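As a concrete illustration of the two tasks, consider the following hypothetical item written in the style of the benchmark (the function and its input-output pair below are illustrative sketches, not drawn from CRUXEval). For output prediction, a model is shown the function and the input and must complete the right-hand side of the assertion; for input prediction, it is shown the function and the output and must supply any input that makes the assertion pass.

# Hypothetical CRUXEval-style item (illustrative only, not from the benchmark).
def f(s):
    # A few lines of simple string manipulation, in the spirit of the
    # short Python functions the benchmark contains.
    parts = s.split(",")
    return "-".join(p.strip() for p in parts)

# Output prediction: given f and the input "a, b,c", predict the output.
assert f("a, b,c") == "a-b-c"

# Input prediction: given f and the output "x-y", predict any consistent input.
assert f("x, y") == "x-y"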

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-gu24c,
  title     = {{CRUXE}val: A Benchmark for Code Reasoning, Understanding and Execution},
  author    = {Gu, Alex and Roziere, Baptiste and Leather, Hugh James and Solar-Lezama, Armando and Synnaeve, Gabriel and Wang, Sida},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {16568--16621},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/gu24c/gu24c.pdf},
  url       = {https://proceedings.mlr.press/v235/gu24c.html}
}
Endnote
%0 Conference Paper
%T CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
%A Alex Gu
%A Baptiste Roziere
%A Hugh James Leather
%A Armando Solar-Lezama
%A Gabriel Synnaeve
%A Sida Wang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-gu24c
%I PMLR
%P 16568--16621
%U https://proceedings.mlr.press/v235/gu24c.html
%V 235
APA
Gu, A., Roziere, B., Leather, H.J., Solar-Lezama, A., Synnaeve, G. & Wang, S. (2024). CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:16568-16621. Available from https://proceedings.mlr.press/v235/gu24c.html.
