AuPair: Golden Example Pairs for Code Repair

Aditi Mavalankar, Hassan Mansoor, Zita Marinho, Mariia Samsikova, Tom Schaul
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:43276-43301, 2025.

Abstract

Scaling up inference-time compute has proven to be a valuable strategy in improving the performance of Large Language Models (LLMs) without fine-tuning. An important task that can benefit from additional inference-time compute is self-repair; given an initial flawed response or guess, the LLM corrects its own mistake and produces an improved response or fix. We leverage the in-context learning ability of LLMs to perform self-repair in the coding domain. The key contribution of our paper is an approach that synthesises and selects an ordered set of golden example pairs, or AuPairs, of these initial guesses and subsequent fixes for the corresponding problems. Each such AuPair is provided as a single in-context example at inference time to generate a repaired solution. For an inference-time compute budget of $N$ LLM calls per problem, $N$ AuPairs are used to generate $N$ repaired solutions, out of which the highest-scoring solution is the final answer. The underlying intuition is that if the LLM is given a different example of fixing an incorrect guess each time, it can subsequently generate a diverse set of repaired solutions. Our algorithm selects these AuPairs in a manner that maximises complementarity and usefulness. We demonstrate the results of our algorithm on 5 LLMs across 7 competitive programming datasets for the code repair task. Our algorithm yields a significant boost in performance compared to best-of-$N$ and self-repair, and also exhibits strong generalisation across datasets and models. Moreover, our approach shows stronger scaling with inference-time compute budget compared to baselines.
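The inference procedure described above (one repair call per AuPair, then best-of-$N$ selection) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation; the names `AuPair`, `call_llm`, and `score` are hypothetical stand-ins for the LLM interface and the solution-scoring function (e.g. fraction of unit tests passed).

```python
# Hypothetical sketch of the AuPair best-of-N repair loop from the abstract.
# `call_llm` and `score` are assumed stand-ins, not a real API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AuPair:
    guess: str  # an initial flawed solution
    fix: str    # its repaired version


def repair_best_of_n(problem: str, broken: str, aupairs: List[AuPair],
                     call_llm: Callable[[str], str],
                     score: Callable[[str, str], float]) -> str:
    """Spend a budget of N = len(aupairs) LLM calls, one per AuPair,
    and return the highest-scoring repaired solution."""
    best, best_score = broken, score(problem, broken)
    for pair in aupairs:
        # Each call sees a *different* single in-context repair example,
        # which encourages a diverse set of candidate fixes.
        prompt = (f"Example broken code:\n{pair.guess}\n"
                  f"Example fix:\n{pair.fix}\n\n"
                  f"Problem:\n{problem}\n"
                  f"Broken code:\n{broken}\nFix:")
        candidate = call_llm(prompt)
        s = score(problem, candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best
```

Because each AuPair is used as the sole in-context example of its call, the $N$ candidate fixes are generated independently, so the loop parallelises trivially across the compute budget.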

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-mavalankar25a,
  title     = {{A}u{P}air: Golden Example Pairs for Code Repair},
  author    = {Mavalankar, Aditi and Mansoor, Hassan and Marinho, Zita and Samsikova, Mariia and Schaul, Tom},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {43276--43301},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/mavalankar25a/mavalankar25a.pdf},
  url       = {https://proceedings.mlr.press/v267/mavalankar25a.html},
  abstract  = {Scaling up inference-time compute has proven to be a valuable strategy in improving the performance of Large Language Models (LLMs) without fine-tuning. An important task that can benefit from additional inference-time compute is self-repair; given an initial flawed response or guess, the LLM corrects its own mistake and produces an improved response or fix. We leverage the in-context learning ability of LLMs to perform self-repair in the coding domain. The key contribution of our paper is an approach that synthesises and selects an ordered set of golden example pairs, or AuPairs, of these initial guesses and subsequent fixes for the corresponding problems. Each such AuPair is provided as a single in-context example at inference time to generate a repaired solution. For an inference-time compute budget of $N$ LLM calls per problem, $N$ AuPairs are used to generate $N$ repaired solutions, out of which the highest-scoring solution is the final answer. The underlying intuition is that if the LLM is given a different example of fixing an incorrect guess each time, it can subsequently generate a diverse set of repaired solutions. Our algorithm selects these AuPairs in a manner that maximises complementarity and usefulness. We demonstrate the results of our algorithm on 5 LLMs across 7 competitive programming datasets for the code repair task. Our algorithm yields a significant boost in performance compared to best-of-$N$ and self-repair, and also exhibits strong generalisation across datasets and models. Moreover, our approach shows stronger scaling with inference-time compute budget compared to baselines.}
}
Endnote
%0 Conference Paper
%T AuPair: Golden Example Pairs for Code Repair
%A Aditi Mavalankar
%A Hassan Mansoor
%A Zita Marinho
%A Mariia Samsikova
%A Tom Schaul
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-mavalankar25a
%I PMLR
%P 43276--43301
%U https://proceedings.mlr.press/v267/mavalankar25a.html
%V 267
%X Scaling up inference-time compute has proven to be a valuable strategy in improving the performance of Large Language Models (LLMs) without fine-tuning. An important task that can benefit from additional inference-time compute is self-repair; given an initial flawed response or guess, the LLM corrects its own mistake and produces an improved response or fix. We leverage the in-context learning ability of LLMs to perform self-repair in the coding domain. The key contribution of our paper is an approach that synthesises and selects an ordered set of golden example pairs, or AuPairs, of these initial guesses and subsequent fixes for the corresponding problems. Each such AuPair is provided as a single in-context example at inference time to generate a repaired solution. For an inference-time compute budget of $N$ LLM calls per problem, $N$ AuPairs are used to generate $N$ repaired solutions, out of which the highest-scoring solution is the final answer. The underlying intuition is that if the LLM is given a different example of fixing an incorrect guess each time, it can subsequently generate a diverse set of repaired solutions. Our algorithm selects these AuPairs in a manner that maximises complementarity and usefulness. We demonstrate the results of our algorithm on 5 LLMs across 7 competitive programming datasets for the code repair task. Our algorithm yields a significant boost in performance compared to best-of-$N$ and self-repair, and also exhibits strong generalisation across datasets and models. Moreover, our approach shows stronger scaling with inference-time compute budget compared to baselines.
APA
Mavalankar, A., Mansoor, H., Marinho, Z., Samsikova, M. & Schaul, T. (2025). AuPair: Golden Example Pairs for Code Repair. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:43276-43301. Available from https://proceedings.mlr.press/v267/mavalankar25a.html.