Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective

Daniel Franzen, Jan Disselhoff, David Hartmann
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:17657-17671, 2025.

Abstract

The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2ct per task on readily available hardware.
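The two ideas named in the abstract, a depth-first search that keeps only high-probability candidates and the reuse of the model as a scorer across several augmented views, can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_next_token_probs` is a made-up stand-in for an LLM, the function names `dfs_candidates` and `select_by_product_of_experts` are hypothetical, and each "expert" here is assumed to be the same model scoring a candidate under a different augmentation.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Tokens = Tuple[str, ...]


def dfs_candidates(
    next_token_probs: Callable[[Tokens], Dict[str, float]],
    eos: str,
    min_prob: float,
    max_len: int = 16,
) -> List[Tuple[Tokens, float]]:
    """Depth-first enumeration of finished sequences whose cumulative
    probability stays at or above min_prob (branches below it are pruned)."""
    threshold = math.log(min_prob)
    results: List[Tuple[Tokens, float]] = []

    def visit(prefix: Tokens, logp: float) -> None:
        if prefix and prefix[-1] == eos:
            results.append((prefix, logp))
            return
        if len(prefix) >= max_len:
            return
        for token, p in next_token_probs(prefix).items():
            if p <= 0.0:
                continue
            new_logp = logp + math.log(p)
            if new_logp >= threshold:  # prune low-probability branches early
                visit(prefix + (token,), new_logp)

    visit((), 0.0)
    return results


def select_by_product_of_experts(
    candidates: Sequence[Tokens],
    expert_logprobs: Sequence[Callable[[Tokens], float]],
) -> Tokens:
    """Pick the candidate with the highest product of expert probabilities,
    computed as a sum of log-probabilities."""
    return max(candidates, key=lambda c: sum(f(c) for f in expert_logprobs))


def toy_next_token_probs(prefix: Tokens) -> Dict[str, float]:
    """Hypothetical stand-in for an LLM: one content token, then end of sequence."""
    if len(prefix) == 0:
        return {"a": 0.6, "b": 0.4}
    return {"<eos>": 1.0}


# Assumption: each expert is the model's score under one augmented view.
experts = [
    lambda c: math.log({"a": 0.6, "b": 0.4}[c[0]]),
    lambda c: math.log({"a": 0.2, "b": 0.8}[c[0]]),
]

candidates = dfs_candidates(toy_next_token_probs, eos="<eos>", min_prob=0.1)
best = select_by_product_of_experts([c for c, _ in candidates], experts)
print(best)
```

In this toy setup the first view prefers `a` (0.6 vs. 0.4), but the product over both views favors `b` (0.32 vs. 0.12), which is the kind of disagreement that combining perspectives is meant to resolve.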

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-franzen25a,
  title = {Product of Experts with {LLM}s: Boosting Performance on {ARC} Is a Matter of Perspective},
  author = {Franzen, Daniel and Disselhoff, Jan and Hartmann, David},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {17657--17671},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/franzen25a/franzen25a.pdf},
  url = {https://proceedings.mlr.press/v267/franzen25a.html},
  abstract = {The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6\% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2ct per task on readily available hardware.}
}
Endnote
%0 Conference Paper
%T Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective
%A Daniel Franzen
%A Jan Disselhoff
%A David Hartmann
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-franzen25a
%I PMLR
%P 17657--17671
%U https://proceedings.mlr.press/v267/franzen25a.html
%V 267
%X The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2ct per task on readily available hardware.
APA
Franzen, D., Disselhoff, J. & Hartmann, D. (2025). Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:17657-17671. Available from https://proceedings.mlr.press/v267/franzen25a.html.