An Architecture Search Framework for Inference-Time Techniques

Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, Etash Kumar Guha, E. Kelly Buchanan, Mayee F Chen, Neel Guha, Christopher Re, Azalia Mirhoseini
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:52475-52507, 2025.

Abstract

Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large-language models (LLMs) at test time. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of each technique across models and tasks, the interactions between them, and the massive search space for combining them. To address these challenges, we introduce Archon, a modular and automated framework for optimizing the process of selecting and combining inference-time techniques and LLMs. Given a compute budget and a set of available LLMs, Archon explores a large design space to discover optimized configurations tailored to target benchmarks. It can design custom or general-purpose architectures that advance the Pareto frontier of accuracy vs. maximum token budget compared to top-performing baselines. Across instruction-following, reasoning, and coding tasks, we show that Archon can leverage additional inference compute budget to design systems that outperform frontier models such as OpenAI’s o1, GPT-4o, and Claude 3.5 Sonnet by an average of 15.1%.

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-saad-falcon25a,
  title = {An Architecture Search Framework for Inference-Time Techniques},
  author = {Saad-Falcon, Jon and Lafuente, Adrian Gamarra and Natarajan, Shlok and Maru, Nahum and Todorov, Hristo and Guha, Etash Kumar and Buchanan, E. Kelly and Chen, Mayee F and Guha, Neel and Re, Christopher and Mirhoseini, Azalia},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {52475--52507},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/saad-falcon25a/saad-falcon25a.pdf},
  url = {https://proceedings.mlr.press/v267/saad-falcon25a.html},
  abstract = {Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large-language models (LLMs) at test time. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of each technique across models and tasks, the interactions between them, and the massive search space for combining them. To address these challenges, we introduce Archon, a modular and automated framework for optimizing the process of selecting and combining inference-time techniques and LLMs. Given a compute budget and a set of available LLMs, Archon explores a large design space to discover optimized configurations tailored to target benchmarks. It can design custom or general-purpose architectures that advance the Pareto frontier of accuracy vs. maximum token budget compared to top-performing baselines. Across instruction-following, reasoning, and coding tasks, we show that Archon can leverage additional inference compute budget to design systems that outperform frontier models such as OpenAI’s o1, GPT-4o, and Claude 3.5 Sonnet by an average of 15.1%.}
}
Endnote
%0 Conference Paper
%T An Architecture Search Framework for Inference-Time Techniques
%A Jon Saad-Falcon
%A Adrian Gamarra Lafuente
%A Shlok Natarajan
%A Nahum Maru
%A Hristo Todorov
%A Etash Kumar Guha
%A E. Kelly Buchanan
%A Mayee F Chen
%A Neel Guha
%A Christopher Re
%A Azalia Mirhoseini
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-saad-falcon25a
%I PMLR
%P 52475--52507
%U https://proceedings.mlr.press/v267/saad-falcon25a.html
%V 267
%X Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large-language models (LLMs) at test time. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of each technique across models and tasks, the interactions between them, and the massive search space for combining them. To address these challenges, we introduce Archon, a modular and automated framework for optimizing the process of selecting and combining inference-time techniques and LLMs. Given a compute budget and a set of available LLMs, Archon explores a large design space to discover optimized configurations tailored to target benchmarks. It can design custom or general-purpose architectures that advance the Pareto frontier of accuracy vs. maximum token budget compared to top-performing baselines. Across instruction-following, reasoning, and coding tasks, we show that Archon can leverage additional inference compute budget to design systems that outperform frontier models such as OpenAI’s o1, GPT-4o, and Claude 3.5 Sonnet by an average of 15.1%.
APA
Saad-Falcon, J., Lafuente, A.G., Natarajan, S., Maru, N., Todorov, H., Guha, E.K., Buchanan, E.K., Chen, M.F., Guha, N., Re, C. & Mirhoseini, A. (2025). An Architecture Search Framework for Inference-Time Techniques. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:52475-52507. Available from https://proceedings.mlr.press/v267/saad-falcon25a.html.