BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute

Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Del Carmen Hipolito Garcia, Menglin Xia, Laks V. S. Lakshmanan, Qingyun Wu, Victor Rühle
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:13870-13884, 2025.

Abstract

Large language models (LLMs) are powerful tools but are often expensive to deploy at scale. LLM query routing mitigates this by dynamically assigning queries to models of varying cost and quality to obtain a desired tradeoff. Prior query routing approaches generate only one response from the selected model. Because a single response from a small (inexpensive) model is often not good enough to beat a response from a large (expensive) model, these approaches end up overusing the large model and missing out on potential cost savings. However, it is well known that for small models, generating multiple responses and selecting the best can enhance quality while remaining cheaper than a single large-model response. We leverage this idea to propose BEST-Route, a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than 1% performance drop.
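The routing idea described in the abstract can be illustrated with a minimal sketch, assuming a per-query difficulty estimator, a response quality scorer, and two model endpoints; this is not the authors' implementation, and every function name, threshold, and budget below is an illustrative assumption. The router samples several responses from the small model, keeps the best-scoring one, and falls back to the large model only if the quality threshold is not met.

```python
# Illustrative best-of-n routing sketch (not the paper's actual algorithm).
# Assumptions: `small_model` and `large_model` are callables returning a
# response string; `quality_score(query, response)` returns a score in [0, 1];
# `estimate_difficulty(query)` returns a difficulty estimate in [0, 1].
from typing import Callable, Tuple


def best_of_n(model: Callable[[str], str], query: str, n: int,
              quality_score: Callable[[str, str], float]) -> Tuple[str, float]:
    """Sample n responses from `model` and return the highest-scoring one."""
    scored = [(quality_score(query, r), r)
              for r in (model(query) for _ in range(n))]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score


def route(query: str,
          small_model: Callable[[str], str],
          large_model: Callable[[str], str],
          quality_score: Callable[[str, str], float],
          estimate_difficulty: Callable[[str], float],
          quality_threshold: float = 0.8,
          max_samples: int = 8) -> str:
    """Route a query: try best-of-n on the small model first, then fall back
    to a single large-model response if the threshold is not reached."""
    difficulty = estimate_difficulty(query)
    # Harder queries get a larger small-model sampling budget (assumed policy).
    n = max(1, round(difficulty * max_samples))
    response, score = best_of_n(small_model, query, n, quality_score)
    if score >= quality_threshold:
        return response           # best-of-n small-model answer is good enough
    return large_model(query)     # otherwise pay for one large-model response
```

In a real deployment the scorer would typically be a trained quality estimator, and the sampling budget would be chosen so that the expected cost of n small-model responses stays below the cost of a single large-model response.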

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-ding25d, title = {{BEST}-Route: Adaptive {LLM} Routing with Test-Time Optimal Compute}, author = {Ding, Dujian and Mallick, Ankur and Zhang, Shaokun and Wang, Chi and Madrigal, Daniel and Hipolito Garcia, Mirian Del Carmen and Xia, Menglin and Lakshmanan, Laks V. S. and Wu, Qingyun and R\"{u}hle, Victor}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {13870--13884}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/ding25d/ding25d.pdf}, url = {https://proceedings.mlr.press/v267/ding25d.html}, abstract = {Large language models (LLMs) are powerful tools but are often expensive to deploy at scale. LLM query routing mitigates this by dynamically assigning queries to models of varying cost and quality to obtain a desired tradeoff. Prior query routing approaches generate only one response from the selected model and a single response from a small (inexpensive) model was often not good enough to beat a response from a large (expensive) model due to which they end up overusing the large model and missing out on potential cost savings. However, it is well known that for small models, generating multiple responses and selecting the best can enhance quality while remaining cheaper than a single large-model response. We leverage this idea to propose BEST-Route, a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and the quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than 1% performance drop.} }
Endnote
%0 Conference Paper %T BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute %A Dujian Ding %A Ankur Mallick %A Shaokun Zhang %A Chi Wang %A Daniel Madrigal %A Mirian Del Carmen Hipolito Garcia %A Menglin Xia %A Laks V. S. Lakshmanan %A Qingyun Wu %A Victor Rühle %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-ding25d %I PMLR %P 13870--13884 %U https://proceedings.mlr.press/v267/ding25d.html %V 267 %X Large language models (LLMs) are powerful tools but are often expensive to deploy at scale. LLM query routing mitigates this by dynamically assigning queries to models of varying cost and quality to obtain a desired tradeoff. Prior query routing approaches generate only one response from the selected model and a single response from a small (inexpensive) model was often not good enough to beat a response from a large (expensive) model due to which they end up overusing the large model and missing out on potential cost savings. However, it is well known that for small models, generating multiple responses and selecting the best can enhance quality while remaining cheaper than a single large-model response. We leverage this idea to propose BEST-Route, a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and the quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than 1% performance drop.
APA
Ding, D., Mallick, A., Zhang, S., Wang, C., Madrigal, D., Hipolito Garcia, M.D.C., Xia, M., Lakshmanan, L.V.S., Wu, Q. & Rühle, V. (2025). BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:13870-13884. Available from https://proceedings.mlr.press/v267/ding25d.html.