Scaling Inference-Efficient Language Models

Song Bian, Minghao Yan, Shivaram Venkataraman
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:4303-4323, 2025.

Abstract

Scaling laws are powerful tools for predicting the performance of large language models. However, current scaling laws fall short of accounting for inference costs. In this work, we first show that model architecture affects inference latency: models of the same size can differ in latency by up to $3.5\times$. To tackle this challenge, we modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture. Because models with similar training loss can exhibit gaps in downstream evaluation, we also propose a novel method to train inference-efficient models based on the revised scaling laws. We perform extensive empirical studies to fit and evaluate our inference-aware scaling laws, training 63 models that vary in parameter count (80M to 1B), training tokens (1.6B to 30B), and model shape. Guided by our inference-efficient scaling law and model selection method, we release the Morph-1B model, which improves inference latency by $1.8\times$ while maintaining accuracy on downstream tasks compared to open-source models, pushing the Pareto frontier of the accuracy-latency tradeoff. Notably, our experiments reveal that wider and shallower models can yield efficiency gains while preserving accuracy.
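
For context, the short Python sketch below fits the standard Chinchilla parametric loss form, $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$, to synthetic data with SciPy. The paper's inference-aware law additionally conditions on model architecture (e.g., width and depth); the abstract does not spell out that extension, so the sketch covers only the base form, and the grid of sizes and token counts and the target constants are purely illustrative, not results from the paper.

# Minimal sketch: fitting the standard Chinchilla parametric loss form
#   L(N, D) = E + A / N**alpha + B / D**beta
# (Hoffmann et al., 2022) with SciPy. The paper's inference-aware law also
# accounts for model architecture; that extension is not specified in the
# abstract, so only the base form is fit here, on synthetic data spanning
# the abstract's stated ranges (80M-1B parameters, 1.6B-30B tokens).
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(ND, E, A, alpha, B, beta):
    N, D = ND  # parameter count, number of training tokens
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D) grid; targets generated from Chinchilla's published
# constants (E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28), not from
# this paper.
params = np.array([80e6, 160e6, 410e6, 1e9])
tokens = np.array([1.6e9, 3.2e9, 8e9, 30e9])
N, D = map(np.ravel, np.meshgrid(params, tokens))
loss = chinchilla_loss((N, D), 1.69, 406.4, 0.34, 410.7, 0.28)

# Recover the constants from the synthetic observations.
popt, _ = curve_fit(chinchilla_loss, (N, D), loss,
                    p0=[2.0, 300.0, 0.3, 300.0, 0.3], maxfev=50000)
E, A, alpha, B, beta = popt
print(f"E={E:.2f}, A={A:.1f}, alpha={alpha:.3f}, B={B:.1f}, beta={beta:.3f}")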

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-bian25b,
  title     = {Scaling Inference-Efficient Language Models},
  author    = {Bian, Song and Yan, Minghao and Venkataraman, Shivaram},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {4303--4323},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/bian25b/bian25b.pdf},
  url       = {https://proceedings.mlr.press/v267/bian25b.html},
  abstract  = {Scaling laws are powerful tools to predict the performance of large language models. However, current scaling laws fall short of accounting for inference costs. In this work, we first show that model architecture affects inference latency, where models of the same size can have up to $3.5\times$ difference in latency. To tackle this challenge, we modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture. Due to the reason that models of similar training loss exhibit gaps in downstream evaluation, we also propose a novel method to train inference-efficient models based on the revised scaling laws. We perform extensive empirical studies to fit and evaluate our inference-aware scaling laws. We vary model parameters from 80M to 1B, training tokens from 1.6B to 30B, and model shapes, training 63 models. Guided by our inference-efficient scaling law and model selection method, we release the Morph-1B model, which improves inference latency by $1.8\times$ while maintaining accuracy on downstream tasks compared to open-source models, pushing the Pareto frontier of accuracy-latency tradeoff. Notably, our experiments reveal that wider and shallower models can yield efficiency gains while preserving accuracy.}
}
Endnote
%0 Conference Paper
%T Scaling Inference-Efficient Language Models
%A Song Bian
%A Minghao Yan
%A Shivaram Venkataraman
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-bian25b
%I PMLR
%P 4303--4323
%U https://proceedings.mlr.press/v267/bian25b.html
%V 267
%X Scaling laws are powerful tools to predict the performance of large language models. However, current scaling laws fall short of accounting for inference costs. In this work, we first show that model architecture affects inference latency, where models of the same size can have up to $3.5\times$ difference in latency. To tackle this challenge, we modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture. Due to the reason that models of similar training loss exhibit gaps in downstream evaluation, we also propose a novel method to train inference-efficient models based on the revised scaling laws. We perform extensive empirical studies to fit and evaluate our inference-aware scaling laws. We vary model parameters from 80M to 1B, training tokens from 1.6B to 30B, and model shapes, training 63 models. Guided by our inference-efficient scaling law and model selection method, we release the Morph-1B model, which improves inference latency by $1.8\times$ while maintaining accuracy on downstream tasks compared to open-source models, pushing the Pareto frontier of accuracy-latency tradeoff. Notably, our experiments reveal that wider and shallower models can yield efficiency gains while preserving accuracy.
APA
Bian, S., Yan, M. & Venkataraman, S. (2025). Scaling Inference-Efficient Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:4303-4323. Available from https://proceedings.mlr.press/v267/bian25b.html.
