Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging

Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:185-203, 2025.

Abstract

Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input domain weights. To train this architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch of data sampled with these domain weights. We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly. Soup-of-Experts are particularly appealing when one needs to ship many different specialist models quickly under a size constraint.
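
To make the idea above concrete, here is a minimal PyTorch-style sketch of a Soup-of-Experts layer and one training step: a bank of expert weight matrices is collapsed into a single model whose combination coefficients are predicted from the input domain weights, and training samples random domain weights before backpropagating through a batch drawn with those proportions. All names here (SoupOfExpertsLinear, coeff_net, sample_batch, the Dirichlet sampling) are assumptions made for this sketch, not the authors' implementation.

```python
# Sketch only (assumed structure, not the paper's code): a linear layer whose
# effective weights are a learned linear combination of a bank of expert weights,
# with combination coefficients predicted from the domain weights.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SoupOfExpertsLinear(nn.Module):
    def __init__(self, d_in, d_out, n_experts, n_domains):
        super().__init__()
        # Bank of expert parameters: one weight matrix per expert.
        self.expert_weights = nn.Parameter(torch.randn(n_experts, d_out, d_in) * 0.02)
        # Small network mapping domain weights -> linear combination coefficients.
        self.coeff_net = nn.Sequential(
            nn.Linear(n_domains, 64), nn.ReLU(), nn.Linear(64, n_experts)
        )

    def instantiate(self, domain_weights):
        # domain_weights: (n_domains,) vector on the simplex.
        alpha = self.coeff_net(domain_weights)  # (n_experts,)
        # Collapse the bank into one weight matrix: W = sum_k alpha_k * W_k.
        return torch.einsum("k,koi->oi", alpha, self.expert_weights)

    def forward(self, x, domain_weights):
        return F.linear(x, self.instantiate(domain_weights))


def train_step(model, optimizer, sample_batch, n_domains, device="cpu"):
    # Sample random domain weights from the simplex (Dirichlet is an assumption).
    domain_weights = torch.distributions.Dirichlet(
        torch.ones(n_domains, device=device)
    ).sample()
    # sample_batch is a hypothetical data-loading helper that mixes domains
    # according to domain_weights.
    x, y = sample_batch(domain_weights)
    # Instantiate the corresponding model and backprop through the batch.
    loss = F.cross_entropy(model(x, domain_weights), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this sketch, instantiating a specialist at test time only requires a forward pass through the coefficient network and a weighted sum over the expert bank, so the shipped model has the size of a single expert.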

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-ablin25a,
  title     = {Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging},
  author    = {Ablin, Pierre and Katharopoulos, Angelos and Seto, Skyler and Grangier, David},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {185--203},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/ablin25a/ablin25a.pdf},
  url       = {https://proceedings.mlr.press/v267/ablin25a.html},
  abstract  = {Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input domain weights. To train this architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch of data sampled with these domain weights. We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly. Soup-of-Experts are particularly appealing when one needs to ship many different specialist models quickly under a size constraint.}
}
Endnote
%0 Conference Paper
%T Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
%A Pierre Ablin
%A Angelos Katharopoulos
%A Skyler Seto
%A David Grangier
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-ablin25a
%I PMLR
%P 185--203
%U https://proceedings.mlr.press/v267/ablin25a.html
%V 267
%X Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input domain weights. To train this architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch of data sampled with these domain weights. We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly. Soup-of-Experts are particularly appealing when one needs to ship many different specialist models quickly under a size constraint.
APA
Ablin, P., Katharopoulos, A., Seto, S. & Grangier, D. (2025). Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:185-203. Available from https://proceedings.mlr.press/v267/ablin25a.html.
