Sparse Upcycling: Inference Inefficient Finetuning

Sasha Doubov, Nikhil Sardana, Vitaliy Chiley
Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, PMLR 262:194-205, 2024.

Abstract

Small, highly trained, open-source LLMs are widely used due to their inference efficiency, but further improving their quality remains a challenge. Sparse upcycling is a promising approach that transforms a pretrained dense model into a Mixture-of-Experts (MoE) architecture, increasing the model’s parameter count and potential quality. In this work, we compare the effectiveness of sparse upcycling against continued pretraining (CPT) across different model sizes, FLOP budgets, and pretraining durations. Our experiments show that sparse upcycling can achieve better quality, with improvements of over 20% relative to CPT in certain scenarios. However, this comes with a significant inference cost, leading to 40% slowdowns in high-demand inference settings for larger models. These results highlight the trade-off between model quality and inference efficiency, offering insights for practitioners seeking to balance performance with practical deployment costs.
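As a concrete illustration of the technique the abstract describes, below is a minimal sketch of the general sparse-upcycling recipe: copy the pretrained dense feed-forward weights into each expert of an MoE layer and add a newly initialized router. This is not the authors' code; the class names, num_experts, and top_k values are illustrative assumptions.

import copy
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """Stand-in for a pretrained dense transformer feed-forward block."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

class UpcycledMoE(nn.Module):
    """MoE layer whose experts are initialized as copies of a dense FFN."""
    def __init__(self, dense_ffn, num_experts=8, top_k=2):
        super().__init__()
        d_model = dense_ffn.up.in_features
        # Sparse upcycling: every expert starts from the pretrained dense weights.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        # The router is the only newly initialized component.
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (num_tokens, d_model). Route each token to its top-k experts.
        gates = torch.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)
        # Renormalize so the selected weights sum to 1; since the experts are
        # identical at initialization, the layer then reproduces the dense FFN.
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

# Usage: upcycle a (pretrained, in practice) dense block into an 8-expert MoE layer.
dense = DenseFFN(d_model=512, d_ff=2048)
moe = UpcycledMoE(dense, num_experts=8, top_k=2)

Because each expert is a copy of the dense FFN, continued training starts from the dense model's quality while the total parameter count grows with the number of experts, which is what drives the quality gains and the inference slowdowns the paper measures.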

Cite this Paper


BibTeX
@InProceedings{pmlr-v262-doubov24a,
  title     = {Sparse Upcycling: Inference Inefficient Finetuning},
  author    = {Doubov, Sasha and Sardana, Nikhil and Chiley, Vitaliy},
  booktitle = {Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop},
  pages     = {194--205},
  year      = {2024},
  editor    = {Rezagholizadeh, Mehdi and Passban, Peyman and Samiee, Soheila and Partovi Nia, Vahid and Cheng, Yu and Deng, Yue and Liu, Qun and Chen, Boxing},
  volume    = {262},
  series    = {Proceedings of Machine Learning Research},
  month     = {14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v262/main/assets/doubov24a/doubov24a.pdf},
  url       = {https://proceedings.mlr.press/v262/doubov24a.html},
  abstract  = {Small, highly trained, open-source LLMs are widely used due to their inference efficiency, but further improving their quality remains a challenge. Sparse upcycling is a promising approach that transforms a pretrained dense model into a Mixture-of-Experts (MoE) architecture, increasing the model’s parameter count and potential quality. In this work, we compare the effectiveness of sparse upcycling against continued pretraining (CPT) across different model sizes, FLOP budgets, and pretraining durations. Our experiments show that sparse upcycling can achieve better quality, with improvements of over 20% relative to CPT in certain scenarios. However, this comes with a significant inference cost, leading to 40% slowdowns in high-demand inference settings for larger models. These results highlight the trade-off between model quality and inference efficiency, offering insights for practitioners seeking to balance performance with practical deployment costs.}
}
Endnote
%0 Conference Paper
%T Sparse Upcycling: Inference Inefficient Finetuning
%A Sasha Doubov
%A Nikhil Sardana
%A Vitaliy Chiley
%B Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop
%C Proceedings of Machine Learning Research
%D 2024
%E Mehdi Rezagholizadeh
%E Peyman Passban
%E Soheila Samiee
%E Vahid Partovi Nia
%E Yu Cheng
%E Yue Deng
%E Qun Liu
%E Boxing Chen
%F pmlr-v262-doubov24a
%I PMLR
%P 194--205
%U https://proceedings.mlr.press/v262/doubov24a.html
%V 262
%X Small, highly trained, open-source LLMs are widely used due to their inference efficiency, but further improving their quality remains a challenge. Sparse upcycling is a promising approach that transforms a pretrained dense model into a Mixture-of-Experts (MoE) architecture, increasing the model’s parameter count and potential quality. In this work, we compare the effectiveness of sparse upcycling against continued pretraining (CPT) across different model sizes, FLOP budgets, and pretraining durations. Our experiments show that sparse upcycling can achieve better quality, with improvements of over 20% relative to CPT in certain scenarios. However, this comes with a significant inference cost, leading to 40% slowdowns in high-demand inference settings for larger models. These results highlight the trade-off between model quality and inference efficiency, offering insights for practitioners seeking to balance performance with practical deployment costs.
APA
Doubov, S., Sardana, N., & Chiley, V. (2024). Sparse Upcycling: Inference Inefficient Finetuning. Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, in Proceedings of Machine Learning Research 262:194-205. Available from https://proceedings.mlr.press/v262/doubov24a.html.