Transferring Knowledge From Large Foundation Models to Small Downstream Models

Shikai Qiu, Boran Han, Danielle C. Maddix, Shuai Zhang, Bernie Wang, Andrew Gordon Wilson
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:41644-41657, 2024.

Abstract

How do we transfer the relevant knowledge from ever larger foundation models into small, task-specific downstream models that can run at much lower costs? Standard transfer learning using pre-trained weights as the initialization transfers limited information and commits us to often massive pre-trained architectures. This procedure also precludes combining multiple pre-trained models that learn complementary information. To address these shortcomings, we introduce Adaptive Feature Transfer (AFT). Instead of transferring weights, AFT operates purely on features, thereby decoupling the choice of the pre-trained model from the smaller downstream model. Rather than indiscriminately compressing all pre-trained features, AFT adaptively transfers pre-trained features that are most useful for performing the downstream task, using a simple regularization that adds minimal overhead. Across multiple vision, language, and multi-modal datasets, AFT achieves significantly better downstream performance compared to alternatives with a similar computational cost. Furthermore, AFT reliably translates improvement in pre-trained models into improvement in downstream performance, even if the downstream model is over $50\times$ smaller, and can effectively transfer complementary information learned by multiple pre-trained models.
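To make the idea concrete, below is a rough, hedged sketch of the kind of feature-level transfer regularization the abstract describes: a small downstream model is trained on the task loss plus a penalty that encourages its features to retain information present in frozen pre-trained features. This is only an illustration under assumed interfaces, not the paper's exact AFT objective (which adaptively weights which pre-trained features to transfer); the names FeatureTransferLoss, training_step, beta, and the assumed model outputs are hypothetical.

import torch
import torch.nn as nn

class FeatureTransferLoss(nn.Module):
    """Illustrative feature-level transfer regularizer (not the exact AFT objective).

    Penalizes how poorly the downstream model's features linearly reconstruct
    frozen pre-trained features, so useful pre-trained information is retained.
    """

    def __init__(self, downstream_dim: int, pretrained_dim: int, beta: float = 1.0):
        super().__init__()
        # Hypothetical linear read-out from downstream to pre-trained feature space.
        self.proj = nn.Linear(downstream_dim, pretrained_dim, bias=False)
        self.beta = beta

    def forward(self, downstream_feats: torch.Tensor, pretrained_feats: torch.Tensor) -> torch.Tensor:
        # Reconstruction error against the (detached) pre-trained features.
        recon = self.proj(downstream_feats)
        return self.beta * ((recon - pretrained_feats.detach()) ** 2).mean()

# Assumed training step: small_model(x) returns (logits, features); big_model is a
# frozen pre-trained model (or several, with features concatenated) returning features.
def training_step(small_model, big_model, reg, x, y, task_loss_fn):
    logits, feats = small_model(x)
    with torch.no_grad():
        pre_feats = big_model(x)  # pre-trained features, computed without gradients
    return task_loss_fn(logits, y) + reg(feats, pre_feats)

Because the regularizer only touches features, the downstream architecture can be chosen independently of the pre-trained model, and features from multiple pre-trained models can be concatenated before being passed to the penalty.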

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-qiu24d,
  title     = {Transferring Knowledge From Large Foundation Models to Small Downstream Models},
  author    = {Qiu, Shikai and Han, Boran and Maddix, Danielle C. and Zhang, Shuai and Wang, Bernie and Wilson, Andrew Gordon},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {41644--41657},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/qiu24d/qiu24d.pdf},
  url       = {https://proceedings.mlr.press/v235/qiu24d.html},
  abstract  = {How do we transfer the relevant knowledge from ever larger foundation models into small, task-specific downstream models that can run at much lower costs? Standard transfer learning using pre-trained weights as the initialization transfers limited information and commits us to often massive pre-trained architectures. This procedure also precludes combining multiple pre-trained models that learn complementary information. To address these shortcomings, we introduce Adaptive Feature Transfer (AFT). Instead of transferring weights, AFT operates purely on features, thereby decoupling the choice of the pre-trained model from the smaller downstream model. Rather than indiscriminately compressing all pre-trained features, AFT adaptively transfers pre-trained features that are most useful for performing the downstream task, using a simple regularization that adds minimal overhead. Across multiple vision, language, and multi-modal datasets, AFT achieves significantly better downstream performance compared to alternatives with a similar computational cost. Furthermore, AFT reliably translates improvement in pre-trained models into improvement in downstream performance, even if the downstream model is over $50\times$ smaller, and can effectively transfer complementary information learned by multiple pre-trained models.}
}
APA
Qiu, S., Han, B., Maddix, D. C., Zhang, S., Wang, B., & Wilson, A. G. (2024). Transferring Knowledge From Large Foundation Models to Small Downstream Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:41644-41657. Available from https://proceedings.mlr.press/v235/qiu24d.html.
