Solvable Model for Inheriting the Regularization through Knowledge Distillation

Luca Saglietti, Lenka Zdeborova
Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, PMLR 145:809-846, 2022.

Abstract

In recent years the empirical success of transfer learning with neural networks has stimulated an increasing interest in obtaining a theoretical understanding of its core properties. Knowledge Distillation, where a smaller neural network is trained using the outputs of a larger neural network, is a particularly interesting case of transfer learning. In the present work, we introduce a statistical physics framework that allows an analytic characterization of the properties of knowledge distillation (KD) in shallow neural networks. Focusing the analysis on a solvable model that exhibits a non-trivial generalization gap, we investigate the effectiveness of KD. We show that, through KD, the regularization properties of the larger teacher model can be inherited by the smaller student, and that the resulting generalization performance is closely linked to and limited by the optimality of the teacher. Finally, we analyze the double descent phenomenology that can arise in the considered KD setting.
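
As a rough illustration of the distillation setup described above (not the authors' exact solvable model), the following Python sketch fits a strongly regularized linear "teacher" to noisy labels and then trains a weakly regularized "student" on an interpolation between the labels and the teacher's outputs. The interpolation weight alpha and the ridge_fit helper are hypothetical choices for this example and do not follow the paper's notation.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50                               # samples, input dimension
w_true = rng.normal(size=d) / np.sqrt(d)     # ground-truth linear rule
X = rng.normal(size=(n, d))
y = X @ w_true + 0.5 * rng.normal(size=n)    # noisy training labels

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: argmin ||Xw - y||^2 + lam ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Teacher: well-regularized fit on the original labels.
w_teacher = ridge_fit(X, y, lam=10.0)
y_teacher = X @ w_teacher

# Student: trained on a convex combination of labels and teacher outputs.
alpha = 0.8
y_distill = alpha * y_teacher + (1 - alpha) * y
w_student = ridge_fit(X, y_distill, lam=1e-3)    # weakly regularized student
w_plain = ridge_fit(X, y, lam=1e-3)              # same student without KD

# Generalization error on fresh data: the distilled student typically tracks
# the well-regularized teacher more closely than the student trained on labels alone.
X_test = rng.normal(size=(1000, d))
y_test = X_test @ w_true
print("teacher test MSE:        ", np.mean((X_test @ w_teacher - y_test) ** 2))
print("student (KD) test MSE:   ", np.mean((X_test @ w_student - y_test) ** 2))
print("student (no KD) test MSE:", np.mean((X_test @ w_plain - y_test) ** 2))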

Cite this Paper


BibTeX
@InProceedings{pmlr-v145-saglietti22a,
  title     = {Solvable Model for Inheriting the Regularization through Knowledge Distillation},
  author    = {Saglietti, Luca and Zdeborova, Lenka},
  booktitle = {Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference},
  pages     = {809--846},
  year      = {2022},
  editor    = {Bruna, Joan and Hesthaven, Jan and Zdeborova, Lenka},
  volume    = {145},
  series    = {Proceedings of Machine Learning Research},
  month     = {16--19 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v145/saglietti22a/saglietti22a.pdf},
  url       = {https://proceedings.mlr.press/v145/saglietti22a.html},
  abstract  = {In recent years the empirical success of transfer learning with neural networks has stimulated an increasing interest in obtaining a theoretical understanding of its core properties. Knowledge Distillation where a smaller neural network is trained using the outputs of a larger neural network is a particularly interesting case of transfer learning. In the present work, we introduce a statistical physics framework that allows an analytic characterization of the properties of knowledge distillation (KD) in shallow neural networks. Focusing the analysis on a solvable model that exhibits a non-trivial generalization gap, we investigate the effectiveness of KD. We are able to show that, through KD, the regularization properties of the larger teacher model can be inherited by the smaller student and that the yielded generalization performance is closely linked to and limited by the optimality of the teacher. Finally, we analyze the double descent phenomenology that can arise in the considered KD setting.}
}
Endnote
%0 Conference Paper
%T Solvable Model for Inheriting the Regularization through Knowledge Distillation
%A Luca Saglietti
%A Lenka Zdeborova
%B Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference
%C Proceedings of Machine Learning Research
%D 2022
%E Joan Bruna
%E Jan Hesthaven
%E Lenka Zdeborova
%F pmlr-v145-saglietti22a
%I PMLR
%P 809--846
%U https://proceedings.mlr.press/v145/saglietti22a.html
%V 145
%X In recent years the empirical success of transfer learning with neural networks has stimulated an increasing interest in obtaining a theoretical understanding of its core properties. Knowledge Distillation where a smaller neural network is trained using the outputs of a larger neural network is a particularly interesting case of transfer learning. In the present work, we introduce a statistical physics framework that allows an analytic characterization of the properties of knowledge distillation (KD) in shallow neural networks. Focusing the analysis on a solvable model that exhibits a non-trivial generalization gap, we investigate the effectiveness of KD. We are able to show that, through KD, the regularization properties of the larger teacher model can be inherited by the smaller student and that the yielded generalization performance is closely linked to and limited by the optimality of the teacher. Finally, we analyze the double descent phenomenology that can arise in the considered KD setting.
APA
Saglietti, L. & Zdeborova, L. (2022). Solvable Model for Inheriting the Regularization through Knowledge Distillation. Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, in Proceedings of Machine Learning Research 145:809-846. Available from https://proceedings.mlr.press/v145/saglietti22a.html.