A statistical perspective on distillation

Aditya K Menon, Ankit Singh Rawat, Sashank Reddi, Seungyeon Kim, Sanjiv Kumar
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:7632-7642, 2021.

Abstract

Knowledge distillation is a technique for improving a “student” model by replacing its one-hot training labels with a label distribution obtained from a “teacher” model. Despite its broad success, several basic questions — e.g., Why does distillation help? Why do more accurate teachers not necessarily distill better? — have received limited formal study. In this paper, we present a statistical perspective on distillation which provides an answer to these questions. Our core observation is that a “Bayes teacher” providing the true class-probabilities can lower the variance of the student objective, and thus improve performance. We then establish a bias-variance tradeoff that quantifies the value of teachers that approximate the Bayes class-probabilities. This provides a formal criterion as to what constitutes a “good” teacher, namely, the quality of its probability estimates. Finally, we illustrate how our statistical perspective facilitates novel applications of distillation to bipartite ranking and multiclass retrieval.
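To make the mechanism the abstract describes concrete, here is a minimal NumPy sketch (not taken from the paper) contrasting the standard objective, which scores the student against one-hot labels, with a distilled objective that replaces each one-hot label with the teacher's label distribution. The function names and the randomly generated stand-in "teacher" are illustrative assumptions only.

import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def one_hot_risk(student_logits, labels):
    # Standard empirical risk: cross-entropy against one-hot training labels.
    probs = softmax(student_logits)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def distilled_risk(student_logits, teacher_probs):
    # Distilled risk: cross-entropy against the teacher's label distribution.
    probs = softmax(student_logits)
    return -(teacher_probs * np.log(probs)).sum(axis=-1).mean()

# Toy data: 4 examples, 3 classes; the "teacher" here is random and purely illustrative.
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 3))
labels = rng.integers(0, 3, size=4)
teacher_probs = softmax(rng.normal(size=(4, 3)))

print(one_hot_risk(student_logits, labels))
print(distilled_risk(student_logits, teacher_probs))

In the abstract's terms, when the teacher's probabilities are close to the true class-probabilities, the distilled objective estimates the same population risk as the one-hot objective but with lower variance, since it averages over labels per example rather than relying on a single sampled label.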

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-menon21a,
  title     = {A statistical perspective on distillation},
  author    = {Menon, Aditya K and Rawat, Ankit Singh and Reddi, Sashank and Kim, Seungyeon and Kumar, Sanjiv},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {7632--7642},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/menon21a/menon21a.pdf},
  url       = {https://proceedings.mlr.press/v139/menon21a.html}
}
Endnote
%0 Conference Paper
%T A statistical perspective on distillation
%A Aditya K Menon
%A Ankit Singh Rawat
%A Sashank Reddi
%A Seungyeon Kim
%A Sanjiv Kumar
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-menon21a
%I PMLR
%P 7632--7642
%U https://proceedings.mlr.press/v139/menon21a.html
%V 139
APA
Menon, A.K., Rawat, A.S., Reddi, S., Kim, S. & Kumar, S. (2021). A statistical perspective on distillation. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:7632-7642. Available from https://proceedings.mlr.press/v139/menon21a.html.