Revisit the Essence of Distilling Knowledge through Calibration

Wen-Shu Fan, Su Lu, Xin-Chun Li, De-Chuan Zhan, Le Gan
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:12882-12894, 2024.

Abstract

Knowledge Distillation (KD) has evolved into a practical technology for transferring knowledge from a well-performing model (teacher) to a weak model (student). A counter-intuitive phenomenon known as capacity mismatch has been identified, wherein KD performance may not be good when a better teacher instructs the student. Various preliminary methods have been proposed to alleviate capacity mismatch, but a unifying explanation for its cause remains lacking. In this paper, we propose a unifying analytical framework to pinpoint the core of capacity mismatch based on calibration. Through extensive analytical experiments, we observe a positive correlation between the calibration of the teacher model and the KD performance with original KD methods. As this correlation arises due to the sensitivity of metrics (e.g., KL divergence) to calibration, we recommend employing measurements insensitive to calibration such as ranking-based loss. Our experiments demonstrate that ranking-based loss can effectively replace KL divergence, aiding large models with poor calibration to teach better.
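To make the abstract's central contrast concrete, the sketch below (a minimal, self-contained example assuming PyTorch) compares the standard KL-divergence distillation objective with a generic pairwise margin ranking loss over class logits. The ranking formulation here is an illustrative stand-in, not necessarily the exact loss proposed in the paper; it only illustrates why a ranking objective is insensitive to calibration: monotone transformations of the teacher's logits (e.g., temperature scaling) change the KL term but leave the class ordering, and hence the ranking loss, unchanged.

# Sketch: KL-based KD loss vs. a calibration-insensitive ranking-based KD loss.
# Assumes PyTorch; the pairwise margin ranking loss is an illustrative stand-in.
import torch
import torch.nn.functional as F


def kd_kl_loss(student_logits, teacher_logits, temperature=4.0):
    """Standard KD loss: KL divergence between temperature-softened teacher
    and student distributions. Sensitive to the teacher's calibration."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=1)
    p_teacher = F.softmax(teacher_logits / t, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)


def kd_ranking_loss(student_logits, teacher_logits, margin=0.1):
    """Ranking-based alternative: only the teacher's class ordering is
    transferred. For every class pair (i, j) that the teacher ranks i above j,
    the student is pushed to respect the same ordering by at least `margin`."""
    # Pairwise logit differences, shape (batch, classes, classes):
    # entry [b, i, j] = logit_i - logit_j.
    t_diff = teacher_logits.unsqueeze(2) - teacher_logits.unsqueeze(1)
    s_diff = student_logits.unsqueeze(2) - student_logits.unsqueeze(1)
    # Hinge penalty wherever the teacher prefers class i over class j
    # but the student's margin is too small (or reversed).
    teacher_prefers = (t_diff > 0).float()
    hinge = F.relu(margin - s_diff) * teacher_prefers
    return hinge.sum() / teacher_prefers.sum().clamp(min=1.0)


if __name__ == "__main__":
    student_logits = torch.randn(8, 10)  # batch of 8, 10 classes
    teacher_logits = torch.randn(8, 10)
    print("KL-based KD loss:     ", kd_kl_loss(student_logits, teacher_logits).item())
    print("Ranking-based KD loss:", kd_ranking_loss(student_logits, teacher_logits).item())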

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-fan24d,
  title     = {Revisit the Essence of Distilling Knowledge through Calibration},
  author    = {Fan, Wen-Shu and Lu, Su and Li, Xin-Chun and Zhan, De-Chuan and Gan, Le},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {12882--12894},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/fan24d/fan24d.pdf},
  url       = {https://proceedings.mlr.press/v235/fan24d.html},
  abstract  = {Knowledge Distillation (KD) has evolved into a practical technology for transferring knowledge from a well-performing model (teacher) to a weak model (student). A counter-intuitive phenomenon known as capacity mismatch has been identified, wherein KD performance may not be good when a better teacher instructs the student. Various preliminary methods have been proposed to alleviate capacity mismatch, but a unifying explanation for its cause remains lacking. In this paper, we propose a unifying analytical framework to pinpoint the core of capacity mismatch based on calibration. Through extensive analytical experiments, we observe a positive correlation between the calibration of the teacher model and the KD performance with original KD methods. As this correlation arises due to the sensitivity of metrics (e.g., KL divergence) to calibration, we recommend employing measurements insensitive to calibration such as ranking-based loss. Our experiments demonstrate that ranking-based loss can effectively replace KL divergence, aiding large models with poor calibration to teach better.}
}
Endnote
%0 Conference Paper
%T Revisit the Essence of Distilling Knowledge through Calibration
%A Wen-Shu Fan
%A Su Lu
%A Xin-Chun Li
%A De-Chuan Zhan
%A Le Gan
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-fan24d
%I PMLR
%P 12882--12894
%U https://proceedings.mlr.press/v235/fan24d.html
%V 235
%X Knowledge Distillation (KD) has evolved into a practical technology for transferring knowledge from a well-performing model (teacher) to a weak model (student). A counter-intuitive phenomenon known as capacity mismatch has been identified, wherein KD performance may not be good when a better teacher instructs the student. Various preliminary methods have been proposed to alleviate capacity mismatch, but a unifying explanation for its cause remains lacking. In this paper, we propose a unifying analytical framework to pinpoint the core of capacity mismatch based on calibration. Through extensive analytical experiments, we observe a positive correlation between the calibration of the teacher model and the KD performance with original KD methods. As this correlation arises due to the sensitivity of metrics (e.g., KL divergence) to calibration, we recommend employing measurements insensitive to calibration such as ranking-based loss. Our experiments demonstrate that ranking-based loss can effectively replace KL divergence, aiding large models with poor calibration to teach better.
APA
Fan, W., Lu, S., Li, X., Zhan, D. & Gan, L. (2024). Revisit the Essence of Distilling Knowledge through Calibration. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:12882-12894. Available from https://proceedings.mlr.press/v235/fan24d.html.