Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation

Giung Nam, Hyungi Lee, Byeongho Heo, Juho Lee
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:16353-16367, 2022.

Abstract

Ensembles of deep neural networks have demonstrated superior performance, but their heavy computational cost hinders their use in resource-limited environments. This motivates distilling knowledge from the ensemble teacher into a smaller student network, and there are two important design choices for this ensemble distillation: 1) how to construct the student network, and 2) what data should be shown during training. In this paper, we propose a weight averaging technique in which a student with multiple subnetworks is trained to absorb the functional diversity of the ensemble teacher, and those subnetworks are then properly averaged for inference, yielding a single student network with no additional inference cost. We also propose a perturbation strategy that seeks inputs from which the diversity of the teachers can be better transferred to the student. Combining these two, our method significantly improves upon previous methods on various image classification tasks.
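To make the two ingredients concrete, the PyTorch-style sketch below illustrates them under stated assumptions; it is not the paper's actual procedure. Plain parameter averaging stands in for the "proper" averaging the authors describe, a gradient step that increases teacher disagreement stands in for their diversifying perturbation, and all function names and the epsilon step size are hypothetical.

# Illustrative sketch only; names and objectives are assumptions, not the paper's implementation.
import copy
import torch
import torch.nn.functional as F

def average_subnetworks(subnets):
    """Average the weights of several student subnetworks into a single network,
    so inference uses one model with no extra cost. Assumes all subnetworks
    share the same architecture; plain averaging is used here as a stand-in."""
    averaged = copy.deepcopy(subnets[0])
    avg_state = averaged.state_dict()
    for key in avg_state:
        stacked = torch.stack([s.state_dict()[key].float() for s in subnets])
        avg_state[key] = stacked.mean(dim=0).to(avg_state[key].dtype)
    averaged.load_state_dict(avg_state)
    return averaged

def diversify_inputs(x, teachers, epsilon=0.03):
    """One plausible reading of the diversifying perturbation: nudge inputs
    toward regions where the ensemble teachers disagree, so their functional
    diversity is more visible to the student. Gradient-ascent sketch, not the
    paper's exact objective."""
    x = x.clone().detach().requires_grad_(True)
    probs = torch.stack([F.softmax(t(x), dim=-1) for t in teachers])  # (T, B, C)
    mean_probs = probs.mean(dim=0, keepdim=True)
    # Disagreement measured as the average KL divergence from each teacher to the mean.
    disagreement = (probs * (probs.clamp_min(1e-8).log()
                             - mean_probs.clamp_min(1e-8).log())).sum(-1).mean()
    disagreement.backward()
    return (x + epsilon * x.grad.sign()).detach()

In this sketch, a training loop would distill each student subnetwork from the teachers on such perturbed inputs and, at deployment time, call average_subnetworks once to obtain the single student used for inference.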

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-nam22a,
  title     = {Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation},
  author    = {Nam, Giung and Lee, Hyungi and Heo, Byeongho and Lee, Juho},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {16353--16367},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/nam22a/nam22a.pdf},
  url       = {https://proceedings.mlr.press/v162/nam22a.html}
}
Endnote
%0 Conference Paper
%T Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation
%A Giung Nam
%A Hyungi Lee
%A Byeongho Heo
%A Juho Lee
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-nam22a
%I PMLR
%P 16353--16367
%U https://proceedings.mlr.press/v162/nam22a.html
%V 162
APA
Nam, G., Lee, H., Heo, B. & Lee, J. (2022). Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:16353-16367. Available from https://proceedings.mlr.press/v162/nam22a.html.
