Robust distillation for worst-class performance: on the interplay between teacher and student objectives

Serena Wang, Harikrishna Narasimhan, Yichen Zhou, Sara Hooker, Michal Lukasik, Aditya Krishna Menon
Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:2237-2247, 2023.

Abstract

Knowledge distillation is a popular technique that has been shown to produce remarkable gains in average accuracy. However, recent work has shown that these gains are not uniform across subgroups in the data, and can often come at the cost of accuracy on rare subgroups and classes. Robust optimization is a common remedy to improve worst-class accuracy in standard learning settings, but in distillation it is unknown whether it is best to apply robust objectives when training the teacher, the student, or both. This work studies the interplay between robust objectives for the teacher and student. Empirically, we show that jointly modifying the teacher and student objectives can lead to better worst-class student performance and even Pareto improvement in the trade-off between worst-class and overall performance. Theoretically, we show that the per-class calibration of teacher scores is key when training a robust student. Both the theory and experiments support the surprising finding that applying a robust teacher training objective does not always yield a more robust student.
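
For intuition, the following is a minimal LaTeX sketch of the kind of worst-class (min-max) distillation objective the abstract alludes to. The notation (student model f_\theta, teacher probabilities p^t, mixing weight \alpha, K classes) is illustrative and not taken from the paper.

% Standard distillation: minimize an average loss mixing hard labels and teacher soft labels.
\[
\min_{\theta}\; \mathbb{E}_{(x,y)}\Big[(1-\alpha)\,\ell_{\mathrm{CE}}\big(f_\theta(x), y\big) + \alpha\,\mathrm{KL}\big(p^{t}(x)\,\|\,p_\theta(x)\big)\Big]
\]

% Worst-class robust variant: minimize the largest per-class conditional risk instead of the average.
\[
\min_{\theta}\; \max_{k \in [K]}\; \mathbb{E}_{x \mid y=k}\Big[(1-\alpha)\,\ell_{\mathrm{CE}}\big(f_\theta(x), k\big) + \alpha\,\mathrm{KL}\big(p^{t}(x)\,\|\,p_\theta(x)\big)\Big]
\]

Either objective can in principle be applied to the teacher, the student, or both; the paper studies how these choices interact.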

Cite this Paper


BibTeX
@InProceedings{pmlr-v216-wang23e,
  title     = {Robust distillation for worst-class performance: on the interplay between teacher and student objectives},
  author    = {Wang, Serena and Narasimhan, Harikrishna and Zhou, Yichen and Hooker, Sara and Lukasik, Michal and Menon, Aditya Krishna},
  booktitle = {Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence},
  pages     = {2237--2247},
  year      = {2023},
  editor    = {Evans, Robin J. and Shpitser, Ilya},
  volume    = {216},
  series    = {Proceedings of Machine Learning Research},
  month     = {31 Jul--04 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v216/wang23e/wang23e.pdf},
  url       = {https://proceedings.mlr.press/v216/wang23e.html},
  abstract  = {Knowledge distillation is a popular technique that has been shown to produce remarkable gains in average accuracy. However, recent work has shown that these gains are not uniform across subgroups in the data, and can often come at the cost of accuracy on rare subgroups and classes. Robust optimization is a common remedy to improve worst-class accuracy in standard learning settings, but in distillation it is unknown whether it is best to apply robust objectives when training the teacher, the student, or both. This work studies the interplay between robust objectives for the teacher and student. Empirically, we show that jointly modifying the teacher and student objectives can lead to better worst-class student performance and even Pareto improvement in the trade-off between worst-class and overall performance. Theoretically, we show that the per-class calibration of teacher scores is key when training a robust student. Both the theory and experiments support the surprising finding that applying a robust teacher training objective does not always yield a more robust student.}
}
Endnote
%0 Conference Paper
%T Robust distillation for worst-class performance: on the interplay between teacher and student objectives
%A Serena Wang
%A Harikrishna Narasimhan
%A Yichen Zhou
%A Sara Hooker
%A Michal Lukasik
%A Aditya Krishna Menon
%B Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2023
%E Robin J. Evans
%E Ilya Shpitser
%F pmlr-v216-wang23e
%I PMLR
%P 2237--2247
%U https://proceedings.mlr.press/v216/wang23e.html
%V 216
%X Knowledge distillation is a popular technique that has been shown to produce remarkable gains in average accuracy. However, recent work has shown that these gains are not uniform across subgroups in the data, and can often come at the cost of accuracy on rare subgroups and classes. Robust optimization is a common remedy to improve worst-class accuracy in standard learning settings, but in distillation it is unknown whether it is best to apply robust objectives when training the teacher, the student, or both. This work studies the interplay between robust objectives for the teacher and student. Empirically, we show that jointly modifying the teacher and student objectives can lead to better worst-class student performance and even Pareto improvement in the trade-off between worst-class and overall performance. Theoretically, we show that the per-class calibration of teacher scores is key when training a robust student. Both the theory and experiments support the surprising finding that applying a robust teacher training objective does not always yield a more robust student.
APA
Wang, S., Narasimhan, H., Zhou, Y., Hooker, S., Lukasik, M. & Menon, A.K. (2023). Robust distillation for worst-class performance: on the interplay between teacher and student objectives. Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 216:2237-2247. Available from https://proceedings.mlr.press/v216/wang23e.html.