Understanding Self-Distillation in the Presence of Label Noise

Rudrajit Das, Sujay Sanghavi
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:7102-7140, 2023.

Abstract

Self-distillation (SD) is the process of first training a "teacher" model and then using its predictions to train a "student" model that has the same architecture. Specifically, the student’s loss is $\big(\xi*\ell(\text{teacher’s predictions}, \text{ student’s predictions}) + (1-\xi)*\ell(\text{given labels}, \text{ student’s predictions})\big)$, where $\ell$ is the loss function and $\xi$ is some parameter $\in [0,1]$. SD has been empirically observed to provide performance gains in several settings. In this paper, we theoretically characterize the effect of SD in two supervised learning problems with noisy labels. We first analyze SD for regularized linear regression and show that in the high label noise regime, the optimal value of $\xi$ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than 1. Empirically, we show that $\xi > 1$ works better than $\xi \leq 1$ even with the cross-entropy loss for several classification datasets when 50% or 30% of the labels are corrupted. Further, we quantify when optimal SD is better than optimal regularization. Next, we analyze SD in the case of logistic regression for binary classification with random label corruption and quantify the range of label corruption in which the student outperforms the teacher (w.r.t. accuracy). To our knowledge, this is the first result of its kind for the cross-entropy loss.
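The student's loss described above can be sketched in a few lines. This is only an illustration, assuming the squared error for $\ell$ and made-up toy values; the names `squared_error` and `sd_student_loss` are hypothetical and not taken from the paper. The sketch deliberately does not restrict $\xi$ to $[0,1]$, since the abstract reports that values greater than 1 can be optimal under high label noise.

```python
import numpy as np

def squared_error(targets, predictions):
    """Mean squared error between targets and predictions (assumed choice of loss)."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.mean((targets - predictions) ** 2))

def sd_student_loss(teacher_preds, given_labels, student_preds, xi, loss=squared_error):
    """Student loss: xi * loss(teacher, student) + (1 - xi) * loss(labels, student)."""
    teacher_term = loss(teacher_preds, student_preds)
    label_term = loss(given_labels, student_preds)
    return xi * teacher_term + (1.0 - xi) * label_term

# Toy values, made up purely for illustration.
given_labels = [1.0, 0.0, 1.0]    # possibly noisy labels
teacher_preds = [0.5, 0.5, 0.5]   # predictions of the already-trained teacher
student_preds = [1.0, 0.0, 0.0]   # current predictions of the student

for xi in (0.0, 1.0, 1.5):
    value = sd_student_loss(teacher_preds, given_labels, student_preds, xi)
    print(f"xi = {xi}: loss = {value:.4f}")
```

With $\xi = 0$ the student is trained on the given labels alone, with $\xi = 1$ on the teacher's predictions alone, and with $\xi > 1$ the label term receives a negative weight.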

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-das23d,
  title     = {Understanding Self-Distillation in the Presence of Label Noise},
  author    = {Das, Rudrajit and Sanghavi, Sujay},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {7102--7140},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/das23d/das23d.pdf},
  url       = {https://proceedings.mlr.press/v202/das23d.html}
}
Endnote
%0 Conference Paper
%T Understanding Self-Distillation in the Presence of Label Noise
%A Rudrajit Das
%A Sujay Sanghavi
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-das23d
%I PMLR
%P 7102--7140
%U https://proceedings.mlr.press/v202/das23d.html
%V 202
APA
Das, R. & Sanghavi, S. (2023). Understanding Self-Distillation in the Presence of Label Noise. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:7102-7140. Available from https://proceedings.mlr.press/v202/das23d.html.
