Retraining with Predicted Hard Labels Provably Increases Model Accuracy

Rudrajit Das, Inderjit S Dhillon, Alessandro Epasto, Adel Javanmard, Jieming Mao, Vahab Mirrokni, Sujay Sanghavi, Peilin Zhong
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:12509-12538, 2025.

Abstract

The performance of a model trained with noisy labels is often improved by simply retraining the model with its own predicted hard labels (i.e., $1$/$0$ labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable binary classification setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with local label differential privacy (DP), which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at no extra privacy cost; we call this consensus-based retraining. For example, when training ResNet-18 on CIFAR-100 with $\epsilon=3$ label DP, we obtain more than $6$% improvement in accuracy with consensus-based retraining.
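Consensus-based retraining, as described in the abstract, is simple enough to sketch. Below is a minimal, hypothetical illustration (not the authors' code) in the paper's theoretical setting: a linear classifier is first trained on linearly separable data with randomly flipped labels, then retrained only on the samples where its predicted hard label agrees with the given noisy label. The synthetic data, the scikit-learn LogisticRegression model, and the 30% noise rate are all illustrative assumptions; in the label DP application the noisy labels would instead come from a randomized mechanism (e.g., randomized response), and the consensus filter reuses only the already-released noisy labels.

```python
# Minimal sketch of consensus-based retraining (illustrative only, not the authors' code).
# Setting: linearly separable binary classification with randomly corrupted labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic linearly separable data labeled by a ground-truth linear rule.
n, d = 2000, 20
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y_clean = np.sign(X @ w_star)

# Randomly flip a fraction of the given labels (label noise).
flip = rng.random(n) < 0.3
y_noisy = np.where(flip, -y_clean, y_clean)

# Step 1: initial training on the given (noisy) labels.
clf1 = LogisticRegression(max_iter=1000).fit(X, y_noisy)
y_pred = clf1.predict(X)  # the model's predicted hard labels

# Step 2 (consensus-based retraining): keep only the samples where the
# predicted label matches the given noisy label, then retrain on that subset.
consensus = y_pred == y_noisy
clf2 = LogisticRegression(max_iter=1000).fit(X[consensus], y_noisy[consensus])

# Compare accuracy on fresh clean data (a proxy for population accuracy).
X_test = rng.standard_normal((10_000, d))
y_test = np.sign(X_test @ w_star)
print("initial  :", (clf1.predict(X_test) == y_test).mean())
print("retrained:", (clf2.predict(X_test) == y_test).mean())
```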

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-das25b,
  title     = {Retraining with Predicted Hard Labels Provably Increases Model Accuracy},
  author    = {Das, Rudrajit and Dhillon, Inderjit S and Epasto, Alessandro and Javanmard, Adel and Mao, Jieming and Mirrokni, Vahab and Sanghavi, Sujay and Zhong, Peilin},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {12509--12538},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/das25b/das25b.pdf},
  url       = {https://proceedings.mlr.press/v267/das25b.html},
  abstract  = {The performance of a model trained with noisy labels is often improved by simply retraining the model with its own predicted hard labels (i.e., $1$/$0$ labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable binary classification setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with local label differential privacy (DP), which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at no extra privacy cost; we call this consensus-based retraining. For example, when training ResNet-18 on CIFAR-100 with $\epsilon=3$ label DP, we obtain more than $6$% improvement in accuracy with consensus-based retraining.}
}
Endnote
%0 Conference Paper
%T Retraining with Predicted Hard Labels Provably Increases Model Accuracy
%A Rudrajit Das
%A Inderjit S Dhillon
%A Alessandro Epasto
%A Adel Javanmard
%A Jieming Mao
%A Vahab Mirrokni
%A Sujay Sanghavi
%A Peilin Zhong
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-das25b
%I PMLR
%P 12509--12538
%U https://proceedings.mlr.press/v267/das25b.html
%V 267
%X The performance of a model trained with noisy labels is often improved by simply retraining the model with its own predicted hard labels (i.e., $1$/$0$ labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable binary classification setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with local label differential privacy (DP), which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at no extra privacy cost; we call this consensus-based retraining. For example, when training ResNet-18 on CIFAR-100 with $\epsilon=3$ label DP, we obtain more than $6$% improvement in accuracy with consensus-based retraining.
APA
Das, R., Dhillon, I.S., Epasto, A., Javanmard, A., Mao, J., Mirrokni, V., Sanghavi, S. & Zhong, P. (2025). Retraining with Predicted Hard Labels Provably Increases Model Accuracy. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:12509-12538. Available from https://proceedings.mlr.press/v267/das25b.html.