A Theoretical Analysis of Soft-Label vs Hard-Label Training in Neural Networks

Saptarshi Mandal, Xiaojun Lin, Rayadurgam Srikant
Proceedings of the 7th Annual Learning for Dynamics & Control Conference, PMLR 283:1078-1089, 2025.

Abstract

Knowledge distillation, where a small student model learns from a pre-trained large teacher model, has achieved substantial empirical success since the seminal work of (Hinton et al., 2015). Despite prior theoretical studies exploring the benefits of knowledge distillation, an important question remains unanswered: why does soft-label training from the teacher require significantly fewer neurons than directly training a small neural network with hard labels? To address this, we first present motivating experimental results using simple neural network models on a binary classification problem. These results demonstrate that soft-label training consistently outperforms hard-label training in accuracy, with the performance gap becoming more pronounced as the dataset becomes increasingly difficult to classify. We then substantiate these observations with a theoretical contribution based on two-layer neural network models. Specifically, we show that soft-label training using gradient descent requires only \(O\left(\frac{1}{\gamma^2 \epsilon}\right)\) neurons to achieve a classification loss averaged over epochs smaller than some \(\epsilon > 0\), where \(\gamma\) is the separation margin of the limiting kernel. In contrast, hard-label training requires \(O\left(\frac{1}{\gamma^4} \cdot \ln\left(\frac{1}{\epsilon}\right)\right)\) neurons, as derived from an adapted version of the gradient descent analysis in (Ji and Telgarsky, 2020). This implies that when \(\gamma \leq \epsilon\), i.e., when the dataset is challenging to classify, the neuron requirement for soft-label training can be significantly lower than that for hard-label training. Finally, we present experimental results on deep neural networks, further validating these theoretical findings.
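
To make the setup concrete, the following is a minimal illustrative sketch (in PyTorch) of soft-label versus hard-label training of a small two-layer student on a synthetic binary classification problem. The dataset, network widths, losses, and hyperparameters are assumptions chosen for illustration and do not reproduce the paper's exact experimental configuration; in particular, the soft-label loss here is a simple mean-squared error to the teacher's logits, one common distillation variant.

# Illustrative sketch only: compare a small two-layer student trained on hard
# labels vs. on soft labels produced by a wide, pre-trained teacher.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_data(n=2000, margin=0.1):
    # Two overlapping Gaussian blobs in 2D; a smaller margin makes the
    # dataset harder to classify.
    y = torch.randint(0, 2, (n,))
    centers = torch.tensor([[-margin, 0.0], [margin, 0.0]])
    x = centers[y] + 0.5 * torch.randn(n, 2)
    return x, y

class TwoLayerNet(nn.Module):
    # Two-layer ReLU network with a scalar output (binary logit).
    def __init__(self, width):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, width), nn.ReLU(), nn.Linear(width, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

def train(model, x, target, loss_fn, epochs=200, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), target).backward()
        opt.step()
    return model

def accuracy(model, x, y):
    with torch.no_grad():
        return ((model(x) > 0).long() == y).float().mean().item()

x_train, y_train = make_data()
x_test, y_test = make_data()

# Wide teacher trained on hard labels (logistic loss on logits).
teacher = train(TwoLayerNet(width=512), x_train, y_train.float(), nn.BCEWithLogitsLoss())

# Small student trained directly on hard labels.
student_hard = train(TwoLayerNet(width=8), x_train, y_train.float(), nn.BCEWithLogitsLoss())

# Small student of the same width trained on the teacher's soft labels
# (here: MSE to the teacher's logits).
with torch.no_grad():
    soft_targets = teacher(x_train)
student_soft = train(TwoLayerNet(width=8), x_train, soft_targets, nn.MSELoss())

print(f"teacher    test accuracy: {accuracy(teacher, x_test, y_test):.3f}")
print(f"hard-label test accuracy: {accuracy(student_hard, x_test, y_test):.3f}")
print(f"soft-label test accuracy: {accuracy(student_soft, x_test, y_test):.3f}")

Under these assumed settings, shrinking the margin is one way to probe the trend described above: the harder the dataset, the more one would expect the soft-label student to track the teacher more closely than the equally narrow hard-label student.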

Cite this Paper


BibTeX
@InProceedings{pmlr-v283-mandal25a,
  title     = {A Theoretical Analysis of Soft-Label vs Hard-Label Training in Neural Networks},
  author    = {Mandal, Saptarshi and Lin, Xiaojun and Srikant, Rayadurgam},
  booktitle = {Proceedings of the 7th Annual Learning for Dynamics \& Control Conference},
  pages     = {1078--1089},
  year      = {2025},
  editor    = {Ozay, Necmiye and Balzano, Laura and Panagou, Dimitra and Abate, Alessandro},
  volume    = {283},
  series    = {Proceedings of Machine Learning Research},
  month     = {04--06 Jun},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v283/main/assets/mandal25a/mandal25a.pdf},
  url       = {https://proceedings.mlr.press/v283/mandal25a.html},
  abstract  = {Knowledge distillation, where a small student model learns from a pre-trained large teacher model, has achieved substantial empirical success since the seminal work of (Hinton et al., 2015). Despite prior theoretical studies exploring the benefits of knowledge distillation, an important question remains unanswered: why does soft-label training from the teacher require significantly fewer neurons than directly training a small neural network with hard labels? To address this, we first present motivating experimental results using simple neural network models on a binary classification problem. These results demonstrate that soft-label training consistently outperforms hard-label training in accuracy, with the performance gap becoming more pronounced as the dataset becomes increasingly difficult to classify. We then substantiate these observations with a theoretical contribution based on two-layer neural network models. Specifically, we show that soft-label training using gradient descent requires only \(O\left(\frac{1}{\gamma^2 \epsilon}\right)\) neurons to achieve a classification loss averaged over epochs smaller than some \(\epsilon > 0\), where \(\gamma\) is the separation margin of the limiting kernel. In contrast, hard-label training requires \(O\left(\frac{1}{\gamma^4} \cdot \ln\left(\frac{1}{\epsilon}\right)\right)\) neurons, as derived from an adapted version of the gradient descent analysis in (Ji and Telgarsky, 2020). This implies that when \(\gamma \leq \epsilon\), i.e., when the dataset is challenging to classify, the neuron requirement for soft-label training can be significantly lower than that for hard-label training. Finally, we present experimental results on deep neural networks, further validating these theoretical findings.}
}
Endnote
%0 Conference Paper
%T A Theoretical Analysis of Soft-Label vs Hard-Label Training in Neural Networks
%A Saptarshi Mandal
%A Xiaojun Lin
%A Rayadurgam Srikant
%B Proceedings of the 7th Annual Learning for Dynamics & Control Conference
%C Proceedings of Machine Learning Research
%D 2025
%E Necmiye Ozay
%E Laura Balzano
%E Dimitra Panagou
%E Alessandro Abate
%F pmlr-v283-mandal25a
%I PMLR
%P 1078--1089
%U https://proceedings.mlr.press/v283/mandal25a.html
%V 283
%X Knowledge distillation, where a small student model learns from a pre-trained large teacher model, has achieved substantial empirical success since the seminal work of (Hinton et al., 2015). Despite prior theoretical studies exploring the benefits of knowledge distillation, an important question remains unanswered: why does soft-label training from the teacher require significantly fewer neurons than directly training a small neural network with hard labels? To address this, we first present motivating experimental results using simple neural network models on a binary classification problem. These results demonstrate that soft-label training consistently outperforms hard-label training in accuracy, with the performance gap becoming more pronounced as the dataset becomes increasingly difficult to classify. We then substantiate these observations with a theoretical contribution based on two-layer neural network models. Specifically, we show that soft-label training using gradient descent requires only \(O\left(\frac{1}{\gamma^2 \epsilon}\right)\) neurons to achieve a classification loss averaged over epochs smaller than some \(\epsilon > 0\), where \(\gamma\) is the separation margin of the limiting kernel. In contrast, hard-label training requires \(O\left(\frac{1}{\gamma^4} \cdot \ln\left(\frac{1}{\epsilon}\right)\right)\) neurons, as derived from an adapted version of the gradient descent analysis in (Ji and Telgarsky, 2020). This implies that when \(\gamma \leq \epsilon\), i.e., when the dataset is challenging to classify, the neuron requirement for soft-label training can be significantly lower than that for hard-label training. Finally, we present experimental results on deep neural networks, further validating these theoretical findings.
APA
Mandal, S., Lin, X. & Srikant, R. (2025). A Theoretical Analysis of Soft-Label vs Hard-Label Training in Neural Networks. Proceedings of the 7th Annual Learning for Dynamics & Control Conference, in Proceedings of Machine Learning Research 283:1078-1089. Available from https://proceedings.mlr.press/v283/mandal25a.html.
