Learning a Single Neuron with Adversarial Label Noise via Gradient Descent

Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:4313-4361, 2022.

Abstract

We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations $\sigma:\mathbb{R} \to \mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast \in \mathbb{R}^d$ achieving $F(\mathbf{w}^\ast) = \mathrm{opt}$, where $F(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y) \sim D}[(\sigma(\mathbf{w} \cdot \mathbf{x}) - y)^2]$. The goal of the learner is to output a hypothesis vector $\widetilde{\mathbf{w}}$ such that $F(\widetilde{\mathbf{w}}) \leq C \, \mathrm{opt} + \epsilon$ with high probability, where $C$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions (including ReLUs and sigmoids). Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries:

- For the logistic activation, i.e., $\sigma(t) = 1/(1+e^{-t})$, we obtain the first polynomial-time constant-factor approximation, even under the Gaussian distribution. Moreover, our algorithm has sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within polylogarithmic factors.
- For the ReLU activation, i.e., $\sigma(t) = \max(0,t)$, we give an efficient algorithm with sample complexity $\widetilde{O}(d \, \mathrm{polylog}(1/\epsilon))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\Omega(d/\epsilon)$.

In both settings, our algorithms are simple, performing gradient descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.
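The algorithms described in the abstract amount to gradient descent on the (regularized) $L_2^2$-loss. The sketch below illustrates that idea in NumPy for the logistic activation on synthetic data; the step size, regularization weight, iteration count, and corruption model are illustrative assumptions, not the paper's exact choices or guarantees.

import numpy as np

# Minimal sketch: gradient descent on the regularized empirical L2^2-loss
# for a single neuron x -> sigma(w . x). All hyperparameters below are
# placeholder assumptions for illustration only.

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_grad(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def gd_single_neuron(X, y, lam=1e-3, lr=0.1, steps=1000):
    """Gradient descent on F_hat(w) = mean((sigma(Xw) - y)^2) + lam * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        z = X @ w
        residual = sigmoid(z) - y                               # sigma(w.x) - y
        grad = 2.0 * X.T @ (residual * sigmoid_grad(z)) / n + 2.0 * lam * w
        w -= lr * grad
    return w

# Example usage on synthetic isotropic Gaussian data with a small fraction
# of adversarially corrupted labels (a crude stand-in for label noise).
rng = np.random.default_rng(0)
d, n = 10, 5000
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d))
y = sigmoid(X @ w_star)
corrupt = rng.random(n) < 0.05
y[corrupt] = 1.0 - y[corrupt]

w_hat = gd_single_neuron(X, y)
clean_loss = np.mean((sigmoid(X @ w_hat) - sigmoid(X @ w_star)) ** 2)
print(f"L2^2 error against the target neuron: {clean_loss:.4f}")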

Cite this Paper


BibTeX
@InProceedings{pmlr-v178-diakonikolas22c,
  title     = {Learning a Single Neuron with Adversarial Label Noise via Gradient Descent},
  author    = {Diakonikolas, Ilias and Kontonis, Vasilis and Tzamos, Christos and Zarifis, Nikos},
  booktitle = {Proceedings of Thirty Fifth Conference on Learning Theory},
  pages     = {4313--4361},
  year      = {2022},
  editor    = {Loh, Po-Ling and Raginsky, Maxim},
  volume    = {178},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--05 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v178/diakonikolas22c/diakonikolas22c.pdf},
  url       = {https://proceedings.mlr.press/v178/diakonikolas22c.html}
}
Endnote
%0 Conference Paper
%T Learning a Single Neuron with Adversarial Label Noise via Gradient Descent
%A Ilias Diakonikolas
%A Vasilis Kontonis
%A Christos Tzamos
%A Nikos Zarifis
%B Proceedings of Thirty Fifth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2022
%E Po-Ling Loh
%E Maxim Raginsky
%F pmlr-v178-diakonikolas22c
%I PMLR
%P 4313--4361
%U https://proceedings.mlr.press/v178/diakonikolas22c.html
%V 178
APA
Diakonikolas, I., Kontonis, V., Tzamos, C., & Zarifis, N. (2022). Learning a Single Neuron with Adversarial Label Noise via Gradient Descent. Proceedings of Thirty Fifth Conference on Learning Theory, in Proceedings of Machine Learning Research 178:4313-4361. Available from https://proceedings.mlr.press/v178/diakonikolas22c.html.
