# Learning a Single Neuron with Adversarial Label Noise via Gradient Descent

*Proceedings of Thirty Fifth Conference on Learning Theory*, PMLR 178:4313-4361, 2022.

#### Abstract

We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations $\sigma : \mathbb{R} \to \mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast \in \mathbb{R}^d$ achieving $F(\mathbf{w}^\ast) = \mathrm{OPT}$, where $F(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y) \sim D}[(\sigma(\mathbf{w} \cdot \mathbf{x}) - y)^2]$. The goal of the learner is to output a hypothesis vector $\widetilde{\mathbf{w}}$ such that $F(\widetilde{\mathbf{w}}) = C \, \mathrm{OPT} + \epsilon$ with high probability, where $C$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions (including ReLUs and sigmoids). Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries:

- For the logistic activation, i.e., $\sigma(t) = 1/(1+e^{-t})$, we obtain the first polynomial-time constant-factor approximation, even under the Gaussian distribution. Moreover, our algorithm has sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within polylogarithmic factors.
- For the ReLU activation, i.e., $\sigma(t) = \max(0,t)$, we give an efficient algorithm with sample complexity $\widetilde{O}(d \, \mathrm{polylog}(1/\epsilon))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\Omega(d/\epsilon)$.

In both settings, our algorithms are simple, performing gradient descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.
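To make the setup concrete, the following is a minimal illustrative sketch (not the paper's exact algorithm, which handles adversarial noise and regularization): plain gradient descent on the empirical $L_2^2$-loss for a single neuron with the logistic activation, on synthetic isotropic Gaussian data with a planted parameter vector $\mathbf{w}^\ast$. All names and hyperparameters here are illustrative choices, not taken from the paper.

```python
# Sketch: gradient descent on the empirical L2^2 loss F(w) = mean((sigma(w.x) - y)^2)
# for a single neuron x -> sigma(w . x) with logistic activation, clean labels.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
d, n = 10, 5000
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)          # planted unit-norm parameter vector

X = rng.standard_normal((n, d))            # isotropic Gaussian examples
y = sigmoid(X @ w_star)                    # clean labels; adversarial noise would corrupt these

def loss(w):
    return np.mean((sigmoid(X @ w) - y) ** 2)

w = np.zeros(d)
eta = 1.0
for _ in range(500):
    p = sigmoid(X @ w)
    # Gradient of the empirical L2^2 loss; sigma'(t) = sigma(t)(1 - sigma(t))
    grad = 2.0 * (X.T @ ((p - y) * p * (1 - p))) / n
    w -= eta * grad
```

In the noiseless setting this converges close to $\mathbf{w}^\ast$; the paper's structural results explain why such first-order methods still succeed (up to a constant factor) when an adversary corrupts the labels.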