# Learning a Single Neuron with Adversarial Label Noise via Gradient Descent

*Proceedings of Thirty Fifth Conference on Learning Theory*, PMLR 178:4313-4361, 2022.

#### Abstract

We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations $\sigma : \mathbb{R} \to \mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast \in \mathbb{R}^d$ achieving $F(\mathbf{w}^\ast) = \mathrm{OPT}$, where $F(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y) \sim D}[(\sigma(\mathbf{w} \cdot \mathbf{x}) - y)^2]$. The goal of the learner is to output a hypothesis vector $\widetilde{\mathbf{w}}$ such that $F(\widetilde{\mathbf{w}}) = C \, \mathrm{OPT} + \epsilon$ with high probability, where $C$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions (including ReLUs and sigmoids). Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries:

- For the logistic activation, i.e., $\sigma(t) = 1/(1+e^{-t})$, we obtain the first polynomial-time constant-factor approximation, even under the Gaussian distribution. Moreover, our algorithm has sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within polylogarithmic factors.
- For the ReLU activation, i.e., $\sigma(t) = \max(0,t)$, we give an efficient algorithm with sample complexity $\widetilde{O}(d \, \mathrm{polylog}(1/\epsilon))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\Omega(d/\epsilon)$.

In both settings, our algorithms are simple, performing gradient descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.
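To make the setup concrete, the following is a minimal illustrative sketch (not the paper's exact algorithm, which handles adversarial noise and regularization): plain gradient descent on the empirical $L_2^2$-loss for a single neuron with the logistic activation, on synthetic isotropic Gaussian data with a planted parameter vector $\mathbf{w}^\ast$. All names and hyperparameters here are illustrative choices, not taken from the paper.

```python
# Sketch: gradient descent on the empirical L2^2 loss F(w) = mean((sigma(w.x) - y)^2)
# for a single neuron x -> sigma(w . x) with logistic activation, clean labels.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
d, n = 10, 5000
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)          # planted unit-norm parameter vector

X = rng.standard_normal((n, d))            # isotropic Gaussian examples
y = sigmoid(X @ w_star)                    # clean labels; adversarial noise would corrupt these

def loss(w):
    return np.mean((sigmoid(X @ w) - y) ** 2)

w = np.zeros(d)
eta = 1.0
for _ in range(500):
    p = sigmoid(X @ w)
    # Gradient of the empirical L2^2 loss; sigma'(t) = sigma(t)(1 - sigma(t))
    grad = 2.0 * (X.T @ ((p - y) * p * (1 - p))) / n
    w -= eta * grad
```

In the noiseless setting this converges close to $\mathbf{w}^\ast$; the paper's structural results explain why such first-order methods still succeed (up to a constant factor) when an adversary corrupts the labels.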