Learning a Single Neuron with Adversarial Label Noise via Gradient Descent
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:4313-4361, 2022.
Abstract
We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations $\sigma : \mathbb{R} \to \mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^{\ast} \in \mathbb{R}^d$ achieving $F(\mathbf{w}^{\ast}) = \mathrm{opt}$, where $F(\mathbf{w}) = \mathbf{E}_{(\mathbf{x}, y) \sim D}[(\sigma(\mathbf{w} \cdot \mathbf{x}) - y)^2]$. The goal of the learner is to output a hypothesis vector $\widetilde{\mathbf{w}}$ such that $F(\widetilde{\mathbf{w}}) = C \cdot \mathrm{opt} + \epsilon$ with high probability, where $C$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions (including ReLUs and sigmoids). Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries:

- For the logistic activation, i.e., $\sigma(t) = 1/(1+e^{-t})$, we obtain the first polynomial-time constant-factor approximation, even under the Gaussian distribution. Moreover, our algorithm has sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within polylogarithmic factors.
- For the ReLU activation, i.e., $\sigma(t) = \max(0, t)$, we give an efficient algorithm with sample complexity $\widetilde{O}(d \, \mathrm{polylog}(1/\epsilon))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\Omega(d/\epsilon)$.

In both settings, our algorithms are simple, performing gradient descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.
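To make the setup concrete, here is a minimal sketch (not the paper's exact algorithm, and without its regularization or noise handling) of plain gradient descent on the empirical $L_2^2$-loss $\widehat{F}(\mathbf{w}) = \frac{1}{n}\sum_i (\sigma(\mathbf{w} \cdot \mathbf{x}_i) - y_i)^2$ for the logistic activation. The step size, iteration count, and data-generating choices below are illustrative assumptions.

```python
import numpy as np

def sigmoid(t):
    """Logistic activation sigma(t) = 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + np.exp(-t))

def dsigmoid(t):
    """Derivative sigma'(t) = sigma(t) * (1 - sigma(t))."""
    s = sigmoid(t)
    return s * (1.0 - s)

def gd_single_neuron(X, y, lr=0.5, steps=2000):
    """Gradient descent on the empirical L_2^2 loss
    F_hat(w) = mean_i (sigma(w . x_i) - y_i)^2, from zero init.
    (Illustrative sketch; hyperparameters are assumptions.)"""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        z = X @ w
        # grad F_hat(w) = (2/n) * sum_i (sigma(z_i) - y_i) sigma'(z_i) x_i
        grad = (2.0 / n) * (X.T @ ((sigmoid(z) - y) * dsigmoid(z)))
        w -= lr * grad
    return w

# Usage: noiseless data from a planted unit-norm neuron w*,
# i.e., opt = 0; GD drives the empirical loss toward zero.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
w_star = rng.standard_normal(5)
w_star /= np.linalg.norm(w_star)
y = sigmoid(X @ w_star)

w_hat = gd_single_neuron(X, y)
loss = np.mean((sigmoid(X @ w_hat) - y) ** 2)
```

In the adversarial-label-noise setting the paper studies, the same iteration is run on noisy labels; the structural results quoted above are what guarantee that the stationary point it reaches has loss $C \cdot \mathrm{opt} + \epsilon$.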