Computationally and Statistically Efficient Truncated Regression
[edit]
Proceedings of the ThirtySecond Conference on Learning Theory, PMLR 99:955960, 2019.
Abstract
We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression, where the dependent variable $y = \vec{w}^{\rm T} \vec{x}+{\varepsilon}$ and its corresponding vector of covariates $\vec{x} \in \mathbb{R}^k$ are only revealed if the dependent variable falls in some subset $S \subseteq \mathbb{R}$; otherwise the existence of the pair $(\vec{x},y)$ is hidden. This problem has remained a challenge since the early works of Tobin 1958, Amemiya et al. 1973, HAusman et al 1977, Breen et al. 1996, its applications are abundant, and its history dates back even further to the work of Galton 1897, Pearson 1908, Lee 1915, and Fisher 1931. While consistent estimators of the regression coefficients have been identified, the error rates are not wellunderstood, especially in highdimensional settings. Under a “thickness assumption” about the covariance matrix of the covariates in the revealed sample, we provide a computationally efficient estimator for the coefficient vector $\vec{w}$ from $n$ revealed samples that attains $\ell_2$ error $\tilde{O}(\sqrt{k / n})$, almost recovering the guarantees of least squares in the standard (untruncated) linear regression setting. Our estimator uses Projected Stochastic Gradient Descent (PSGD) on the negative loglikelihood of the truncated sample, and only needs oracle access to the set $S$, which may otherwise be arbitrary, and in particular may be nonconvex. PSGD must be restricted to an appropriately defined convex cone to guarantee that the negative loglikelihood is strongly convex, which in turn is established using concentration of matrices on variables with subexponential tails. We perform experiments on simulated data to illustrate the accuracy of our estimator. As a corollary of our work, we show that SGD provably learns the parameters of singlelayer neural networks with noisy Relu activation functions (see Nair and Hinton 2010, Bengio et al. 2013, Gulcehre et al. 2016), given linearly many, in the number of network parameters, inputoutput pairs in the realizable setting.
Related Material


