Automatic Differentiation of Sketched Regression
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:4367-4376, 2020.
Abstract
Sketching for speeding up regression problems involves using a sketching matrix $S$ to quickly find the approximate solution to a linear least squares regression (LLS) problem: given $A$ of size $n \times d$, with $n \gg d$, along with $b$ of size $n \times 1$, we seek a vector $y$ with minimal regression error $\lVert A y - b\rVert_2$. This approximation technique is now standard in data science, and many software systems use sketched regression internally as a component. It is often useful to calculate derivatives (gradients for the purpose of optimization, for example) of such large systems, where sketched LLS is merely one component of a larger system whose derivatives are needed. To support Automatic Differentiation (AD) of systems containing sketched LLS, we consider propagating derivatives through $\textrm{LLS}$: both propagating perturbations (forward AD) and gradients (reverse AD). AD performs accurate differentiation and is efficient for problems with a huge number of independent variables. Since we use $\textrm{LLS}_S$ (sketched LLS) instead of $\textrm{LLS}$ for reasons of efficiency, propagation of derivatives also needs to trade accuracy for efficiency, presumably by sketching. There are two approaches for this: (a) use AD to transform the code that defines $\textrm{LLS}_S$, or (b) approximate exact derivative propagation through $\textrm{LLS}$ using sketching methods. We provide strong bounds on the errors produced by these two natural forms of sketching in the context of AD, giving the first dimensionality reduction analysis for calculating the derivatives of a sketched computation. Our results crucially depend on the analysis of the operator norm of a sketched inverse matrix product. Extensive experiments on both synthetic and real-world data demonstrate the efficacy of our sketched gradients.
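As a concrete illustration of the setup and of approach (a), the following minimal JAX sketch (not the authors' code) applies forward and reverse AD directly to a sketched least-squares solver. The Gaussian sketching matrix S, the sketch size m, and the helper names lls and lls_sketched are illustrative assumptions, not details taken from the paper; approach (b), which sketches inside the exact derivative formulas for LLS, is only indicated by the exact reference gradient at the end.

import jax
import jax.numpy as jnp

def lls(A, b):
    # Exact LLS via the normal equations: y minimizing ||A y - b||_2.
    return jnp.linalg.solve(A.T @ A, A.T @ b)

def lls_sketched(A, b, S):
    # Sketched LLS_S: solve the much smaller m x d problem ||S A y - S b||_2.
    SA, Sb = S @ A, S @ b
    return jnp.linalg.solve(SA.T @ SA, SA.T @ Sb)

key = jax.random.PRNGKey(0)
n, d, m = 10_000, 20, 400                        # n >> d; the sketch size m is an assumed choice
kA, kb, kS, kt = jax.random.split(key, 4)
A = jax.random.normal(kA, (n, d))
b = jax.random.normal(kb, (n,))
S = jax.random.normal(kS, (m, n)) / jnp.sqrt(m)  # Gaussian sketch; other sketch types also apply

# Approach (a): run AD through the code that defines LLS_S.
grad_a = jax.grad(lambda b: jnp.sum(lls_sketched(A, b, S)))(b)       # reverse AD (gradient)
db = jax.random.normal(kt, (n,))
_, dy_a = jax.jvp(lambda b: lls_sketched(A, b, S), (b,), (db,))      # forward AD (perturbation)

# Reference: exact derivative propagation through LLS, which approach (b)
# would instead approximate with further sketching inside the derivative formulas.
grad_exact = jax.grad(lambda b: jnp.sum(lls(A, b)))(b)

The point of the example is simply that option (a) amounts to tracing the sketched solver with an AD system; comparing grad_a against grad_exact is the kind of error the paper's bounds control.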