Optimal Deterministic Coresets for Ridge Regression
[edit]
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:41414150, 2020.
Abstract
We consider the ridge regression problem, for which we are given an nxd matrix A of examples and a corresponding nxd’ matrix B of labels, as well as a ridge parameter $\lambda \geq 0$, and would like to output an $X’ \in R^{d \times d’}$ for which $$\AX’B\_F^2 + \lambda \X’\_F^2 \leq (1+\epsilon)OPT,$$ where ${OPT} = \min_{Y \in \mathbb{R}^{d \times d’}} \AYB\_F^2 + \lambda \Y\_F^2.$ In the special case of $\lambda = 0$, this is ordinary multiresponse linear regression. Our focus is on deterministically constructing coresets for this problem. Here the goal is to select and reweight a small subset of rows of $A$ and corresponding labels of $B$, denoted by $SA$ and $SB$, so that if $X’$ is the minimizer to $\min_{X’} \SAX’SB\_F^2 + \lambda \X’\_F^2$, then $\AX’B\_F^2 + \lambda \X’\_F^2 \leq (1+\epsilon)OPT$. We show how to efficiently(poly(n,d,1/\epsilon) time) and deterministically select $O({sd}_{\lambda}/\epsilon)$ rows of $A$ and $B$ to achieve this property, and prove a matching lower bound, showing that it is necessary to select $\Omega({sd}_{\lambda}/\epsilon)$ rows no matter what the weights are, for any $1 < 1/\epsilon \leq sd_{\lambda}$. Here ${sd}_{\lambda}$ is the statistical dimension of the input, and we assume $d’ = O({sd}_{\lambda}) \leq d$. In the case of ordinary regression, this gives a deterministic algorithm achieving $O(d/\epsilon)$ rows and a matching lower bound for any $1 \leq 1/\epsilon \leq d$; for $1/\epsilon > d$ we show $\Theta(d^2)$ rows are sufficient. Finally we show our new coresets are mergeable, giving a deterministic protocol for ridge regression with $O({sd}_{\lambda}/\epsilon)$ words of communication per server, in the important case when the rows of $A$ and $B$ have a constant number of nonzero entries and there are a constant number of servers. Prior to our work the best deterministic protocols in this setting required $\Omega(min({sd}_{\lambda}^2,{sd}_{\lambda}/\epsilon^2))$ communication.
Related Material


