[edit]
Optimal Deterministic Coresets for Ridge Regression
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:4141-4150, 2020.
Abstract
We consider the ridge regression problem, for which we are given an nxd matrix A of examples and a corresponding nxd’ matrix B of labels, as well as a ridge parameter $\lambda \geq 0$, and would like to output an $X’ \in R^{d \times d’}$ for which $$\|AX’-B\|_F^2 + \lambda \|X’\|_F^2 \leq (1+\epsilon)OPT,$$ where ${OPT} = \min_{Y \in \mathbb{R}^{d \times d’}} \|AY-B\|_F^2 + \lambda \|Y\|_F^2.$ In the special case of $\lambda = 0$, this is ordinary multi-response linear regression. Our focus is on deterministically constructing coresets for this problem. Here the goal is to select and re-weight a small subset of rows of $A$ and corresponding labels of $B$, denoted by $SA$ and $SB$, so that if $X’$ is the minimizer to $\min_{X’} \|SAX’-SB\|_F^2 + \lambda \|X’\|_F^2$, then $\|AX’-B\|_F^2 + \lambda \|X’\|_F^2 \leq (1+\epsilon)OPT$. We show how to efficiently(poly(n,d,1/\epsilon) time) and deterministically select $O({sd}_{\lambda}/\epsilon)$ rows of $A$ and $B$ to achieve this property, and prove a matching lower bound, showing that it is necessary to select $\Omega({sd}_{\lambda}/\epsilon)$ rows no matter what the weights are, for any $1 < 1/\epsilon \leq sd_{\lambda}$. Here ${sd}_{\lambda}$ is the statistical dimension of the input, and we assume $d’ = O({sd}_{\lambda}) \leq d$. In the case of ordinary regression, this gives a deterministic algorithm achieving $O(d/\epsilon)$ rows and a matching lower bound for any $1 \leq 1/\epsilon \leq d$; for $1/\epsilon > d$ we show $\Theta(d^2)$ rows are sufficient. Finally we show our new coresets are mergeable, giving a deterministic protocol for ridge regression with $O({sd}_{\lambda}/\epsilon)$ words of communication per server, in the important case when the rows of $A$ and $B$ have a constant number of non-zero entries and there are a constant number of servers. Prior to our work the best deterministic protocols in this setting required $\Omega(min({sd}_{\lambda}^2,{sd}_{\lambda}/\epsilon^2))$ communication.