[edit]
Optimal Deterministic Coresets for Ridge Regression
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:4141-4150, 2020.
Abstract
We consider the ridge regression problem, for which we are given an nxd matrix A of examples and a corresponding nxd’ matrix B of labels, as well as a ridge parameter λ≥0, and would like to output an X′∈Rd×d′ for which ‖ where {OPT} = \min_{Y \in \mathbb{R}^{d \times d’}} \|AY-B\|_F^2 + \lambda \|Y\|_F^2. In the special case of \lambda = 0, this is ordinary multi-response linear regression. Our focus is on deterministically constructing coresets for this problem. Here the goal is to select and re-weight a small subset of rows of A and corresponding labels of B, denoted by SA and SB, so that if X’ is the minimizer to \min_{X’} \|SAX’-SB\|_F^2 + \lambda \|X’\|_F^2, then \|AX’-B\|_F^2 + \lambda \|X’\|_F^2 \leq (1+\epsilon)OPT. We show how to efficiently(poly(n,d,1/\epsilon) time) and deterministically select O({sd}_{\lambda}/\epsilon) rows of A and B to achieve this property, and prove a matching lower bound, showing that it is necessary to select \Omega({sd}_{\lambda}/\epsilon) rows no matter what the weights are, for any 1 < 1/\epsilon \leq sd_{\lambda}. Here {sd}_{\lambda} is the statistical dimension of the input, and we assume d’ = O({sd}_{\lambda}) \leq d. In the case of ordinary regression, this gives a deterministic algorithm achieving O(d/\epsilon) rows and a matching lower bound for any 1 \leq 1/\epsilon \leq d; for 1/\epsilon > d we show \Theta(d^2) rows are sufficient. Finally we show our new coresets are mergeable, giving a deterministic protocol for ridge regression with O({sd}_{\lambda}/\epsilon) words of communication per server, in the important case when the rows of A and B have a constant number of non-zero entries and there are a constant number of servers. Prior to our work the best deterministic protocols in this setting required \Omega(min({sd}_{\lambda}^2,{sd}_{\lambda}/\epsilon^2)) communication.