Optimal Deterministic Coresets for Ridge Regression

Praneeth Kacham, David Woodruff
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:4141-4150, 2020.

Abstract

We consider the ridge regression problem, for which we are given an nxd matrix A of examples and a corresponding nxd’ matrix B of labels, as well as a ridge parameter $\lambda \geq 0$, and would like to output an $X’ \in R^{d \times d’}$ for which $$\|AX’-B\|_F^2 + \lambda \|X’\|_F^2 \leq (1+\epsilon)OPT,$$ where ${OPT} = \min_{Y \in \mathbb{R}^{d \times d’}} \|AY-B\|_F^2 + \lambda \|Y\|_F^2.$ In the special case of $\lambda = 0$, this is ordinary multi-response linear regression. Our focus is on deterministically constructing coresets for this problem. Here the goal is to select and re-weight a small subset of rows of $A$ and corresponding labels of $B$, denoted by $SA$ and $SB$, so that if $X’$ is the minimizer to $\min_{X’} \|SAX’-SB\|_F^2 + \lambda \|X’\|_F^2$, then $\|AX’-B\|_F^2 + \lambda \|X’\|_F^2 \leq (1+\epsilon)OPT$. We show how to efficiently(poly(n,d,1/\epsilon) time) and deterministically select $O({sd}_{\lambda}/\epsilon)$ rows of $A$ and $B$ to achieve this property, and prove a matching lower bound, showing that it is necessary to select $\Omega({sd}_{\lambda}/\epsilon)$ rows no matter what the weights are, for any $1 < 1/\epsilon \leq sd_{\lambda}$. Here ${sd}_{\lambda}$ is the statistical dimension of the input, and we assume $d’ = O({sd}_{\lambda}) \leq d$. In the case of ordinary regression, this gives a deterministic algorithm achieving $O(d/\epsilon)$ rows and a matching lower bound for any $1 \leq 1/\epsilon \leq d$; for $1/\epsilon > d$ we show $\Theta(d^2)$ rows are sufficient. Finally we show our new coresets are mergeable, giving a deterministic protocol for ridge regression with $O({sd}_{\lambda}/\epsilon)$ words of communication per server, in the important case when the rows of $A$ and $B$ have a constant number of non-zero entries and there are a constant number of servers. Prior to our work the best deterministic protocols in this setting required $\Omega(min({sd}_{\lambda}^2,{sd}_{\lambda}/\epsilon^2))$ communication.

Cite this Paper


BibTeX
@InProceedings{pmlr-v108-kacham20a, title = {Optimal Deterministic Coresets for Ridge Regression}, author = {Kacham, Praneeth and Woodruff, David}, booktitle = {Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics}, pages = {4141--4150}, year = {2020}, editor = {Silvia Chiappa and Roberto Calandra}, volume = {108}, series = {Proceedings of Machine Learning Research}, month = {26--28 Aug}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v108/kacham20a/kacham20a.pdf}, url = { http://proceedings.mlr.press/v108/kacham20a.html }, abstract = {We consider the ridge regression problem, for which we are given an nxd matrix A of examples and a corresponding nxd’ matrix B of labels, as well as a ridge parameter $\lambda \geq 0$, and would like to output an $X’ \in R^{d \times d’}$ for which $$\|AX’-B\|_F^2 + \lambda \|X’\|_F^2 \leq (1+\epsilon)OPT,$$ where ${OPT} = \min_{Y \in \mathbb{R}^{d \times d’}} \|AY-B\|_F^2 + \lambda \|Y\|_F^2.$ In the special case of $\lambda = 0$, this is ordinary multi-response linear regression. Our focus is on deterministically constructing coresets for this problem. Here the goal is to select and re-weight a small subset of rows of $A$ and corresponding labels of $B$, denoted by $SA$ and $SB$, so that if $X’$ is the minimizer to $\min_{X’} \|SAX’-SB\|_F^2 + \lambda \|X’\|_F^2$, then $\|AX’-B\|_F^2 + \lambda \|X’\|_F^2 \leq (1+\epsilon)OPT$. We show how to efficiently(poly(n,d,1/\epsilon) time) and deterministically select $O({sd}_{\lambda}/\epsilon)$ rows of $A$ and $B$ to achieve this property, and prove a matching lower bound, showing that it is necessary to select $\Omega({sd}_{\lambda}/\epsilon)$ rows no matter what the weights are, for any $1 < 1/\epsilon \leq sd_{\lambda}$. Here ${sd}_{\lambda}$ is the statistical dimension of the input, and we assume $d’ = O({sd}_{\lambda}) \leq d$. In the case of ordinary regression, this gives a deterministic algorithm achieving $O(d/\epsilon)$ rows and a matching lower bound for any $1 \leq 1/\epsilon \leq d$; for $1/\epsilon > d$ we show $\Theta(d^2)$ rows are sufficient. Finally we show our new coresets are mergeable, giving a deterministic protocol for ridge regression with $O({sd}_{\lambda}/\epsilon)$ words of communication per server, in the important case when the rows of $A$ and $B$ have a constant number of non-zero entries and there are a constant number of servers. Prior to our work the best deterministic protocols in this setting required $\Omega(min({sd}_{\lambda}^2,{sd}_{\lambda}/\epsilon^2))$ communication. } }
Endnote
%0 Conference Paper %T Optimal Deterministic Coresets for Ridge Regression %A Praneeth Kacham %A David Woodruff %B Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2020 %E Silvia Chiappa %E Roberto Calandra %F pmlr-v108-kacham20a %I PMLR %P 4141--4150 %U http://proceedings.mlr.press/v108/kacham20a.html %V 108 %X We consider the ridge regression problem, for which we are given an nxd matrix A of examples and a corresponding nxd’ matrix B of labels, as well as a ridge parameter $\lambda \geq 0$, and would like to output an $X’ \in R^{d \times d’}$ for which $$\|AX’-B\|_F^2 + \lambda \|X’\|_F^2 \leq (1+\epsilon)OPT,$$ where ${OPT} = \min_{Y \in \mathbb{R}^{d \times d’}} \|AY-B\|_F^2 + \lambda \|Y\|_F^2.$ In the special case of $\lambda = 0$, this is ordinary multi-response linear regression. Our focus is on deterministically constructing coresets for this problem. Here the goal is to select and re-weight a small subset of rows of $A$ and corresponding labels of $B$, denoted by $SA$ and $SB$, so that if $X’$ is the minimizer to $\min_{X’} \|SAX’-SB\|_F^2 + \lambda \|X’\|_F^2$, then $\|AX’-B\|_F^2 + \lambda \|X’\|_F^2 \leq (1+\epsilon)OPT$. We show how to efficiently(poly(n,d,1/\epsilon) time) and deterministically select $O({sd}_{\lambda}/\epsilon)$ rows of $A$ and $B$ to achieve this property, and prove a matching lower bound, showing that it is necessary to select $\Omega({sd}_{\lambda}/\epsilon)$ rows no matter what the weights are, for any $1 < 1/\epsilon \leq sd_{\lambda}$. Here ${sd}_{\lambda}$ is the statistical dimension of the input, and we assume $d’ = O({sd}_{\lambda}) \leq d$. In the case of ordinary regression, this gives a deterministic algorithm achieving $O(d/\epsilon)$ rows and a matching lower bound for any $1 \leq 1/\epsilon \leq d$; for $1/\epsilon > d$ we show $\Theta(d^2)$ rows are sufficient. Finally we show our new coresets are mergeable, giving a deterministic protocol for ridge regression with $O({sd}_{\lambda}/\epsilon)$ words of communication per server, in the important case when the rows of $A$ and $B$ have a constant number of non-zero entries and there are a constant number of servers. Prior to our work the best deterministic protocols in this setting required $\Omega(min({sd}_{\lambda}^2,{sd}_{\lambda}/\epsilon^2))$ communication.
APA
Kacham, P. & Woodruff, D.. (2020). Optimal Deterministic Coresets for Ridge Regression. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 108:4141-4150 Available from http://proceedings.mlr.press/v108/kacham20a.html .

Related Material