Dimensionality Reduction for the Sum-of-Distances Metric

Zhili Feng, Praneeth Kacham, David Woodruff
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:3220-3229, 2021.

Abstract

We give a dimensionality reduction procedure to approximate the sum of distances of a given set of $n$ points in $R^d$ to any “shape” that lies in a $k$-dimensional subspace. Here, by “shape” we mean any set of points in $R^d$. Our algorithm takes an input in the form of an $n \times d$ matrix $A$, where each row of $A$ denotes a data point, and outputs a subspace $P$ of dimension $O(k^3/\epsilon^6)$ such that the projections of each of the $n$ points onto the subspace $P$ and the distances of each of the points to the subspace $P$ are sufficient to obtain an $\epsilon$-approximation to the sum of distances to any arbitrary shape that lies in a $k$-dimensional subspace of $R^d$. These include important problems such as $k$-median, $k$-subspace approximation, and $(j,l)$ subspace clustering with $j \cdot l \leq k$. Dimensionality reduction reduces the data storage requirement to $(n+d)k^3/\epsilon^6$ from nnz$(A)$. Here nnz$(A)$ could potentially be as large as $nd$. Our algorithm runs in time nnz$(A)/\epsilon^2 + (n+d)$poly$(k/\epsilon)$, up to logarithmic factors. For dense matrices, where nnz$(A) \approx nd$, we give a faster algorithm that runs in time $nd + (n+d)$poly$(k/\epsilon)$, up to logarithmic factors. Our dimensionality reduction algorithm can also be used to obtain poly$(k/\epsilon)$-size coresets for the $k$-median and $(k,1)$-subspace approximation problems in polynomial time.
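To make the guarantee concrete, here is a minimal sketch in Python of how the reduced representation is used. The subspace $P$ below is a truncated-SVD stand-in, not the paper's actual construction (which builds $P$ via a more involved procedure in roughly nnz$(A)/\epsilon^2$ time), and the function names and final usage example are illustrative only. The key fact is that for a shape $S$ lying inside $P$, the Pythagorean theorem gives dist$(a_i, S)^2 = $ dist$(\mathrm{proj}_i, S)^2 + r_i^2$, where $r_i$ is the distance of $a_i$ to $P$, so the projections and residual distances determine the sum of distances to $S$ exactly; the paper's contribution is showing that restricting shapes to (a neighborhood of) $P$ changes the optimal cost by only a $(1 \pm \epsilon)$ factor.

```python
import numpy as np

def reduce_dimension(A, m):
    """Return m-dimensional projections of the rows of A onto a subspace P,
    together with each row's distance to P.

    NOTE: a truncated SVD is a hypothetical stand-in for the subspace P of
    dimension O(k^3/eps^6) constructed in the paper; the paper's algorithm
    builds P differently (and in roughly nnz(A)/eps^2 time).
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V = Vt[:m]                      # (m, d) orthonormal basis for P
    proj = A @ V.T                  # (n, m) coordinates of each row within P
    resid = np.linalg.norm(A - proj @ V, axis=1)  # distance of each row to P
    return proj, resid

def sum_of_distances(proj, resid, dist_to_shape):
    """Sum of distances from the original points to a shape S lying in P.

    dist_to_shape(p) must return the distance from a point p, given in P's
    m-dimensional coordinates, to S. Since S is inside P and a_i - proj_i
    is orthogonal to P, the Pythagorean theorem gives
    dist(a_i, S)^2 = dist(proj_i, S)^2 + resid_i^2 exactly.
    """
    d_in_P = np.array([dist_to_shape(p) for p in proj])
    return np.sqrt(d_in_P ** 2 + resid ** 2).sum()

# Illustrative use: sum of distances to a single center (k-median with k=1),
# where the center is chosen inside P and expressed in P's coordinates.
A = np.random.randn(1000, 50)
proj, resid = reduce_dimension(A, m=10)
c = proj.mean(axis=0)               # a hypothetical center in P
cost = sum_of_distances(proj, resid, lambda p: np.linalg.norm(p - c))
```

Note that only the projections, the residual distances, and the $m$-dimensional basis need to be stored, about $(n+d)m$ numbers in total, which is where the $(n+d)k^3/\epsilon^6$ storage bound comes from.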

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-feng21a,
  title     = {Dimensionality Reduction for the Sum-of-Distances Metric},
  author    = {Feng, Zhili and Kacham, Praneeth and Woodruff, David},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {3220--3229},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/feng21a/feng21a.pdf},
  url       = {https://proceedings.mlr.press/v139/feng21a.html}
}
APA
Feng, Z., Kacham, P. & Woodruff, D. (2021). Dimensionality Reduction for the Sum-of-Distances Metric. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:3220-3229. Available from https://proceedings.mlr.press/v139/feng21a.html.