Clustering High Dimensional Dynamic Data Streams

Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, Lin F. Yang
Proceedings of the 34th International Conference on Machine Learning, PMLR 70:576-585, 2017.

Abstract

We present data streaming algorithms for the $k$-median problem in high-dimensional dynamic geometric data streams, i.e. streams allowing both insertions and deletions of points from a discrete Euclidean space $\{1, 2, \ldots \Delta\}^d$. Our algorithms use $k \epsilon^{-2} \mathrm{poly}(d \log \Delta)$ space/time and maintain with high probability a small weighted set of points (a coreset) such that for every set of $k$ centers the cost of the coreset $(1+\epsilon)$-approximates the cost of the streamed point set. We also provide algorithms that guarantee only positive weights in the coreset with additional logarithmic factors in the space and time complexities. We can use this positively-weighted coreset to compute a $(1+\epsilon)$-approximation for the $k$-median problem by any efficient offline $k$-median algorithm. All previous algorithms for computing a $(1+\epsilon)$-approximation for the $k$-median problem over dynamic data streams required space and time exponential in $d$. Our algorithms can be generalized to metric spaces of bounded doubling dimension.

Cite this Paper


BibTeX
@InProceedings{pmlr-v70-braverman17a, title = {Clustering High Dimensional Dynamic Data Streams}, author = {Vladimir Braverman and Gereon Frahling and Harry Lang and Christian Sohler and Lin F. Yang}, booktitle = {Proceedings of the 34th International Conference on Machine Learning}, pages = {576--585}, year = {2017}, editor = {Precup, Doina and Teh, Yee Whye}, volume = {70}, series = {Proceedings of Machine Learning Research}, month = {06--11 Aug}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v70/braverman17a/braverman17a.pdf}, url = {https://proceedings.mlr.press/v70/braverman17a.html}, abstract = {We present data streaming algorithms for the $k$-median problem in high-dimensional dynamic geometric data streams, i.e. streams allowing both insertions and deletions of points from a discrete Euclidean space $\{1, 2, \ldots \Delta\}^d$. Our algorithms use $k \epsilon^{-2} \mathrm{poly}(d \log \Delta)$ space/time and maintain with high probability a small weighted set of points (a coreset) such that for every set of $k$ centers the cost of the coreset $(1+\epsilon)$-approximates the cost of the streamed point set. We also provide algorithms that guarantee only positive weights in the coreset with additional logarithmic factors in the space and time complexities. We can use this positively-weighted coreset to compute a $(1+\epsilon)$-approximation for the $k$-median problem by any efficient offline $k$-median algorithm. All previous algorithms for computing a $(1+\epsilon)$-approximation for the $k$-median problem over dynamic data streams required space and time exponential in $d$. Our algorithms can be generalized to metric spaces of bounded doubling dimension.} }
Endnote
%0 Conference Paper %T Clustering High Dimensional Dynamic Data Streams %A Vladimir Braverman %A Gereon Frahling %A Harry Lang %A Christian Sohler %A Lin F. Yang %B Proceedings of the 34th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2017 %E Doina Precup %E Yee Whye Teh %F pmlr-v70-braverman17a %I PMLR %P 576--585 %U https://proceedings.mlr.press/v70/braverman17a.html %V 70 %X We present data streaming algorithms for the $k$-median problem in high-dimensional dynamic geometric data streams, i.e. streams allowing both insertions and deletions of points from a discrete Euclidean space $\{1, 2, \ldots \Delta\}^d$. Our algorithms use $k \epsilon^{-2} \mathrm{poly}(d \log \Delta)$ space/time and maintain with high probability a small weighted set of points (a coreset) such that for every set of $k$ centers the cost of the coreset $(1+\epsilon)$-approximates the cost of the streamed point set. We also provide algorithms that guarantee only positive weights in the coreset with additional logarithmic factors in the space and time complexities. We can use this positively-weighted coreset to compute a $(1+\epsilon)$-approximation for the $k$-median problem by any efficient offline $k$-median algorithm. All previous algorithms for computing a $(1+\epsilon)$-approximation for the $k$-median problem over dynamic data streams required space and time exponential in $d$. Our algorithms can be generalized to metric spaces of bounded doubling dimension.
APA
Braverman, V., Frahling, G., Lang, H., Sohler, C. & Yang, L.F.. (2017). Clustering High Dimensional Dynamic Data Streams. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:576-585 Available from https://proceedings.mlr.press/v70/braverman17a.html.

Related Material