Distributed Clustering via LSH Based Data Partitioning

Aditya Bhaskara, Maheshakya Wijewardena
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:570-579, 2018.

Abstract

Given the importance of clustering in the analysis of large scale data, distributed algorithms for formulations such as k-means, k-median, etc. have been extensively studied. A successful approach here has been the “reduce and merge” paradigm, in which each machine reduces its input size to Õ(k), and this data reduction continues (possibly iteratively) until all the data fits on one machine, at which point the problem is solved locally. This approach has the intrinsic bottleneck that each machine must solve a problem of size $\geq$ k, and needs to communicate at least $\Omega$(k) points to the other machines. We propose a novel data partitioning idea to overcome this bottleneck, and in effect, have different machines focus on “finding different clusters”. Under the assumption that we know the optimum value of the objective up to a poly(n) factor (arbitrary polynomial), we establish worst-case approximation guarantees for our method. We see that our algorithm results in lower communication as well as a near-optimal number of ‘rounds’ of computation (in the popular MapReduce framework).
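
To make the partitioning idea concrete, here is a minimal Python sketch of routing points to machines via locality-sensitive hashing, so that nearby points (and hence, ideally, whole clusters) tend to land on the same machine. This is not the authors' algorithm: the random-hyperplane hash family, the function name lsh_partition, and all parameters below are assumptions chosen for illustration, and the paper's guarantees concern the full distributed k-means/k-median pipeline, not this fragment.

```python
import numpy as np

def lsh_partition(points, num_machines, num_bits=8, seed=0):
    """Assign each point to a machine via a signed-projection LSH bucket.

    points: (n, d) array. Each point's sign pattern against num_bits random
    hyperplanes is packed into an integer bucket id, which is folded onto
    the available machines. Nearby points share buckets with high probability.
    """
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((num_bits, points.shape[1]))
    bits = (points @ hyperplanes.T) > 0             # which side of each hyperplane
    signatures = bits @ (1 << np.arange(num_bits))  # pack bit pattern into a bucket id
    return signatures % num_machines                # fold buckets onto machines

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two well-separated clusters: most points of a cluster land on one machine,
    # so each machine can focus on "finding different clusters".
    pts = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(10, 1, (100, 5))])
    machine = lsh_partition(pts, num_machines=4)
    print(np.bincount(machine[:100], minlength=4))
    print(np.bincount(machine[100:], minlength=4))
```

With such a partition, each machine would run a local clustering on far fewer than n points, rather than every machine reducing its share of the whole input to Õ(k), which is the communication bottleneck the abstract describes.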

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-bhaskara18a,
  title     = {Distributed Clustering via {LSH} Based Data Partitioning},
  author    = {Bhaskara, Aditya and Wijewardena, Maheshakya},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages     = {570--579},
  year      = {2018},
  editor    = {Dy, Jennifer and Krause, Andreas},
  volume    = {80},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--15 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v80/bhaskara18a/bhaskara18a.pdf},
  url       = {https://proceedings.mlr.press/v80/bhaskara18a.html},
  abstract  = {Given the importance of clustering in the analysis of large scale data, distributed algorithms for formulations such as k-means, k-median, etc. have been extensively studied. A successful approach here has been the “reduce and merge” paradigm, in which each machine reduces its input size to {Õ}(k), and this data reduction continues (possibly iteratively) until all the data fits on one machine, at which point the problem is solved locally. This approach has the intrinsic bottleneck that each machine must solve a problem of size $\geq$ k, and needs to communicate at least $\Omega$(k) points to the other machines. We propose a novel data partitioning idea to overcome this bottleneck, and in effect, have different machines focus on “finding different clusters”. Under the assumption that we know the optimum value of the objective up to a poly(n) factor (arbitrary polynomial), we establish worst-case approximation guarantees for our method. We see that our algorithm results in lower communication as well as a near-optimal number of ‘rounds’ of computation (in the popular MapReduce framework).}
}
Endnote
%0 Conference Paper
%T Distributed Clustering via LSH Based Data Partitioning
%A Aditya Bhaskara
%A Maheshakya Wijewardena
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-bhaskara18a
%I PMLR
%P 570--579
%U https://proceedings.mlr.press/v80/bhaskara18a.html
%V 80
%X Given the importance of clustering in the analysis of large scale data, distributed algorithms for formulations such as k-means, k-median, etc. have been extensively studied. A successful approach here has been the “reduce and merge” paradigm, in which each machine reduces its input size to Õ(k), and this data reduction continues (possibly iteratively) until all the data fits on one machine, at which point the problem is solved locally. This approach has the intrinsic bottleneck that each machine must solve a problem of size $\geq$ k, and needs to communicate at least $\Omega$(k) points to the other machines. We propose a novel data partitioning idea to overcome this bottleneck, and in effect, have different machines focus on “finding different clusters”. Under the assumption that we know the optimum value of the objective up to a poly(n) factor (arbitrary polynomial), we establish worst-case approximation guarantees for our method. We see that our algorithm results in lower communication as well as a near-optimal number of ‘rounds’ of computation (in the popular MapReduce framework).
APA
Bhaskara, A. & Wijewardena, M. (2018). Distributed Clustering via LSH Based Data Partitioning. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:570-579. Available from https://proceedings.mlr.press/v80/bhaskara18a.html.