[edit]

# Distributed Clustering via LSH Based Data Partitioning

*Proceedings of the 35th International Conference on Machine Learning*, PMLR 80:570-579, 2018.

#### Abstract

Given the importance of clustering in the analysisof large scale data, distributed algorithms for formulations such as k-means, k-median, etc. have been extensively studied. A successful approach here has been the “reduce and merge” paradigm, in which each machine reduces its input size to {Õ}(k), and this data reduction continues (possibly iteratively) until all the data fits on one machine, at which point the problem is solved locally. This approach has the intrinsic bottleneck that each machine must solve a problem of size $\geq$ k, and needs to communicate at least $\Omega$(k) points to the other machines. We propose a novel data partitioning idea to overcome this bottleneck, and in effect, have different machines focus on “finding different clusters”. Under the assumption that we know the optimum value of the objective up to a poly(n) factor (arbitrary polynomial), we establish worst-case approximation guarantees for our method. We see that our algorithm results in lower communication as well as a near-optimal number of ‘rounds’ of computation (in the popular MapReduce framework).