[edit]
K-Means Clustering with Distributed Dimensions
Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:1339-1348, 2016.
Abstract
Distributed clustering has attracted significant attention in recent years. In this paper, we study the k-means problem in the distributed dimension setting, where the dimensions of the data are partitioned across multiple machines. We provide new approximation algorithms, which incur low communication costs and achieve constant approximation ratios. The communication complexity of our algorithms significantly improve on existing algorithms. We also provide the first communication lower bound, which nearly matches our upper bound in a certain range of parameter setting. Our experimental results show that our algorithms outperform existing algorithms on real data-sets in the distributed dimension setting.