[edit]
Massively Parallel k-Means Clustering for Perturbation Resilient Instances
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:4180-4201, 2022.
Abstract
We consider k-means clustering of n data points in Euclidean space in the Massively Parallel Computation (MPC) model, a computational model which is an abstraction of modern massively parallel computing system such as MapReduce. Recent work provides evidence that getting O(1)-approximate k-means solution for general input points using o(logn) rounds in the MPC model may be impossible under certain conditions [Ghaffari, Kuhn \& Uitto’2019]. However, the real-world data points usually have better structures. One instance of interest is the set of data points which is perturbation resilient [Bilu \& Linial’2010]. In particular, a point set is α-perturbation resilient for k-means if perturbing pairwise distances by multiplicative factors in the range [1,α] does not change the optimum k-means clusters. We bypass the worst case lower bound by considering the perturbation resilient input points and showing o(logn) rounds k-means clustering algorithms for these instances in the MPC model. Specifically, we show a fully scalable (1+ε)-approximate k-means clustering algorithm for O(α)-perturbation resilient instance in the MPC model using O(1) rounds and Oε,d(n1+1/α2+o(1)) total space. If the space per machine is sufficiently larger than k, i.e., at least k⋅nΩ(1), we also develop an optimal k-means clustering algorithm for O(α)-perturbation resilient instance in MPC using O(1) rounds and Od(n1+o(1)⋅(n1/α2+k)) total space.