Clustering SemiRandom Mixtures of Gaussians
[edit]
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:294303, 2018.
Abstract
Gaussian mixture models (GMM) are the most widely used statistical model for the kmeans clustering problem and form a popular framework for clustering in machine learning and data analysis. In this paper, we propose a natural robust model for kmeans clustering that generalizes the Gaussian mixture model, and that we believe will be useful in identifying robust algorithms. Our first contribution is a polynomial time algorithm that provably recovers the groundtruth up to small classification error w.h.p., assuming certain separation between the components. Perhaps surprisingly, the algorithm we analyze is the popular Lloyd’s algorithm for kmeans clustering that is the methodofchoice in practice. Our second result complements the upper bound by giving a nearly matching lower bound on the number of misclassified points incurred by any kmeans clustering algorithm on the semirandom model.
Related Material


