Clustering Semi-Random Mixtures of Gaussians

Aravindan Vijayaraghavan, Pranjal Awasthi
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:5055-5064, 2018.

Abstract

Gaussian mixture models (GMM) are the most widely used statistical model for the k-means clustering problem and form a popular framework for clustering in machine learning and data analysis. In this paper, we propose a natural robust model for k-means clustering that generalizes the Gaussian mixture model, and that we believe will be useful in identifying robust algorithms. Our first contribution is a polynomial time algorithm that provably recovers the ground-truth up to small classification error w.h.p., assuming certain separation between the components. Perhaps surprisingly, the algorithm we analyze is the popular Lloyd’s algorithm for k-means clustering that is the method-of-choice in practice. Our second result complements the upper bound by giving a nearly matching lower bound on the number of misclassified points incurred by any k-means clustering algorithm on the semi-random model.
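The algorithm the paper analyzes is the standard Lloyd's iteration for k-means. As context, a minimal NumPy sketch of Lloyd's algorithm (this is the generic textbook procedure, not the paper's specific initialization or separation conditions; the function name `lloyds` and its parameters are illustrative):

```python
import numpy as np

def lloyds(X, k, iters=50, seed=0):
    """Generic Lloyd's iteration: alternate nearest-center assignment
    and center re-estimation until the centers stop moving."""
    rng = np.random.default_rng(seed)
    # Initialize centers by sampling k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

On well-separated mixtures of the kind the paper studies, this iteration converges to the ground-truth clustering up to small classification error; the paper's contribution is proving such guarantees under the more adversarial semi-random model.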

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-vijayaraghavan18a,
  title     = {Clustering Semi-Random Mixtures of {G}aussians},
  author    = {Vijayaraghavan, Aravindan and Awasthi, Pranjal},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages     = {5055--5064},
  year      = {2018},
  editor    = {Dy, Jennifer and Krause, Andreas},
  volume    = {80},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--15 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v80/vijayaraghavan18a/vijayaraghavan18a.pdf},
  url       = {http://proceedings.mlr.press/v80/vijayaraghavan18a.html},
  abstract  = {Gaussian mixture models (GMM) are the most widely used statistical model for the k-means clustering problem and form a popular framework for clustering in machine learning and data analysis. In this paper, we propose a natural robust model for k-means clustering that generalizes the Gaussian mixture model, and that we believe will be useful in identifying robust algorithms. Our first contribution is a polynomial time algorithm that provably recovers the ground-truth up to small classification error w.h.p., assuming certain separation between the components. Perhaps surprisingly, the algorithm we analyze is the popular Lloyd’s algorithm for k-means clustering that is the method-of-choice in practice. Our second result complements the upper bound by giving a nearly matching lower bound on the number of misclassified points incurred by any k-means clustering algorithm on the semi-random model.}
}
Endnote
%0 Conference Paper
%T Clustering Semi-Random Mixtures of Gaussians
%A Aravindan Vijayaraghavan
%A Pranjal Awasthi
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-vijayaraghavan18a
%I PMLR
%P 5055--5064
%U http://proceedings.mlr.press/v80/vijayaraghavan18a.html
%V 80
%X Gaussian mixture models (GMM) are the most widely used statistical model for the k-means clustering problem and form a popular framework for clustering in machine learning and data analysis. In this paper, we propose a natural robust model for k-means clustering that generalizes the Gaussian mixture model, and that we believe will be useful in identifying robust algorithms. Our first contribution is a polynomial time algorithm that provably recovers the ground-truth up to small classification error w.h.p., assuming certain separation between the components. Perhaps surprisingly, the algorithm we analyze is the popular Lloyd’s algorithm for k-means clustering that is the method-of-choice in practice. Our second result complements the upper bound by giving a nearly matching lower bound on the number of misclassified points incurred by any k-means clustering algorithm on the semi-random model.
APA
Vijayaraghavan, A. & Awasthi, P. (2018). Clustering Semi-Random Mixtures of Gaussians. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:5055-5064. Available from http://proceedings.mlr.press/v80/vijayaraghavan18a.html.