On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice

Cheng Tang; Claire Monteleoni

On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice

Cheng Tang, Claire Monteleoni

Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, PMLR 51:1280-1289, 2016.

Abstract

We provide new analyses of Lloyd’s algorithm (1982), commonly known as the k-means clustering algorithm. Kumar and Kannan (2010) showed that running k-SVD followed by a constant approximation k-means algorithm, and then Lloyd’s algorithm, will correctly cluster nearly all of the dataset with respect to the optimal clustering, provided the dataset satisfies a deterministic clusterability assumption. This method is viewed as the "Swiss Army knife" for clustering problems, subsuming popular generative models such as Gaussian mixtures. However, it is tailored to high dimensional data, i.e., when d ≫k . We analyze Lloyd’s algorithm for general d without using the spectral projection, which leads to a weaker assumption in the case d < k. Surprisingly, we show that a simple and scalable heuristic that combines random sampling with Single-Linkage serves as a good seeding algorithm for Lloyd’s algorithm under this assumption. We then study stopping criteria for Lloyd’s algorithm under the lens of clusterability, accompanied by controlled simulations.

Cite this Paper

BibTeX


@InProceedings{pmlr-v51-tang16b,
  title = 	 {On Lloyd's Algorithm: New Theoretical Insights for Clustering in Practice},
  author = 	 {Tang, Cheng and Monteleoni, Claire},
  booktitle = 	 {Proceedings of the 19th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {1280--1289},
  year = 	 {2016},
  editor = 	 {Gretton, Arthur and Robert, Christian C.},
  volume = 	 {51},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Cadiz, Spain},
  month = 	 {09--11 May},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v51/tang16b.pdf},
  url = 	 {https://proceedings.mlr.press/v51/tang16b.html},
  abstract = 	 {We provide new analyses of Lloyd’s algorithm (1982), commonly known as the k-means clustering algorithm. Kumar and Kannan (2010) showed that running k-SVD followed by a constant approximation k-means algorithm, and then Lloyd’s algorithm, will correctly cluster nearly all of the dataset with respect to the optimal clustering, provided the dataset satisfies a deterministic clusterability assumption. This method is viewed as the "Swiss Army knife" for clustering problems, subsuming popular generative models such as Gaussian mixtures. However, it is tailored to high dimensional data, i.e., when d ≫k . We analyze Lloyd’s algorithm for general d without using the spectral projection, which leads to a weaker assumption in the case d < k. Surprisingly, we show that a simple and scalable heuristic that combines random sampling with Single-Linkage serves as a good seeding algorithm for Lloyd’s algorithm under this assumption. We then study stopping criteria for Lloyd’s algorithm under the lens of clusterability, accompanied by controlled simulations.}
}

Endnote

%0 Conference Paper
%T On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice
%A Cheng Tang
%A Claire Monteleoni
%B Proceedings of the 19th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2016
%E Arthur Gretton
%E Christian C. Robert	
%F pmlr-v51-tang16b
%I PMLR
%P 1280--1289
%U https://proceedings.mlr.press/v51/tang16b.html
%V 51
%X We provide new analyses of Lloyd’s algorithm (1982), commonly known as the k-means clustering algorithm. Kumar and Kannan (2010) showed that running k-SVD followed by a constant approximation k-means algorithm, and then Lloyd’s algorithm, will correctly cluster nearly all of the dataset with respect to the optimal clustering, provided the dataset satisfies a deterministic clusterability assumption. This method is viewed as the "Swiss Army knife" for clustering problems, subsuming popular generative models such as Gaussian mixtures. However, it is tailored to high dimensional data, i.e., when d ≫k . We analyze Lloyd’s algorithm for general d without using the spectral projection, which leads to a weaker assumption in the case d < k. Surprisingly, we show that a simple and scalable heuristic that combines random sampling with Single-Linkage serves as a good seeding algorithm for Lloyd’s algorithm under this assumption. We then study stopping criteria for Lloyd’s algorithm under the lens of clusterability, accompanied by controlled simulations.

RIS


TY  - CPAPER
TI  - On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice
AU  - Cheng Tang
AU  - Claire Monteleoni
BT  - Proceedings of the 19th International Conference on Artificial Intelligence and Statistics
DA  - 2016/05/02
ED  - Arthur Gretton
ED  - Christian C. Robert	
ID  - pmlr-v51-tang16b
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 51
SP  - 1280
EP  - 1289
L1  - http://proceedings.mlr.press/v51/tang16b.pdf
UR  - https://proceedings.mlr.press/v51/tang16b.html
AB  - We provide new analyses of Lloyd’s algorithm (1982), commonly known as the k-means clustering algorithm. Kumar and Kannan (2010) showed that running k-SVD followed by a constant approximation k-means algorithm, and then Lloyd’s algorithm, will correctly cluster nearly all of the dataset with respect to the optimal clustering, provided the dataset satisfies a deterministic clusterability assumption. This method is viewed as the "Swiss Army knife" for clustering problems, subsuming popular generative models such as Gaussian mixtures. However, it is tailored to high dimensional data, i.e., when d ≫k . We analyze Lloyd’s algorithm for general d without using the spectral projection, which leads to a weaker assumption in the case d < k. Surprisingly, we show that a simple and scalable heuristic that combines random sampling with Single-Linkage serves as a good seeding algorithm for Lloyd’s algorithm under this assumption. We then study stopping criteria for Lloyd’s algorithm under the lens of clusterability, accompanied by controlled simulations.
ER  -

APA


Tang, C. & Monteleoni, C.. (2016). On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 51:1280-1289 Available from https://proceedings.mlr.press/v51/tang16b.html.

On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice

Abstract

Cite this Paper

Related Material