Distributed and Provably Good Seedings for k-Means in Constant Rounds

Olivier Bachem; Mario Lucic; Andreas Krause

Distributed and Provably Good Seedings for k-Means in Constant Rounds

Olivier Bachem, Mario Lucic, Andreas Krause

Proceedings of the 34th International Conference on Machine Learning, PMLR 70:292-300, 2017.

Abstract

The k-Means++ algorithm is the state of the art algorithm to solve k-Means clustering problems as the computed clusterings are O(log k) competitive in expectation. However, its seeding step requires k inherently sequential passes through the full data set making it hard to scale to massive data sets. The standard remedy is to use the k-Means|| algorithm which reduces the number of sequential rounds and is thus suitable for a distributed setting. In this paper, we provide a novel analysis of the k-Means|| algorithm that bounds the expected solution quality for any number of rounds and oversampling factors greater than k, the two parameters one needs to choose in practice. In particular, we show that k-Means|| provides provably good clusterings even for a small, constant number of iterations. This theoretical finding explains the common observation that k-Means|| performs extremely well in practice even if the number of rounds is low. We further provide a hard instance that shows that an additive error term as encountered in our analysis is inevitable if less than k-1 rounds are employed.

Cite this Paper

BibTeX


@InProceedings{pmlr-v70-bachem17b,
  title = 	 {Distributed and Provably Good Seedings for k-Means in Constant Rounds},
  author =       {Olivier Bachem and Mario Lucic and Andreas Krause},
  booktitle = 	 {Proceedings of the 34th International Conference on Machine Learning},
  pages = 	 {292--300},
  year = 	 {2017},
  editor = 	 {Precup, Doina and Teh, Yee Whye},
  volume = 	 {70},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {06--11 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v70/bachem17b/bachem17b.pdf},
  url = 	 {https://proceedings.mlr.press/v70/bachem17b.html},
  abstract = 	 {The k-Means++ algorithm is the state of the art algorithm to solve k-Means clustering problems as the computed clusterings are O(log k) competitive in expectation. However, its seeding step requires k inherently sequential passes through the full data set making it hard to scale to massive data sets. The standard remedy is to use the k-Means|| algorithm which reduces the number of sequential rounds and is thus suitable for a distributed setting. In this paper, we provide a novel analysis of the k-Means|| algorithm that bounds the expected solution quality for any number of rounds and oversampling factors greater than k, the two parameters one needs to choose in practice. In particular, we show that k-Means|| provides provably good clusterings even for a small, constant number of iterations. This theoretical finding explains the common observation that k-Means|| performs extremely well in practice even if the number of rounds is low. We further provide a hard instance that shows that an additive error term as encountered in our analysis is inevitable if less than k-1 rounds are employed.}
}

Endnote

%0 Conference Paper
%T Distributed and Provably Good Seedings for k-Means in Constant Rounds
%A Olivier Bachem
%A Mario Lucic
%A Andreas Krause
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh	
%F pmlr-v70-bachem17b
%I PMLR
%P 292--300
%U https://proceedings.mlr.press/v70/bachem17b.html
%V 70
%X The k-Means++ algorithm is the state of the art algorithm to solve k-Means clustering problems as the computed clusterings are O(log k) competitive in expectation. However, its seeding step requires k inherently sequential passes through the full data set making it hard to scale to massive data sets. The standard remedy is to use the k-Means|| algorithm which reduces the number of sequential rounds and is thus suitable for a distributed setting. In this paper, we provide a novel analysis of the k-Means|| algorithm that bounds the expected solution quality for any number of rounds and oversampling factors greater than k, the two parameters one needs to choose in practice. In particular, we show that k-Means|| provides provably good clusterings even for a small, constant number of iterations. This theoretical finding explains the common observation that k-Means|| performs extremely well in practice even if the number of rounds is low. We further provide a hard instance that shows that an additive error term as encountered in our analysis is inevitable if less than k-1 rounds are employed.

APA


Bachem, O., Lucic, M. & Krause, A.. (2017). Distributed and Provably Good Seedings for k-Means in Constant Rounds. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:292-300 Available from https://proceedings.mlr.press/v70/bachem17b.html.

Related Material

Download PDF