Fast Distributed k-Means with a Small Number of Rounds

Tom Hess; Ron Visbord; Sivan Sabato

Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:850-874, 2023.

Abstract

We propose a new algorithm for k-means clustering in a distributed setting, where the data is distributed across many machines, and a coordinator communicates with these machines to calculate the output clustering. Our algorithm guarantees a cost approximation factor and a number of communication rounds that depend only on the computational capacity of the coordinator. Moreover, the algorithm includes a built-in stopping mechanism, which allows it to use fewer communication rounds whenever possible. We show both theoretically and empirically that in many natural cases, indeed 1-4 rounds suffice. In comparison with the popular k-means$||$ algorithm, our approach allows exploiting a larger coordinator capacity to obtain a smaller number of rounds. Our experiments show that the k-means cost obtained by the proposed algorithm is usually better than the cost obtained by k-means$||$, even when the latter is allowed a larger number of rounds. Moreover, the machine running time in our approach is considerably smaller than that of k-means$||$.

Cite this Paper

BibTeX


@InProceedings{pmlr-v206-hess23a,
  title = 	 {Fast Distributed k-Means with a Small Number of Rounds},
  author =       {Hess, Tom and Visbord, Ron and Sabato, Sivan},
  booktitle = 	 {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {850--874},
  year = 	 {2023},
  editor = 	 {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume = 	 {206},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--27 Apr},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v206/hess23a/hess23a.pdf},
  url = 	 {https://proceedings.mlr.press/v206/hess23a.html},
  abstract = 	 {We propose a new algorithm for k-means clustering in a distributed setting, where the data is distributed across many machines, and a coordinator communicates with these machines to calculate the output clustering. Our algorithm guarantees a cost approximation factor and a number of communication rounds that depend only on the computational capacity of the coordinator. Moreover, the algorithm includes a built-in stopping mechanism, which allows it to use fewer communication rounds whenever possible. We show both theoretically and empirically that in many natural cases, indeed 1-4 rounds suffice. In comparison with the popular k-means$||$ algorithm, our approach allows exploiting a larger coordinator capacity to obtain a smaller number of rounds. Our experiments show that the k-means cost obtained by the proposed algorithm is usually better than the cost obtained by k-means$||$, even when the latter is allowed a larger number of rounds. Moreover, the machine running time in our approach is considerably smaller than that of k-means$||$.}
}

Endnote

%0 Conference Paper
%T Fast Distributed k-Means with a Small Number of Rounds
%A Tom Hess
%A Ron Visbord
%A Sivan Sabato
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-hess23a
%I PMLR
%P 850--874
%U https://proceedings.mlr.press/v206/hess23a.html
%V 206
%X We propose a new algorithm for k-means clustering in a distributed setting, where the data is distributed across many machines, and a coordinator communicates with these machines to calculate the output clustering. Our algorithm guarantees a cost approximation factor and a number of communication rounds that depend only on the computational capacity of the coordinator. Moreover, the algorithm includes a built-in stopping mechanism, which allows it to use fewer communication rounds whenever possible. We show both theoretically and empirically that in many natural cases, indeed 1-4 rounds suffice. In comparison with the popular k-means$||$ algorithm, our approach allows exploiting a larger coordinator capacity to obtain a smaller number of rounds. Our experiments show that the k-means cost obtained by the proposed algorithm is usually better than the cost obtained by k-means$||$, even when the latter is allowed a larger number of rounds. Moreover, the machine running time in our approach is considerably smaller than that of k-means$||$.

APA


Hess, T., Visbord, R., & Sabato, S. (2023). Fast Distributed k-Means with a Small Number of Rounds. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:850-874. Available from https://proceedings.mlr.press/v206/hess23a.html.
