Fast Algorithms for Distributed k-Clustering with Outliers

Junyu Huang; Qilong Feng; Ziyun Huang; Jinhui Xu; Jianxin Wang

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:13845-13868, 2023.

Abstract

In this paper, we study the $k$-clustering problems with outliers in the distributed setting. The current best results for the distributed $k$-center problem with outliers have quadratic local running time with communication cost dependent on the aspect ratio $\Delta$ of the given instance, which may constrain the scalability of the algorithms for handling large-scale datasets. To achieve better communication cost for the problem with faster local running time, we propose an inliers-recalling sampling method, which avoids guessing the optimal radius of the given instance, and can achieve a 4-round bi-criteria $(14(1+\epsilon),1+\epsilon)$-approximation with local running time linear in the data size and communication cost independent of the aspect ratio. To obtain a more practical algorithm for the problem, we propose another space-narrowing sampling method, which automatically adjusts the sample size to adapt to different outlier distributions on each machine, and can achieve a 2-round bi-criteria $(14(1+\epsilon),1+\epsilon)$-approximation with communication cost independent of the number of outliers. We show that, if the data points are randomly partitioned across machines, our proposed sampling-based methods can be extended to the $k$-median/means problems with outliers, and can achieve $(O(\frac{1}{\epsilon^2}),1+\epsilon)$-approximation with communication cost independent of the number of outliers. Empirical experiments suggest that the proposed 2-round distributed algorithms outperform other state-of-the-art algorithms.
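As a rough illustration of the objective the abstract refers to, the sketch below evaluates a $k$-center solution with outliers and checks a bi-criteria guarantee. It is not the paper's algorithm. It assumes the usual reading of an $(\alpha,\beta)$ bi-criteria approximation: the covering radius is at most $\alpha$ times the optimal radius while at most $\beta z$ points are discarded, where $z$ is the allowed number of outliers. The function names, the Euclidean metric, and the supplied optimal radius `opt_cost` are all hypothetical choices made for this example.

```python
import math


def kcenter_outlier_cost(points, centers, num_discard):
    """Radius needed to cover all points except the `num_discard` farthest ones.

    Illustrative helper (not from the paper); assumes Euclidean distance.
    """
    # Distance from each point to its nearest center, in ascending order.
    dists = sorted(min(math.dist(p, c) for c in centers) for p in points)
    keep = max(0, len(dists) - num_discard)
    kept = dists[:keep]
    return kept[-1] if kept else 0.0


def is_bicriteria(points, centers, z, opt_cost, alpha, beta):
    """Check radius <= alpha * opt_cost while discarding at most beta * z points.

    `opt_cost` is assumed to be given; computing it exactly is hard in general.
    """
    allowed = math.floor(beta * z)
    return kcenter_outlier_cost(points, centers, allowed) <= alpha * opt_cost


# Toy instance: two tight groups on a line plus one far-away point.
points = [(0, 0), (1, 0), (10, 0), (11, 0), (100, 0)]
centers = [(0, 0), (10, 0)]

radius_with_outlier = kcenter_outlier_cost(points, centers, 0)
radius_without_outlier = kcenter_outlier_cost(points, centers, 1)
```

On this toy instance a single far-away point dominates the radius unless it is discarded, which is why the outlier budget appears as a second criterion next to the radius.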

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-huang23f,
  title = 	 {Fast Algorithms for Distributed k-Clustering with Outliers},
  author =       {Huang, Junyu and Feng, Qilong and Huang, Ziyun and Xu, Jinhui and Wang, Jianxin},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {13845--13868},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/huang23f/huang23f.pdf},
  url = 	 {https://proceedings.mlr.press/v202/huang23f.html},
  abstract = 	 {In this paper, we study the $k$-clustering problems with outliers in the distributed setting. The current best results for the distributed $k$-center problem with outliers have quadratic local running time with communication cost dependent on the aspect ratio $\Delta$ of the given instance, which may constrain the scalability of the algorithms for handling large-scale datasets. To achieve better communication cost for the problem with faster local running time, we propose an inliers-recalling sampling method, which avoids guessing the optimal radius of the given instance, and can achieve a 4-round bi-criteria $(14(1+\epsilon),1+\epsilon)$-approximation with local running time linear in the data size and communication cost independent of the aspect ratio. To obtain a more practical algorithm for the problem, we propose another space-narrowing sampling method, which automatically adjusts the sample size to adapt to different outlier distributions on each machine, and can achieve a 2-round bi-criteria $(14(1+\epsilon),1+\epsilon)$-approximation with communication cost independent of the number of outliers. We show that, if the data points are randomly partitioned across machines, our proposed sampling-based methods can be extended to the $k$-median/means problems with outliers, and can achieve $(O(\frac{1}{\epsilon^2}),1+\epsilon)$-approximation with communication cost independent of the number of outliers. Empirical experiments suggest that the proposed 2-round distributed algorithms outperform other state-of-the-art algorithms.}
}

Endnote

%0 Conference Paper
%T Fast Algorithms for Distributed k-Clustering with Outliers
%A Junyu Huang
%A Qilong Feng
%A Ziyun Huang
%A Jinhui Xu
%A Jianxin Wang
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-huang23f
%I PMLR
%P 13845--13868
%U https://proceedings.mlr.press/v202/huang23f.html
%V 202
%X In this paper, we study the $k$-clustering problems with outliers in the distributed setting. The current best results for the distributed $k$-center problem with outliers have quadratic local running time with communication cost dependent on the aspect ratio $\Delta$ of the given instance, which may constrain the scalability of the algorithms for handling large-scale datasets. To achieve better communication cost for the problem with faster local running time, we propose an inliers-recalling sampling method, which avoids guessing the optimal radius of the given instance, and can achieve a 4-round bi-criteria $(14(1+\epsilon),1+\epsilon)$-approximation with local running time linear in the data size and communication cost independent of the aspect ratio. To obtain a more practical algorithm for the problem, we propose another space-narrowing sampling method, which automatically adjusts the sample size to adapt to different outlier distributions on each machine, and can achieve a 2-round bi-criteria $(14(1+\epsilon),1+\epsilon)$-approximation with communication cost independent of the number of outliers. We show that, if the data points are randomly partitioned across machines, our proposed sampling-based methods can be extended to the $k$-median/means problems with outliers, and can achieve $(O(\frac{1}{\epsilon^2}),1+\epsilon)$-approximation with communication cost independent of the number of outliers. Empirical experiments suggest that the proposed 2-round distributed algorithms outperform other state-of-the-art algorithms.

APA


Huang, J., Feng, Q., Huang, Z., Xu, J. & Wang, J. (2023). Fast Algorithms for Distributed k-Clustering with Outliers. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:13845-13868. Available from https://proceedings.mlr.press/v202/huang23f.html.
