Near-Linear Time Approximation Algorithms for k-means with Outliers

Junyu Huang; Qilong Feng; Ziyun Huang; Jinhui Xu; Jianxin Wang

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:19723-19756, 2024.

Abstract

The k-means with outliers problem is one of the most extensively studied clustering problems in machine learning: the goal is to discard up to z outliers and find a minimum-cost k-means clustering of the remaining data points. Most previous fast approximation algorithms for this problem have running times that depend on the aspect ratio Δ (the ratio between the maximum and the minimum pairwise distances). To remove this dependency, we propose sampling-based algorithms whose running time is almost linear in the data size. A crucial component of our approach is an algorithm called Fast-Sampling, which finds inliers that approximate the optimal clustering centers well without relying on a guess for the optimal clustering cost. It yields a 4-approximate solution in time $O(\frac{ndk\log\log n}{\epsilon^2})$ with O(k/ϵ) centers opened and (1+ϵ)z outliers discarded. To reduce the number of centers opened, we propose a center reduction algorithm that yields an O(1/ϵ)-approximate solution in time $O(\frac{ndk\log \log n}{\epsilon^2} + d\,\mathrm{poly}(k, \frac{1}{\epsilon})\log(n\Delta))$ with (1+ϵ)z outliers discarded and exactly k centers opened. Empirical experiments suggest that our proposed sampling-based algorithms outperform state-of-the-art algorithms for the k-means with outliers problem.
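To make the objective in the abstract concrete, here is a minimal Python sketch of the two quantities it names: the k-means cost after discarding z outliers, for a fixed set of centers, and the aspect ratio Δ. It is not the paper's Fast-Sampling or center reduction algorithm. The function names `cost_with_outliers` and `aspect_ratio` and the toy data are illustrative assumptions, and the aspect ratio helper assumes all points are distinct.

```python
from itertools import combinations
from math import dist


def cost_with_outliers(points, centers, z):
    """Sum of squared distances to the nearest center after discarding z points.

    Illustrative helper (not from the paper). For fixed centers, the best
    points to discard are the z points farthest from their nearest center.
    """
    sq = sorted(min(dist(p, c) ** 2 for c in centers) for p in points)
    kept = sq[: max(len(sq) - z, 0)]  # drop the z largest squared distances
    return sum(kept)


def aspect_ratio(points):
    """Maximum pairwise distance divided by minimum pairwise distance.

    Assumes all points are distinct, so the minimum distance is nonzero.
    """
    d = [dist(p, q) for p, q in combinations(points, 2)]
    return max(d) / min(d)


# Toy data: two tight pairs plus one far-away point acting as an outlier.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (100.0, 100.0)]
centers = [(0.0, 0.5), (10.0, 10.5)]

print(cost_with_outliers(points, centers, z=0))  # dominated by the far point
print(cost_with_outliers(points, centers, z=1))  # far point discarded
print(aspect_ratio(points))
```

With z=1 the far point is discarded and the cost drops to the four squared distances of 0.25 each, which shows why allowing outliers changes the optimal clustering. The quadratic-time `aspect_ratio` is only meant to illustrate the definition of Δ.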

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-huang24e,
  title = 	 {Near-Linear Time Approximation Algorithms for k-means with Outliers},
  author =       {Huang, Junyu and Feng, Qilong and Huang, Ziyun and Xu, Jinhui and Wang, Jianxin},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {19723--19756},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/huang24e/huang24e.pdf},
  url = 	 {https://proceedings.mlr.press/v235/huang24e.html},
  abstract = 	 {The k-means with outliers problem is one of the most extensively studied clustering problems in the field of machine learning, where the goal is to discard up to z outliers and identify a minimum k-means clustering on the remaining data points. Most previous results for this problem have running time dependent on the aspect ratio Δ (the ratio between the maximum and the minimum pairwise distances) to achieve fast approximations. To address the issue of aspect ratio dependency on the running time, we propose sampling-based algorithms with almost linear running time in the data size, where a crucial component of our approach is an algorithm called Fast-Sampling. Fast-Sampling algorithm can find inliers that well approximate the optimal clustering centers without relying on a guess for the optimal clustering costs, where a 4-approximate solution can be obtained in time $O(\frac{ndk\log\log n}{\epsilon^2})$ with O(k/ϵ) centers opened and (1+ϵ)z outliers discarded. To reduce the number of centers opened, we propose a center reduction algorithm, where an O(1/ϵ)-approximate solution can be obtained in time $O(\frac{ndk\log \log n}{\epsilon^2} + dpoly(k, \frac{1}{\epsilon})\log(n\Delta))$ with (1+ϵ)z outliers discarded and exactly k centers opened. Empirical experiments suggest that our proposed sampling-based algorithms outperform state-of-the-art algorithms for the k-means with outliers problem.}
}

Endnote

%0 Conference Paper
%T Near-Linear Time Approximation Algorithms for k-means with Outliers
%A Junyu Huang
%A Qilong Feng
%A Ziyun Huang
%A Jinhui Xu
%A Jianxin Wang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-huang24e
%I PMLR
%P 19723--19756
%U https://proceedings.mlr.press/v235/huang24e.html
%V 235
%X The k-means with outliers problem is one of the most extensively studied clustering problems in the field of machine learning, where the goal is to discard up to z outliers and identify a minimum k-means clustering on the remaining data points. Most previous results for this problem have running time dependent on the aspect ratio Δ (the ratio between the maximum and the minimum pairwise distances) to achieve fast approximations. To address the issue of aspect ratio dependency on the running time, we propose sampling-based algorithms with almost linear running time in the data size, where a crucial component of our approach is an algorithm called Fast-Sampling. Fast-Sampling algorithm can find inliers that well approximate the optimal clustering centers without relying on a guess for the optimal clustering costs, where a 4-approximate solution can be obtained in time $O(\frac{ndk\log\log n}{\epsilon^2})$ with O(k/ϵ) centers opened and (1+ϵ)z outliers discarded. To reduce the number of centers opened, we propose a center reduction algorithm, where an O(1/ϵ)-approximate solution can be obtained in time $O(\frac{ndk\log \log n}{\epsilon^2} + dpoly(k, \frac{1}{\epsilon})\log(n\Delta))$ with (1+ϵ)z outliers discarded and exactly k centers opened. Empirical experiments suggest that our proposed sampling-based algorithms outperform state-of-the-art algorithms for the k-means with outliers problem.

APA


Huang, J., Feng, Q., Huang, Z., Xu, J. & Wang, J. (2024). Near-Linear Time Approximation Algorithms for k-means with Outliers. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:19723-19756. Available from https://proceedings.mlr.press/v235/huang24e.html.
