Robust Algorithms for Online k-means Clustering

Aditya Bhaskara, Aravinda Kanchana Ruwanpathirana
Proceedings of the 31st International Conference on Algorithmic Learning Theory, PMLR 117:148-173, 2020.

Abstract

In the online version of the classic k-means clustering problem, the points of a dataset u1, u2, … arrive one after another in an arbitrary order. When the algorithm sees a point, it must either add it to the set of centers or let go of the point. Once added, a center cannot be removed. The goal is to end up with a set of roughly k centers while competing, in k-means objective value, with the best set of k centers in hindsight. Online versions of k-means and other clustering problems have received significant attention in the literature. The key idea in many algorithms is adaptive sampling: when a new point arrives, it is added to the set of centers with a probability that depends on its distance to the centers chosen so far. Our contributions are as follows:
  1. We give a modified adaptive sampling procedure that obtains a better approximation ratio (improving it from logarithmic to constant).
  2. Our main result is to show how to perform adaptive sampling when the data has outliers (≫ k points that are potentially arbitrarily far from the actual data, thus rendering distance-based sampling prone to picking the outliers).
  3. We also discuss lower bounds for k-means clustering in an online setting.
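The adaptive-sampling idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the cost threshold `f` that scales the acceptance probability is an assumed tuning parameter (in practice it would be set, and periodically adjusted, based on guesses of the optimal clustering cost).

```python
import random

def online_adaptive_sampling(stream, f, seed=0):
    """Sketch of distance-based adaptive sampling for online k-means.

    Each arriving point is irrevocably opened as a center with probability
    proportional to its squared distance to the nearest existing center
    (capped at 1). `f` is an assumed cost threshold, not the paper's rule.
    """
    rng = random.Random(seed)
    centers = []
    for x in stream:
        if not centers:
            centers.append(x)  # the first point is always taken as a center
            continue
        # squared distance from x to its nearest current center
        d2 = min(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers)
        if rng.random() < min(1.0, d2 / f):
            centers.append(x)  # once added, a center is never removed
    return centers
```

Points close to an existing center are almost never opened, while far-away points are opened with probability near 1; this is exactly why arbitrarily distant outliers are so likely to be picked, motivating the paper's robust variant.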

Cite this Paper


BibTeX
@InProceedings{pmlr-v117-bhaskara20a,
  title     = {Robust Algorithms for Online $k$-means Clustering},
  author    = {Bhaskara, Aditya and Ruwanpathirana, Aravinda Kanchana},
  booktitle = {Proceedings of the 31st International Conference on Algorithmic Learning Theory},
  pages     = {148--173},
  year      = {2020},
  editor    = {Kontorovich, Aryeh and Neu, Gergely},
  volume    = {117},
  series    = {Proceedings of Machine Learning Research},
  month     = {08 Feb--11 Feb},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v117/bhaskara20a/bhaskara20a.pdf},
  url       = {https://proceedings.mlr.press/v117/bhaskara20a.html},
  abstract  = {In the online version of the classic $k$-means clustering problem, the points of a dataset $u_1, u_2, \dots$ arrive one after another in an arbitrary order. When the algorithm sees a point, it must either add it to the set of centers or let go of the point. Once added, a center cannot be removed. The goal is to end up with a set of roughly $k$ centers while competing, in $k$-means objective value, with the best set of $k$ centers in hindsight. Online versions of $k$-means and other clustering problems have received significant attention in the literature. The key idea in many algorithms is adaptive sampling: when a new point arrives, it is added to the set of centers with a probability that depends on its distance to the centers chosen so far. Our contributions are as follows: 1. We give a modified adaptive sampling procedure that obtains a better approximation ratio (improving it from logarithmic to constant). 2. Our main result is to show how to perform adaptive sampling when the data has outliers ($\gg k$ points that are potentially arbitrarily far from the actual data, thus rendering distance-based sampling prone to picking the outliers). 3. We also discuss lower bounds for $k$-means clustering in an online setting.}
}
Endnote
%0 Conference Paper
%T Robust Algorithms for Online $k$-means Clustering
%A Aditya Bhaskara
%A Aravinda Kanchana Ruwanpathirana
%B Proceedings of the 31st International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2020
%E Aryeh Kontorovich
%E Gergely Neu
%F pmlr-v117-bhaskara20a
%I PMLR
%P 148--173
%U https://proceedings.mlr.press/v117/bhaskara20a.html
%V 117
%X In the online version of the classic $k$-means clustering problem, the points of a dataset $u_1, u_2, \dots$ arrive one after another in an arbitrary order. When the algorithm sees a point, it must either add it to the set of centers or let go of the point. Once added, a center cannot be removed. The goal is to end up with a set of roughly $k$ centers while competing, in $k$-means objective value, with the best set of $k$ centers in hindsight. Online versions of $k$-means and other clustering problems have received significant attention in the literature. The key idea in many algorithms is adaptive sampling: when a new point arrives, it is added to the set of centers with a probability that depends on its distance to the centers chosen so far. Our contributions are as follows: 1. We give a modified adaptive sampling procedure that obtains a better approximation ratio (improving it from logarithmic to constant). 2. Our main result is to show how to perform adaptive sampling when the data has outliers ($\gg k$ points that are potentially arbitrarily far from the actual data, thus rendering distance-based sampling prone to picking the outliers). 3. We also discuss lower bounds for $k$-means clustering in an online setting.
APA
Bhaskara, A. & Ruwanpathirana, A.K. (2020). Robust Algorithms for Online $k$-means Clustering. Proceedings of the 31st International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 117:148-173. Available from https://proceedings.mlr.press/v117/bhaskara20a.html.