Online k-means Clustering on Arbitrary Data Streams

Robi Bhattacharjee, Jacob Imola, Michal Moshkovitz, Sanjoy Dasgupta
Proceedings of The 34th International Conference on Algorithmic Learning Theory, PMLR 201:204-236, 2023.

Abstract

We consider $k$-means clustering in an online setting where each new data point is assigned to its closest cluster center and incurs a loss equal to the squared distance to that center, after which the algorithm is allowed to update its centers. The goal over a data stream $X$ is to achieve a total loss that is not too much larger than $L(X, OPT_k)$, the best possible loss using $k$ fixed centers in hindsight. We start by introducing a data parameter, $\Lambda(X)$, such that for any online algorithm that maintains $O(k \, \text{poly}(\log n))$ centers after seeing $n$ points, there exists a data stream $X$ for which a loss of $\Omega(\Lambda(X))$ is inevitable. Next, we give a randomized algorithm that achieves online loss $O(\Lambda(X) + L(X, OPT_k))$, while taking $O(k \, \text{poly}(\log n))$ centers and additional memory. It has an update time of $O(k \, \text{poly}(\log n))$ and is the first algorithm to achieve polynomial space and time complexity in the online setting. We note that our results have implications to the related streaming setting, where one final clustering is outputted, and the no-substitution setting, where center selections are permanent. We show a general reduction between the no-substitution cost of a blackbox algorithm and its online cost. Finally, we translate our algorithm to the no-substitution setting and streaming settings, and it competes with and can outperform existing work in the areas.

Cite this Paper


BibTeX
@InProceedings{pmlr-v201-bhattacharjee23b, title = {Online k-means Clustering on Arbitrary Data Streams}, author = {Bhattacharjee, Robi and Imola, Jacob and Moshkovitz, Michal and Dasgupta, Sanjoy}, booktitle = {Proceedings of The 34th International Conference on Algorithmic Learning Theory}, pages = {204--236}, year = {2023}, editor = {Agrawal, Shipra and Orabona, Francesco}, volume = {201}, series = {Proceedings of Machine Learning Research}, month = {20 Feb--23 Feb}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v201/bhattacharjee23b/bhattacharjee23b.pdf}, url = {https://proceedings.mlr.press/v201/bhattacharjee23b.html}, abstract = {We consider $k$-means clustering in an online setting where each new data point is assigned to its closest cluster center and incurs a loss equal to the squared distance to that center, after which the algorithm is allowed to update its centers. The goal over a data stream $X$ is to achieve a total loss that is not too much larger than $L(X, OPT_k)$, the best possible loss using $k$ fixed centers in hindsight. We start by introducing a data parameter, $\Lambda(X)$, such that for any online algorithm that maintains $O(k \, \text{poly}(\log n))$ centers after seeing $n$ points, there exists a data stream $X$ for which a loss of $\Omega(\Lambda(X))$ is inevitable. Next, we give a randomized algorithm that achieves online loss $O(\Lambda(X) + L(X, OPT_k))$, while taking $O(k \, \text{poly}(\log n))$ centers and additional memory. It has an update time of $O(k \, \text{poly}(\log n))$ and is the first algorithm to achieve polynomial space and time complexity in the online setting. We note that our results have implications to the related streaming setting, where one final clustering is outputted, and the no-substitution setting, where center selections are permanent. We show a general reduction between the no-substitution cost of a blackbox algorithm and its online cost. Finally, we translate our algorithm to the no-substitution setting and streaming settings, and it competes with and can outperform existing work in the areas.} }
Endnote
%0 Conference Paper %T Online k-means Clustering on Arbitrary Data Streams %A Robi Bhattacharjee %A Jacob Imola %A Michal Moshkovitz %A Sanjoy Dasgupta %B Proceedings of The 34th International Conference on Algorithmic Learning Theory %C Proceedings of Machine Learning Research %D 2023 %E Shipra Agrawal %E Francesco Orabona %F pmlr-v201-bhattacharjee23b %I PMLR %P 204--236 %U https://proceedings.mlr.press/v201/bhattacharjee23b.html %V 201 %X We consider $k$-means clustering in an online setting where each new data point is assigned to its closest cluster center and incurs a loss equal to the squared distance to that center, after which the algorithm is allowed to update its centers. The goal over a data stream $X$ is to achieve a total loss that is not too much larger than $L(X, OPT_k)$, the best possible loss using $k$ fixed centers in hindsight. We start by introducing a data parameter, $\Lambda(X)$, such that for any online algorithm that maintains $O(k \, \text{poly}(\log n))$ centers after seeing $n$ points, there exists a data stream $X$ for which a loss of $\Omega(\Lambda(X))$ is inevitable. Next, we give a randomized algorithm that achieves online loss $O(\Lambda(X) + L(X, OPT_k))$, while taking $O(k \, \text{poly}(\log n))$ centers and additional memory. It has an update time of $O(k \, \text{poly}(\log n))$ and is the first algorithm to achieve polynomial space and time complexity in the online setting. We note that our results have implications to the related streaming setting, where one final clustering is outputted, and the no-substitution setting, where center selections are permanent. We show a general reduction between the no-substitution cost of a blackbox algorithm and its online cost. Finally, we translate our algorithm to the no-substitution setting and streaming settings, and it competes with and can outperform existing work in the areas.
APA
Bhattacharjee, R., Imola, J., Moshkovitz, M. & Dasgupta, S.. (2023). Online k-means Clustering on Arbitrary Data Streams. Proceedings of The 34th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 201:204-236 Available from https://proceedings.mlr.press/v201/bhattacharjee23b.html.

Related Material