Online k-means Clustering

Vincent Cohen-Addad, Benjamin Guedj, Varun Kanade, Guy Rom
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:1126-1134, 2021.

Abstract

We study the problem of learning a clustering of an online set of points. The specific formulation we use is the k-means objective: At each time step the algorithm has to maintain a set of k candidate centers and the loss incurred by the algorithm is the squared distance between the new point and the closest center. The goal is to minimize regret with respect to the best solution to the k-means objective in hindsight. We show that provided the data lies in a bounded region, learning is possible, namely an implementation of the Multiplicative Weights Update Algorithm (MWUA) using a discretized grid achieves a regret bound of $\tilde{O}(\sqrt{T})$ in expectation. We also present an online-to-offline reduction that shows that an efficient no-regret online algorithm (despite being allowed to choose a different set of candidate centres at each round) implies an offline efficient algorithm for the k-means problem, which is known to be NP-hard. In light of this hardness, we consider the slightly weaker requirement of comparing regret with respect to $(1 + \epsilon)OPT$ and present a no-regret algorithm with runtime $O\left(T \mathrm{poly}(\log(T),k,d,1/\epsilon)^{O(kd)}\right)$. Our algorithm is based on maintaining a set of points of bounded size which is a coreset that helps identifying the \emph{relevant} regions of the space for running an adaptive, more efficient, variant of the MWUA. We show that simpler online algorithms, such as \emph{Follow The Leader} (FTL), fail to produce sublinear regret in the worst case. We also report preliminary experiments with synthetic and real-world data. Our theoretical results answer an open question of Dasgupta (2008).

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-cohen-addad21a, title = { Online k-means Clustering }, author = {Cohen-Addad, Vincent and Guedj, Benjamin and Kanade, Varun and Rom, Guy}, booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics}, pages = {1126--1134}, year = {2021}, editor = {Banerjee, Arindam and Fukumizu, Kenji}, volume = {130}, series = {Proceedings of Machine Learning Research}, month = {13--15 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v130/cohen-addad21a/cohen-addad21a.pdf}, url = {https://proceedings.mlr.press/v130/cohen-addad21a.html}, abstract = { We study the problem of learning a clustering of an online set of points. The specific formulation we use is the k-means objective: At each time step the algorithm has to maintain a set of k candidate centers and the loss incurred by the algorithm is the squared distance between the new point and the closest center. The goal is to minimize regret with respect to the best solution to the k-means objective in hindsight. We show that provided the data lies in a bounded region, learning is possible, namely an implementation of the Multiplicative Weights Update Algorithm (MWUA) using a discretized grid achieves a regret bound of $\tilde{O}(\sqrt{T})$ in expectation. We also present an online-to-offline reduction that shows that an efficient no-regret online algorithm (despite being allowed to choose a different set of candidate centres at each round) implies an offline efficient algorithm for the k-means problem, which is known to be NP-hard. In light of this hardness, we consider the slightly weaker requirement of comparing regret with respect to $(1 + \epsilon)OPT$ and present a no-regret algorithm with runtime $O\left(T \mathrm{poly}(\log(T),k,d,1/\epsilon)^{O(kd)}\right)$. Our algorithm is based on maintaining a set of points of bounded size which is a coreset that helps identifying the \emph{relevant} regions of the space for running an adaptive, more efficient, variant of the MWUA. We show that simpler online algorithms, such as \emph{Follow The Leader} (FTL), fail to produce sublinear regret in the worst case. We also report preliminary experiments with synthetic and real-world data. Our theoretical results answer an open question of Dasgupta (2008). } }
Endnote
%0 Conference Paper %T Online k-means Clustering %A Vincent Cohen-Addad %A Benjamin Guedj %A Varun Kanade %A Guy Rom %B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2021 %E Arindam Banerjee %E Kenji Fukumizu %F pmlr-v130-cohen-addad21a %I PMLR %P 1126--1134 %U https://proceedings.mlr.press/v130/cohen-addad21a.html %V 130 %X We study the problem of learning a clustering of an online set of points. The specific formulation we use is the k-means objective: At each time step the algorithm has to maintain a set of k candidate centers and the loss incurred by the algorithm is the squared distance between the new point and the closest center. The goal is to minimize regret with respect to the best solution to the k-means objective in hindsight. We show that provided the data lies in a bounded region, learning is possible, namely an implementation of the Multiplicative Weights Update Algorithm (MWUA) using a discretized grid achieves a regret bound of $\tilde{O}(\sqrt{T})$ in expectation. We also present an online-to-offline reduction that shows that an efficient no-regret online algorithm (despite being allowed to choose a different set of candidate centres at each round) implies an offline efficient algorithm for the k-means problem, which is known to be NP-hard. In light of this hardness, we consider the slightly weaker requirement of comparing regret with respect to $(1 + \epsilon)OPT$ and present a no-regret algorithm with runtime $O\left(T \mathrm{poly}(\log(T),k,d,1/\epsilon)^{O(kd)}\right)$. Our algorithm is based on maintaining a set of points of bounded size which is a coreset that helps identifying the \emph{relevant} regions of the space for running an adaptive, more efficient, variant of the MWUA. We show that simpler online algorithms, such as \emph{Follow The Leader} (FTL), fail to produce sublinear regret in the worst case. We also report preliminary experiments with synthetic and real-world data. Our theoretical results answer an open question of Dasgupta (2008).
APA
Cohen-Addad, V., Guedj, B., Kanade, V. & Rom, G.. (2021). Online k-means Clustering . Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:1126-1134 Available from https://proceedings.mlr.press/v130/cohen-addad21a.html.

Related Material