Sketch-and-lift: scalable subsampled semidefinite program for K-means clustering

Yubo Zhuang, Xiaohui Chen, Yun Yang
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:9214-9246, 2022.

Abstract

Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite the high accuracy, semidefinite programs are often too slow in practice with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear time complexity algorithm for approximating an SDP relaxed K-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. It is shown that the SL approach enjoys a similar exact recovery threshold as the K-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive with enhanced theoretic properties when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and is comparable to the original K-means SDP with substantially reduced runtime.

Cite this Paper


BibTeX
@InProceedings{pmlr-v151-zhuang22a, title = { Sketch-and-lift: scalable subsampled semidefinite program for K-means clustering }, author = {Zhuang, Yubo and Chen, Xiaohui and Yang, Yun}, booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics}, pages = {9214--9246}, year = {2022}, editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel}, volume = {151}, series = {Proceedings of Machine Learning Research}, month = {28--30 Mar}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v151/zhuang22a/zhuang22a.pdf}, url = {https://proceedings.mlr.press/v151/zhuang22a.html}, abstract = { Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite the high accuracy, semidefinite programs are often too slow in practice with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear time complexity algorithm for approximating an SDP relaxed K-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. It is shown that the SL approach enjoys a similar exact recovery threshold as the K-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive with enhanced theoretic properties when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and is comparable to the original K-means SDP with substantially reduced runtime. } }
Endnote
%0 Conference Paper %T Sketch-and-lift: scalable subsampled semidefinite program for K-means clustering %A Yubo Zhuang %A Xiaohui Chen %A Yun Yang %B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2022 %E Gustau Camps-Valls %E Francisco J. R. Ruiz %E Isabel Valera %F pmlr-v151-zhuang22a %I PMLR %P 9214--9246 %U https://proceedings.mlr.press/v151/zhuang22a.html %V 151 %X Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite the high accuracy, semidefinite programs are often too slow in practice with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear time complexity algorithm for approximating an SDP relaxed K-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. It is shown that the SL approach enjoys a similar exact recovery threshold as the K-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive with enhanced theoretic properties when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and is comparable to the original K-means SDP with substantially reduced runtime.
APA
Zhuang, Y., Chen, X. & Yang, Y.. (2022). Sketch-and-lift: scalable subsampled semidefinite program for K-means clustering . Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:9214-9246 Available from https://proceedings.mlr.press/v151/zhuang22a.html.

Related Material