A Faster Sampling Algorithm for Spherical $k$-means

Rameshwar Pratap, Anup Deshmukh, Pratheeksha Nair, Tarun Dutt
Proceedings of The 10th Asian Conference on Machine Learning, PMLR 95:343-358, 2018.

Abstract

The Spherical $k$-means algorithm proposed by (Dhillon and Modha, 2001) is a popular algorithm for clustering high dimensional datasets. Although their algorithm is simple and easy to implement, a drawback of the same is that it doesn’t provide any provable guarantee on the clustering result. (Endo and Miyamoto, 2015) suggest an adaptive sampling based algorithm (Spherical $k$-means$++$) which gives near optimal results, with high probability. However, their algorithm requires $k$ sequential passes over the entire dataset, which may not be feasible when the dataset and/or the values of $k$ are large. In this work, we propose a Markov chain based sampling algorithm that takes only one pass over the data, and gives close to optimal clustering similar to Spherical $k$-means$++$, i.e., a faster algorithm while maintaining almost the same approximation. We present a theoretical analysis of the algorithm, and complement it with rigorous experiments on real-world datasets. Our proposed algorithm is simple and easy to implement, and can be easily adopted in practice.

Cite this Paper


BibTeX
@InProceedings{pmlr-v95-pratap18a, title = {A Faster Sampling Algorithm for Spherical $k$-means}, author = {Pratap, Rameshwar and Deshmukh, Anup and Nair, Pratheeksha and Dutt, Tarun}, booktitle = {Proceedings of The 10th Asian Conference on Machine Learning}, pages = {343--358}, year = {2018}, editor = {Zhu, Jun and Takeuchi, Ichiro}, volume = {95}, series = {Proceedings of Machine Learning Research}, month = {14--16 Nov}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v95/pratap18a/pratap18a.pdf}, url = {https://proceedings.mlr.press/v95/pratap18a.html}, abstract = {The Spherical $k$-means algorithm proposed by (Dhillon and Modha, 2001) is a popular algorithm for clustering high dimensional datasets. Although their algorithm is simple and easy to implement, a drawback of the same is that it doesn’t provide any provable guarantee on the clustering result. (Endo and Miyamoto, 2015) suggest an adaptive sampling based algorithm (Spherical $k$-means$++$) which gives near optimal results, with high probability. However, their algorithm requires $k$ sequential passes over the entire dataset, which may not be feasible when the dataset and/or the values of $k$ are large. In this work, we propose a Markov chain based sampling algorithm that takes only one pass over the data, and gives close to optimal clustering similar to Spherical $k$-means$++$, i.e., a faster algorithm while maintaining almost the same approximation. We present a theoretical analysis of the algorithm, and complement it with rigorous experiments on real-world datasets. Our proposed algorithm is simple and easy to implement, and can be easily adopted in practice.} }
Endnote
%0 Conference Paper %T A Faster Sampling Algorithm for Spherical $k$-means %A Rameshwar Pratap %A Anup Deshmukh %A Pratheeksha Nair %A Tarun Dutt %B Proceedings of The 10th Asian Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2018 %E Jun Zhu %E Ichiro Takeuchi %F pmlr-v95-pratap18a %I PMLR %P 343--358 %U https://proceedings.mlr.press/v95/pratap18a.html %V 95 %X The Spherical $k$-means algorithm proposed by (Dhillon and Modha, 2001) is a popular algorithm for clustering high dimensional datasets. Although their algorithm is simple and easy to implement, a drawback of the same is that it doesn’t provide any provable guarantee on the clustering result. (Endo and Miyamoto, 2015) suggest an adaptive sampling based algorithm (Spherical $k$-means$++$) which gives near optimal results, with high probability. However, their algorithm requires $k$ sequential passes over the entire dataset, which may not be feasible when the dataset and/or the values of $k$ are large. In this work, we propose a Markov chain based sampling algorithm that takes only one pass over the data, and gives close to optimal clustering similar to Spherical $k$-means$++$, i.e., a faster algorithm while maintaining almost the same approximation. We present a theoretical analysis of the algorithm, and complement it with rigorous experiments on real-world datasets. Our proposed algorithm is simple and easy to implement, and can be easily adopted in practice.
APA
Pratap, R., Deshmukh, A., Nair, P. & Dutt, T.. (2018). A Faster Sampling Algorithm for Spherical $k$-means. Proceedings of The 10th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 95:343-358 Available from https://proceedings.mlr.press/v95/pratap18a.html.

Related Material