A Faster Sampling Algorithm for Spherical $k$-means

Rameshwar Pratap; Anup Deshmukh; Pratheeksha Nair; Tarun Dutt

A Faster Sampling Algorithm for Spherical $k$-means

Rameshwar Pratap, Anup Deshmukh, Pratheeksha Nair, Tarun Dutt

Proceedings of The 10th Asian Conference on Machine Learning, PMLR 95:343-358, 2018.

Abstract

The Spherical $k$-means algorithm proposed by (Dhillon and Modha, 2001) is a popular algorithm for clustering high dimensional datasets. Although their algorithm is simple and easy to implement, a drawback of the same is that it doesn’t provide any provable guarantee on the clustering result. (Endo and Miyamoto, 2015) suggest an adaptive sampling based algorithm (Spherical $k$-means$++$) which gives near optimal results, with high probability. However, their algorithm requires $k$ sequential passes over the entire dataset, which may not be feasible when the dataset and/or the values of $k$ are large. In this work, we propose a Markov chain based sampling algorithm that takes only one pass over the data, and gives close to optimal clustering similar to Spherical $k$-means$++$, i.e., a faster algorithm while maintaining almost the same approximation. We present a theoretical analysis of the algorithm, and complement it with rigorous experiments on real-world datasets. Our proposed algorithm is simple and easy to implement, and can be easily adopted in practice.

Cite this Paper

BibTeX


@InProceedings{pmlr-v95-pratap18a,
  title = 	 {A Faster Sampling Algorithm for Spherical $k$-means},
  author =       {Pratap, Rameshwar and Deshmukh, Anup and Nair, Pratheeksha and Dutt, Tarun},
  booktitle = 	 {Proceedings of The 10th Asian Conference on Machine Learning},
  pages = 	 {343--358},
  year = 	 {2018},
  editor = 	 {Zhu, Jun and Takeuchi, Ichiro},
  volume = 	 {95},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {14--16 Nov},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v95/pratap18a/pratap18a.pdf},
  url = 	 {https://proceedings.mlr.press/v95/pratap18a.html},
  abstract = 	 {The Spherical $k$-means algorithm proposed by (Dhillon and Modha, 2001) is a popular algorithm for clustering high dimensional  datasets. Although their algorithm is simple and easy to implement, a drawback of the same is that it doesn’t provide any provable  guarantee on the   clustering result. (Endo and Miyamoto, 2015) suggest an adaptive sampling based algorithm (Spherical $k$-means$++$) which gives near optimal results, with high probability. However, their algorithm requires $k$ sequential  passes over the entire dataset, which may not be feasible when the dataset and/or  the values of $k$ are large. In this work, we propose a Markov chain based sampling algorithm that takes only one pass over the data, and gives close to optimal clustering similar to  Spherical $k$-means$++$, i.e., a faster algorithm while maintaining almost the same approximation. We present a theoretical analysis of the algorithm, and complement it with rigorous experiments on   real-world datasets. Our proposed algorithm is simple and easy to implement, and can be easily adopted in practice.}
}

Endnote

%0 Conference Paper
%T A Faster Sampling Algorithm for Spherical $k$-means
%A Rameshwar Pratap
%A Anup Deshmukh
%A Pratheeksha Nair
%A Tarun Dutt
%B Proceedings of The 10th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jun Zhu
%E Ichiro Takeuchi	
%F pmlr-v95-pratap18a
%I PMLR
%P 343--358
%U https://proceedings.mlr.press/v95/pratap18a.html
%V 95
%X The Spherical $k$-means algorithm proposed by (Dhillon and Modha, 2001) is a popular algorithm for clustering high dimensional  datasets. Although their algorithm is simple and easy to implement, a drawback of the same is that it doesn’t provide any provable  guarantee on the   clustering result. (Endo and Miyamoto, 2015) suggest an adaptive sampling based algorithm (Spherical $k$-means$++$) which gives near optimal results, with high probability. However, their algorithm requires $k$ sequential  passes over the entire dataset, which may not be feasible when the dataset and/or  the values of $k$ are large. In this work, we propose a Markov chain based sampling algorithm that takes only one pass over the data, and gives close to optimal clustering similar to  Spherical $k$-means$++$, i.e., a faster algorithm while maintaining almost the same approximation. We present a theoretical analysis of the algorithm, and complement it with rigorous experiments on   real-world datasets. Our proposed algorithm is simple and easy to implement, and can be easily adopted in practice.

APA


Pratap, R., Deshmukh, A., Nair, P. & Dutt, T.. (2018). A Faster Sampling Algorithm for Spherical $k$-means. Proceedings of The 10th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 95:343-358 Available from https://proceedings.mlr.press/v95/pratap18a.html.

Related Material

Download PDF