Towards Clustering High-dimensional Gaussian Mixture Clouds in Linear Running Time

Dan Kushnir, Shirin Jalali, Iraj Saniee
Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:1379-1387, 2019.

Abstract

Clustering mixtures of Gaussian distributions is a fundamental and challenging problem. State-of-the-art theoretical work on learning Gaussian mixture models has mostly focused on estimating the mixture parameters, with clustering obtained as a byproduct. These methods have concentrated on improving separation bounds for different mixture classes, and on doing so in polynomial time and sample complexity; less emphasis has been given to aligning these algorithms with the challenges of big data. In this paper, we focus on clustering $n$ samples from an arbitrary mixture of $c$-separated Gaussians in $\mathbb{R}^p$ in time that is linear in $p$ and $n$, with sample complexity independent of $p$. Our analysis suggests that, for sufficiently separated Gaussians, a good direction yielding a small clustering error is found after $o(\log{p})$ random projections. Specifically, for a user-specified error $e$, the expected number of such projections is small and bounded by $o(\ln p)$ when $\gamma\leq c\sqrt{\ln{\ln{p}}}$, where $\gamma=Q^{-1}(e)$ is the separation of the Gaussians and $Q$ is the tail distribution function of the normal distribution. Consequently, the expected overall running time of the algorithm is linear in $n$ and quasi-linear in $p$ at $o(\ln{p})O(np)$, and the sample complexity is independent of $p$. Unlike methods based on $k$-means, our analysis applies to any mixture class (spherical or non-spherical). Finally, an extension to $k>2$ components is also provided.
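The paper's algorithm is not reproduced here, but the core idea described in the abstract (project onto a handful of random directions, cluster the resulting 1-D samples, and keep the best direction) can be sketched as follows. This is a minimal illustration under assumed parameters: a 2-component mixture of unit-variance spherical Gaussians whose means are $c\sqrt{p}$ apart (matching the $c$-separation scaling), a simple 1-D Lloyd-style split as the per-projection clustering step, and a projected-means gap as the direction-quality score. All names and values are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (values are not from the paper): n samples from a
# 2-component mixture of unit-variance spherical Gaussians in R^p whose
# means are c*sqrt(p) apart, matching the c-separation scaling.
p, n, c = 100, 1000, 2.0
means = np.zeros((2, p))
means[1, 0] = c * np.sqrt(p)
labels = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, p)) + means[labels]

def split_1d(z, iters=10):
    """Split projected samples at a threshold refined Lloyd-style in 1-D."""
    thresh = 0.5 * (z.min() + z.max())
    for _ in range(iters):
        left, right = z[z < thresh], z[z >= thresh]
        if len(left) == 0 or len(right) == 0:
            break
        thresh = 0.5 * (left.mean() + right.mean())
    return z >= thresh

# Try a small number of random unit directions (the paper bounds the
# expected number needed by o(ln p); this demo uses a few more) and keep
# the direction whose 1-D split has the widest gap between the two
# projected cluster means. Each trial costs O(np) for the projection.
best_pred, best_gap = None, -np.inf
for _ in range(2 * int(np.log(p))):
    d = rng.standard_normal(p)
    d /= np.linalg.norm(d)  # uniform random unit direction
    z = X @ d
    pred = split_1d(z)
    if pred.all() or not pred.any():
        continue  # degenerate split: all samples on one side
    gap = abs(z[pred].mean() - z[~pred].mean())
    if gap > best_gap:
        best_pred, best_gap = pred.astype(int), gap

# Clustering error, up to label permutation.
err = min(np.mean(best_pred != labels), np.mean(best_pred == labels))
```

A random unit direction preserves roughly a $c$-sized fraction of the $c\sqrt{p}$ mean separation after projection, which is why only a few trials are needed to find a direction with a wide 1-D gap; each trial costs $O(np)$, in line with the abstract's overall $o(\ln p)\,O(np)$ running time.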

Cite this Paper


BibTeX
@InProceedings{pmlr-v89-kushnir19a,
  title     = {Towards Clustering High-dimensional Gaussian Mixture Clouds in Linear Running Time},
  author    = {Kushnir, Dan and Jalali, Shirin and Saniee, Iraj},
  booktitle = {Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics},
  pages     = {1379--1387},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Sugiyama, Masashi},
  volume    = {89},
  series    = {Proceedings of Machine Learning Research},
  month     = {16--18 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v89/kushnir19a/kushnir19a.pdf},
  url       = {https://proceedings.mlr.press/v89/kushnir19a.html},
  abstract  = {Clustering mixtures of Gaussian distributions is a fundamental and challenging problem. State-of-the-art theoretical work on learning Gaussian mixture models has mostly focused on estimating the mixture parameters, where clustering is given as a byproduct. These methods have focused mostly on improving separation bounds for different mixture classes, and doing so in polynomial time and sample complexity. Less emphasis has been given to aligning these algorithms to the challenges of big data. In this paper, we focus on clustering $n$ samples from an arbitrary mixture of $c$-separated Gaussians in $\mathbb{R}^p$ in time that is linear in $p$ and $n$, and sample complexity that is independent of $p$. Our analysis suggests that for sufficiently separated Gaussians after $o(\log{p})$ random projections a good direction is found that yields a small clustering error. Specifically, for a user-specified error $e$, the expected number of such projections is small and bounded by $o(\ln p)$ when $\gamma\leq c\sqrt{\ln{\ln{p}}}$ and $\gamma=Q^{-1}(e)$ is the separation of the Gaussians with $Q$ as the tail distribution function of the normal distribution. Consequently, the expected overall running time of the algorithm is linear in $n$ and quasi-linear in $p$ at $o(\ln{p})O(np)$, and the sample complexity is independent of $p$. Unlike the methods that are based on $k$-means, our analysis is applicable to any mixture class (spherical or non-spherical). Finally, an extension to $k>2$ components is also provided.}
}
Endnote
%0 Conference Paper
%T Towards Clustering High-dimensional Gaussian Mixture Clouds in Linear Running Time
%A Dan Kushnir
%A Shirin Jalali
%A Iraj Saniee
%B Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Masashi Sugiyama
%F pmlr-v89-kushnir19a
%I PMLR
%P 1379--1387
%U https://proceedings.mlr.press/v89/kushnir19a.html
%V 89
%X Clustering mixtures of Gaussian distributions is a fundamental and challenging problem. State-of-the-art theoretical work on learning Gaussian mixture models has mostly focused on estimating the mixture parameters, where clustering is given as a byproduct. These methods have focused mostly on improving separation bounds for different mixture classes, and doing so in polynomial time and sample complexity. Less emphasis has been given to aligning these algorithms to the challenges of big data. In this paper, we focus on clustering $n$ samples from an arbitrary mixture of $c$-separated Gaussians in $\mathbb{R}^p$ in time that is linear in $p$ and $n$, and sample complexity that is independent of $p$. Our analysis suggests that for sufficiently separated Gaussians after $o(\log{p})$ random projections a good direction is found that yields a small clustering error. Specifically, for a user-specified error $e$, the expected number of such projections is small and bounded by $o(\ln p)$ when $\gamma\leq c\sqrt{\ln{\ln{p}}}$ and $\gamma=Q^{-1}(e)$ is the separation of the Gaussians with $Q$ as the tail distribution function of the normal distribution. Consequently, the expected overall running time of the algorithm is linear in $n$ and quasi-linear in $p$ at $o(\ln{p})O(np)$, and the sample complexity is independent of $p$. Unlike the methods that are based on $k$-means, our analysis is applicable to any mixture class (spherical or non-spherical). Finally, an extension to $k>2$ components is also provided.
APA
Kushnir, D., Jalali, S. & Saniee, I. (2019). Towards Clustering High-dimensional Gaussian Mixture Clouds in Linear Running Time. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 89:1379-1387. Available from https://proceedings.mlr.press/v89/kushnir19a.html.
