Towards Clustering High-dimensional Gaussian Mixture Clouds in Linear Running Time
Proceedings of Machine Learning Research, PMLR 89:1379-1387, 2019.
Abstract
Clustering mixtures of Gaussian distributions is a fundamental and challenging problem. State-of-the-art theoretical work on learning Gaussian mixture models has mostly focused on estimating the mixture parameters, with clustering obtained as a byproduct. These methods have concentrated on improving separation bounds for different mixture classes, and on doing so in polynomial time and sample complexity. Less emphasis has been given to adapting these algorithms to the challenges of big data. In this paper, we focus on clustering $n$ samples from an arbitrary mixture of $c$-separated Gaussians in $\mathbb{R}^p$ in time that is linear in $p$ and $n$, with sample complexity independent of $p$. Our analysis suggests that for sufficiently separated Gaussians, after $o(\log{p})$ random projections a good direction is found that yields a small clustering error. Specifically, for a user-specified error $e$, the expected number of such projections is small and bounded by $o(\ln p)$ when $\gamma\leq c\sqrt{\ln{\ln{p}}}$, where $\gamma=Q^{-1}(e)$ is the separation of the Gaussians and $Q$ is the tail distribution function of the normal distribution. Consequently, the expected overall running time of the algorithm is linear in $n$ and quasilinear in $p$ at $o(\ln{p})\cdot O(np)$, and the sample complexity is independent of $p$. Unlike methods based on $k$-means, our analysis is applicable to any mixture class (spherical or non-spherical). Finally, an extension to $k>2$ components is also provided.
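The idea described above, projecting the data onto random one-dimensional directions and keeping a direction along which the two components separate well, can be sketched as follows. This is an illustrative simplification, not the paper's exact algorithm: the scoring rule (normalized gap between the two halves of a median split) and the function name `random_projection_cluster` are assumptions made for the example.

```python
import numpy as np

def random_projection_cluster(X, n_projections=10, seed=0):
    """Cluster a 2-component mixture by trying random 1-D projections.

    Illustrative sketch: project onto random unit directions, split each
    projection at its median, and keep the direction whose two halves are
    most separated relative to the projection's spread. Each projection
    costs O(np), so trying a small number of directions keeps the overall
    running time linear in n and p.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_labels, best_score = None, -np.inf
    for _ in range(n_projections):
        v = rng.standard_normal(p)
        v /= np.linalg.norm(v)          # random unit direction in R^p
        z = X @ v                       # 1-D projection, O(np)
        labels = (z > np.median(z)).astype(int)
        m0, m1 = z[labels == 0].mean(), z[labels == 1].mean()
        score = abs(m1 - m0) / (z.std() + 1e-12)  # separation of the split
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels
```

On well-separated mixtures, a random direction retains a noticeable fraction of the separation between the component means with constant probability, which is why only a few projections are typically needed before a good split appears.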