[edit]
Towards Clustering High-dimensional Gaussian Mixture Clouds in Linear Running Time
Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:1379-1387, 2019.
Abstract
Clustering mixtures of Gaussian distributions is a fundamental and challenging problem. State-of-the-art theoretical work on learning Gaussian mixture models has mostly focused on estimating the mixture parameters, where clustering is given as a byproduct. These methods have focused mostly on improving separation bounds for different mixture classes, and doing so in polynomial time and sample complexity. Less emphasis has been given to aligning these algorithms to the challenges of big data. In this paper, we focus on clustering $n$ samples from an arbitrary mixture of $c$-separated Gaussians in $\mathbb{R}^p$ in time that is linear in $p$ and $n$, and sample complexity that is independent of $p$. Our analysis suggests that for sufficiently separated Gaussians after $o(\log{p})$ random projections a good direction is found that yields a small clustering error. Specifically, for a user-specified error $e$, the expected number of such projections is small and bounded by $o(\ln p)$ when $\gamma\leq c\sqrt{\ln{\ln{p}}}$ and $\gamma=Q^{-1}(e)$ is the separation of the Gaussians with $Q$ as the tail distribution function of the normal distribution. Consequently, the expected overall running time of the algorithm is linear in $n$ and quasi-linear in $p$ at $o(\ln{p})O(np)$, and the sample complexity is independent of $p$. Unlike the methods that are based on $k$-means, our analysis is applicable to any mixture class (spherical or non-spherical). Finally, an extension to $k>2$ components is also provided.