[edit]
A Random Matrix Analysis of Data Stream Clustering: Coping With Limited Memory Resources
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:12253-12281, 2022.
Abstract
This article introduces a random matrix framework for the analysis of clustering on high-dimensional data streams, a particularly relevant setting for a more sober processing of large amounts of data with limited memory and energy resources. Assuming data $\mathbf{x}_1, \mathbf{x}_2, \ldots$ arrives as a continuous flow and a small number $L$ of them can be kept in the learning pipeline, one has only access to the diagonal elements of the Gram kernel matrix: $\left[ \mathbf{K}_L \right]_{i, j} = \frac{1}{p} \mathbf{x}_i^\top \mathbf{x}_j \mathbf{1}_{\left\lvert i - j \right\rvert < L}$. Under a large-dimensional data regime, we derive the limiting spectral distribution of the banded kernel matrix $\mathbf{K}_L$ and study its isolated eigenvalues and eigenvectors, which behave in an unfamiliar way. We detail how these results can be used to perform efficient online kernel spectral clustering and provide theoretical performance guarantees. Our findings are empirically confirmed on image clustering tasks. Leveraging on optimality results of spectral methods for clustering, this work offers insights on efficient online clustering techniques for high-dimensional data.