On the optimality of kernels for high-dimensional clustering
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:2185-2195, 2020.
Abstract
This paper studies the optimality of kernel methods in high-dimensional data clustering. Recent works have studied the large-sample performance of kernel clustering in the high-dimensional regime, where Euclidean distance becomes less informative. However, it is unknown whether popular methods, such as kernel k-means, are optimal in this regime. We consider the problem of high-dimensional Gaussian clustering and show that, with the exponential kernel function, the sufficient conditions for partial recovery of clusters using the NP-hard kernel k-means objective match the known information-theoretic limit up to a factor of $\sqrt{2}$. These conditions also exactly match the known upper bounds for the non-kernel setting. We also show that a semidefinite relaxation of the kernel k-means procedure matches, up to constant factors, the spectral threshold, below which no polynomial-time algorithm is known to succeed. This is the first work that provides such optimality guarantees for kernel k-means as well as its convex relaxation. Our proofs demonstrate the utility of the lesser-known polynomial concentration results for random variables with exponentially decaying tails in the higher-order analysis of kernel methods.
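As a rough illustration of the objects named in the abstract, the following is a minimal sketch of kernel k-means with an exponential kernel. It is not the procedure analysed in the paper: the paper's guarantees concern the global optimum of the NP-hard objective and a semidefinite relaxation, whereas this sketch uses a Lloyd-style alternating heuristic that may stop at a local optimum. The function names, the `bandwidth` parameter and the kernel scaling $\exp(-\|x - y\|^2 / \text{bandwidth})$ are assumptions made here for illustration.

```python
import numpy as np


def exponential_kernel(X, bandwidth):
    """Kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / bandwidth).

    The scaling by `bandwidth` is an illustrative assumption.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negative round-off
    return np.exp(-sq_dists / bandwidth)


def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Lloyd-style heuristic for the kernel k-means objective.

    Returns one integer label per row of K. This only finds a local
    optimum; it is a swapped-in heuristic, not an exact solver.
    """
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            size = mask.sum()
            if size == 0:
                continue  # an empty cluster stays at infinite distance
            # ||phi(x_i) - mu_c||^2 = K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl
            cross = K[:, mask].sum(axis=1) / size
            within = K[np.ix_(mask, mask)].sum() / size ** 2
            dist[:, c] = diag - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

A typical call would build `K = exponential_kernel(X, bandwidth=X.shape[1])` for an `n x d` data matrix `X` and then run `kernel_kmeans(K, n_clusters=2)`; choosing the bandwidth proportional to the dimension is again only an assumption of this sketch.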