The Hidden Hubs Problem
Proceedings of the 2017 Conference on Learning Theory, PMLR 65:1190-1213, 2017.
Abstract
We introduce the following *hidden hubs model* H(n, k, σ0, σ1): the input is an n×n random matrix A with a subset S of k special rows (hubs). Entries in rows outside S are drawn from the Gaussian distribution p0 = N(0, σ0^2), while for each row in S, an unknown subset of k of its entries are drawn from p1 = N(0, σ1^2), σ1 > σ0, and the remaining entries from p0. The special rows with higher-variance entries can be viewed as hidden higher-degree hubs. The problem we address is to identify the hubs efficiently. The planted Gaussian submatrix model is the special case where the higher-variance entries must all lie in a k×k submatrix. If k ≥ c√(n ln n), the row sums alone suffice to find S in the general model. For the Gaussian submatrix problem (and the related planted clique problem), this can be improved by a √(ln n) factor to k ≥ c√n by spectral or combinatorial methods. We give a polynomial-time algorithm that identifies all the hidden hubs with high probability for k ≥ n^(1/2−δ) for some δ > 0, when σ1^2 > 2σ0^2. The algorithm extends to the setting where the planted entries may have different variances, each at least σ1^2. We also show a nearly matching lower bound: for σ1^2 ≤ 2σ0^2, there is no polynomial-time statistical query algorithm for distinguishing between a matrix whose entries are all drawn from N(0, σ0^2) and a matrix with k = n^(1/2−δ) hidden hubs, for any δ > 0. Both the lower bound and the algorithm are related to whether the chi-squared distance between the two distributions diverges. At the critical value σ1^2 = 2σ0^2, we show that the hidden hubs problem can be solved for k ≥ c√n (ln n)^(1/4), improving on the naive row-sum method.
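The model and the baseline row-sum test can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: it samples from H(n, k, σ0, σ1) and scores each row by the sum of its squared entries (one natural instantiation of a "row sum" statistic for mean-zero Gaussians, since hub rows exceed non-hub rows by about k(σ1^2 − σ0^2) in expectation versus O(√n) fluctuations). The specific parameter values are illustrative choices in the k ≥ c√(n ln n) regime where this naive test succeeds.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_hubs(n, k, sigma0, sigma1, rng):
    # Sample from H(n, k, sigma0, sigma1): k hub rows each receive k
    # higher-variance entries at unknown positions; all other entries
    # are N(0, sigma0^2).
    A = rng.normal(0.0, sigma0, size=(n, n))
    hubs = rng.choice(n, size=k, replace=False)
    for i in hubs:
        cols = rng.choice(n, size=k, replace=False)  # per-row planted support
        A[i, cols] = rng.normal(0.0, sigma1, size=k)
    return A, set(hubs)

# Illustrative parameters: k well above sqrt(n ln n), so the naive
# test should separate hub rows from the rest.
n, k, s0, s1 = 2000, 400, 1.0, 1.5
A, hubs = hidden_hubs(n, k, s0, s1, rng)

# Score each row by the sum of squared entries; take the top k as hubs.
T = (A ** 2).sum(axis=1)
guess = set(np.argsort(T)[-k:])
recovered = len(guess & hubs) / k
print(f"fraction of hubs recovered: {recovered:.2f}")
```

With these parameters the expected gap k(σ1^2 − σ0^2) = 500 dwarfs the roughly √(2n)·σ0^2 ≈ 63 standard deviation of a non-hub row's score, so recovery is essentially perfect; shrinking k toward √n is exactly where this statistic stops working and the paper's algorithm is needed.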