The Hidden Hubs Problem
[edit]
Proceedings of the 2017 Conference on Learning Theory, PMLR 65:11901213, 2017.
Abstract
We introduce the following \em hidden hubs model $H(n,k,\sigma_0, \sigma_1)$: the input is an $n \times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs); entries in rows outside $S$ are generated from the Gaussian distribution $p_0 = N(0,\sigma_0^2)$, while for each row in $S$, an unknown subset of $k$ of its entries are generated from $p_1 = N(0,\sigma_1^2)$, $\sigma_1>\sigma_0$, and the rest of the entries from $p_0$. The special rows with higher variance entries can be viewed as hidden higherdegree hubs. The problem we address is to identify the hubs efficiently. The planted Gaussian Submatrix Model is the special case where the higher variance entries must all lie in a $k \times k$ submatrix. If $k≥c\sqrt{n}\ln n$, just the row sums are sufficient to find $S$ in the general model. For the Gaussian submatrix problem (and the related planted clique problem), this can be improved by a $\sqrt\ln n$ factor to $k \ge c\sqrt{n}$ by spectral or combinatorial methods. We give a polynomialtime algorithm to identify all the hidden hubs with high probability for $k \ge n^0.5δ$ for some $δ>0$, when $\sigma_1^2>2\sigma_0^2$. The algorithm extends to the setting where planted entries might have different variances, each at least $\sigma_1^2$. We also show a nearly matching lower bound: for $\sigma_1^2 \le 2\sigma_0^2$, there is no polynomialtime Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,\sigma_0^2)$ and a matrix with $k=n^0.5δ$ hidden hubs for any $δ>0$. The lower bound as well as the algorithm are related to whether the chisquared distance of the two distributions diverges. At the critical value $\sigma_1^2=2\sigma_0^2$, we show that the hidden hubs problem can be solved for $k≥c\sqrt n(\ln n)^1/4$, improving on the naive row sumbased method.
Related Material


