The Hidden Hubs Problem

Ravindran Kannan, Santosh Vempala
Proceedings of the 2017 Conference on Learning Theory, PMLR 65:1190-1213, 2017.

Abstract

We introduce the following \em hidden hubs model $H(n,k,\sigma_0, \sigma_1)$: the input is an $n \times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs); entries in rows outside $S$ are generated from the Gaussian distribution $p_0 = N(0,\sigma_0^2)$, while for each row in $S$, an unknown subset of $k$ of its entries are generated from $p_1 = N(0,\sigma_1^2)$, $\sigma_1>\sigma_0$, and the rest of the entries from $p_0$. The special rows with higher variance entries can be viewed as hidden higher-degree hubs. The problem we address is to identify the hubs efficiently. The planted Gaussian Submatrix Model is the special case where the higher variance entries must all lie in a $k \times k$ submatrix. If $k≥c\sqrt{n}\ln n$, just the row sums are sufficient to find $S$ in the general model. For the Gaussian submatrix problem (and the related planted clique problem), this can be improved by a $\sqrt\ln n$ factor to $k \ge c\sqrt{n}$ by spectral or combinatorial methods. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for $k \ge n^0.5-δ$ for some $δ>0$, when $\sigma_1^2>2\sigma_0^2$. The algorithm extends to the setting where planted entries might have different variances, each at least $\sigma_1^2$. We also show a nearly matching lower bound: for $\sigma_1^2 \le 2\sigma_0^2$, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,\sigma_0^2)$ and a matrix with $k=n^0.5-δ$ hidden hubs for any $δ>0$. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value $\sigma_1^2=2\sigma_0^2$, we show that the hidden hubs problem can be solved for $k≥c\sqrt n(\ln n)^1/4$, improving on the naive row sum-based method.

Cite this Paper


BibTeX
@InProceedings{pmlr-v65-kannan17a, title = {The Hidden Hubs Problem}, author = {Kannan, Ravindran and Vempala, Santosh}, booktitle = {Proceedings of the 2017 Conference on Learning Theory}, pages = {1190--1213}, year = {2017}, editor = {Kale, Satyen and Shamir, Ohad}, volume = {65}, series = {Proceedings of Machine Learning Research}, month = {07--10 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v65/kannan17a/kannan17a.pdf}, url = {https://proceedings.mlr.press/v65/kannan17a.html}, abstract = {We introduce the following \em hidden hubs model $H(n,k,\sigma_0, \sigma_1)$: the input is an $n \times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs); entries in rows outside $S$ are generated from the Gaussian distribution $p_0 = N(0,\sigma_0^2)$, while for each row in $S$, an unknown subset of $k$ of its entries are generated from $p_1 = N(0,\sigma_1^2)$, $\sigma_1>\sigma_0$, and the rest of the entries from $p_0$. The special rows with higher variance entries can be viewed as hidden higher-degree hubs. The problem we address is to identify the hubs efficiently. The planted Gaussian Submatrix Model is the special case where the higher variance entries must all lie in a $k \times k$ submatrix. If $k≥c\sqrt{n}\ln n$, just the row sums are sufficient to find $S$ in the general model. For the Gaussian submatrix problem (and the related planted clique problem), this can be improved by a $\sqrt\ln n$ factor to $k \ge c\sqrt{n}$ by spectral or combinatorial methods. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for $k \ge n^0.5-δ$ for some $δ>0$, when $\sigma_1^2>2\sigma_0^2$. The algorithm extends to the setting where planted entries might have different variances, each at least $\sigma_1^2$. We also show a nearly matching lower bound: for $\sigma_1^2 \le 2\sigma_0^2$, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,\sigma_0^2)$ and a matrix with $k=n^0.5-δ$ hidden hubs for any $δ>0$. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value $\sigma_1^2=2\sigma_0^2$, we show that the hidden hubs problem can be solved for $k≥c\sqrt n(\ln n)^1/4$, improving on the naive row sum-based method.} }
Endnote
%0 Conference Paper %T The Hidden Hubs Problem %A Ravindran Kannan %A Santosh Vempala %B Proceedings of the 2017 Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2017 %E Satyen Kale %E Ohad Shamir %F pmlr-v65-kannan17a %I PMLR %P 1190--1213 %U https://proceedings.mlr.press/v65/kannan17a.html %V 65 %X We introduce the following \em hidden hubs model $H(n,k,\sigma_0, \sigma_1)$: the input is an $n \times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs); entries in rows outside $S$ are generated from the Gaussian distribution $p_0 = N(0,\sigma_0^2)$, while for each row in $S$, an unknown subset of $k$ of its entries are generated from $p_1 = N(0,\sigma_1^2)$, $\sigma_1>\sigma_0$, and the rest of the entries from $p_0$. The special rows with higher variance entries can be viewed as hidden higher-degree hubs. The problem we address is to identify the hubs efficiently. The planted Gaussian Submatrix Model is the special case where the higher variance entries must all lie in a $k \times k$ submatrix. If $k≥c\sqrt{n}\ln n$, just the row sums are sufficient to find $S$ in the general model. For the Gaussian submatrix problem (and the related planted clique problem), this can be improved by a $\sqrt\ln n$ factor to $k \ge c\sqrt{n}$ by spectral or combinatorial methods. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for $k \ge n^0.5-δ$ for some $δ>0$, when $\sigma_1^2>2\sigma_0^2$. The algorithm extends to the setting where planted entries might have different variances, each at least $\sigma_1^2$. We also show a nearly matching lower bound: for $\sigma_1^2 \le 2\sigma_0^2$, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,\sigma_0^2)$ and a matrix with $k=n^0.5-δ$ hidden hubs for any $δ>0$. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value $\sigma_1^2=2\sigma_0^2$, we show that the hidden hubs problem can be solved for $k≥c\sqrt n(\ln n)^1/4$, improving on the naive row sum-based method.
APA
Kannan, R. & Vempala, S.. (2017). The Hidden Hubs Problem. Proceedings of the 2017 Conference on Learning Theory, in Proceedings of Machine Learning Research 65:1190-1213 Available from https://proceedings.mlr.press/v65/kannan17a.html.

Related Material