The Hidden Hubs Problem

Ravindran Kannan; Santosh Vempala

The Hidden Hubs Problem

Ravindran Kannan, Santosh Vempala

Proceedings of the 2017 Conference on Learning Theory, PMLR 65:1190-1213, 2017.

Abstract

We introduce the following \em hidden hubs model $H(n,k,\sigma_0, \sigma_1)$: the input is an $n \times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs); entries in rows outside $S$ are generated from the Gaussian distribution $p_0 = N(0,\sigma_0^2)$, while for each row in $S$, an unknown subset of $k$ of its entries are generated from $p_1 = N(0,\sigma_1^2)$, $\sigma_1>\sigma_0$, and the rest of the entries from $p_0$. The special rows with higher variance entries can be viewed as hidden higher-degree hubs. The problem we address is to identify the hubs efficiently. The planted Gaussian Submatrix Model is the special case where the higher variance entries must all lie in a $k \times k$ submatrix. If $k≥c\sqrt{n}\ln n$, just the row sums are sufficient to find $S$ in the general model. For the Gaussian submatrix problem (and the related planted clique problem), this can be improved by a $\sqrt\ln n$ factor to $k \ge c\sqrt{n}$ by spectral or combinatorial methods. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for $k \ge n^0.5-δ$ for some $δ>0$, when $\sigma_1^2>2\sigma_0^2$. The algorithm extends to the setting where planted entries might have different variances, each at least $\sigma_1^2$. We also show a nearly matching lower bound: for $\sigma_1^2 \le 2\sigma_0^2$, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,\sigma_0^2)$ and a matrix with $k=n^0.5-δ$ hidden hubs for any $δ>0$. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value $\sigma_1^2=2\sigma_0^2$, we show that the hidden hubs problem can be solved for $k≥c\sqrt n(\ln n)^1/4$, improving on the naive row sum-based method.

Cite this Paper

BibTeX


@InProceedings{pmlr-v65-kannan17a,
  title = 	 {The Hidden Hubs Problem},
  author = 	 {Kannan, Ravindran and Vempala, Santosh},
  booktitle = 	 {Proceedings of the 2017 Conference on Learning Theory},
  pages = 	 {1190--1213},
  year = 	 {2017},
  editor = 	 {Kale, Satyen and Shamir, Ohad},
  volume = 	 {65},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {07--10 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v65/kannan17a/kannan17a.pdf},
  url = 	 {https://proceedings.mlr.press/v65/kannan17a.html},
  abstract = 	 {We introduce the following \em hidden hubs model $H(n,k,\sigma_0, \sigma_1)$: the input is an $n \times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs); entries in rows outside $S$ are generated from the Gaussian distribution $p_0 = N(0,\sigma_0^2)$, while for each row in $S$, an unknown subset of $k$ of its entries are generated from $p_1 = N(0,\sigma_1^2)$, $\sigma_1>\sigma_0$, and the rest of the entries from $p_0$. The special rows with higher variance entries can be viewed as hidden higher-degree hubs. The problem we address is to identify the hubs efficiently. The planted Gaussian Submatrix Model is the special case where the higher variance entries must all lie in a $k \times k$ submatrix. If $k≥c\sqrt{n}\ln n$, just the row sums are sufficient to find $S$ in the general model. For the Gaussian submatrix problem (and the related planted clique problem), this can be improved by a $\sqrt\ln n$ factor to $k \ge c\sqrt{n}$ by spectral or combinatorial methods. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for $k \ge n^0.5-δ$ for some $δ>0$, when $\sigma_1^2>2\sigma_0^2$.  The algorithm extends to the setting where planted entries might have different variances, each at least $\sigma_1^2$. We also show a nearly matching lower bound: for $\sigma_1^2 \le 2\sigma_0^2$, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,\sigma_0^2)$ and a matrix with $k=n^0.5-δ$ hidden hubs for any $δ>0$. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value $\sigma_1^2=2\sigma_0^2$, we show that the hidden hubs problem can be solved for $k≥c\sqrt n(\ln n)^1/4$, improving on the naive row sum-based method.}
}

Endnote

%0 Conference Paper
%T The Hidden Hubs Problem
%A Ravindran Kannan
%A Santosh Vempala
%B Proceedings of the 2017 Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2017
%E Satyen Kale
%E Ohad Shamir	
%F pmlr-v65-kannan17a
%I PMLR
%P 1190--1213
%U https://proceedings.mlr.press/v65/kannan17a.html
%V 65
%X We introduce the following \em hidden hubs model $H(n,k,\sigma_0, \sigma_1)$: the input is an $n \times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs); entries in rows outside $S$ are generated from the Gaussian distribution $p_0 = N(0,\sigma_0^2)$, while for each row in $S$, an unknown subset of $k$ of its entries are generated from $p_1 = N(0,\sigma_1^2)$, $\sigma_1>\sigma_0$, and the rest of the entries from $p_0$. The special rows with higher variance entries can be viewed as hidden higher-degree hubs. The problem we address is to identify the hubs efficiently. The planted Gaussian Submatrix Model is the special case where the higher variance entries must all lie in a $k \times k$ submatrix. If $k≥c\sqrt{n}\ln n$, just the row sums are sufficient to find $S$ in the general model. For the Gaussian submatrix problem (and the related planted clique problem), this can be improved by a $\sqrt\ln n$ factor to $k \ge c\sqrt{n}$ by spectral or combinatorial methods. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for $k \ge n^0.5-δ$ for some $δ>0$, when $\sigma_1^2>2\sigma_0^2$.  The algorithm extends to the setting where planted entries might have different variances, each at least $\sigma_1^2$. We also show a nearly matching lower bound: for $\sigma_1^2 \le 2\sigma_0^2$, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,\sigma_0^2)$ and a matrix with $k=n^0.5-δ$ hidden hubs for any $δ>0$. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value $\sigma_1^2=2\sigma_0^2$, we show that the hidden hubs problem can be solved for $k≥c\sqrt n(\ln n)^1/4$, improving on the naive row sum-based method.

APA


Kannan, R. & Vempala, S.. (2017). The Hidden Hubs Problem. Proceedings of the 2017 Conference on Learning Theory, in Proceedings of Machine Learning Research 65:1190-1213 Available from https://proceedings.mlr.press/v65/kannan17a.html.

Related Material

Download PDF