Sample complexity of population recovery

Yury Polyanskiy; Ananda Theertha Suresh; Yihong Wu

Proceedings of the 2017 Conference on Learning Theory, PMLR 65:1589-1618, 2017.

Abstract

The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size $n$ conducted on a population of individuals, where each pollee is asked to answer $d$ binary questions. We consider one of two polling impediments: in lossy population recovery, a pollee may skip each question with probability $\epsilon$; in noisy population recovery, a pollee may lie on each question with probability $\epsilon$. Given $n$ lossy or noisy samples, the goal is to estimate the probabilities of all $2^d$ binary vectors simultaneously within accuracy $\delta$ with high probability. This paper settles the sample complexity of population recovery. For the lossy model, the optimal sample complexity is $\tilde{\Theta}(\delta^{-2\max\{\frac{\epsilon}{1-\epsilon},1\}})$, improving the state of the art by Moitra and Saks in several ways: a lower bound is established, the upper bound is improved, and the result is dimension-free. Surprisingly, the sample complexity undergoes a phase transition from parametric to nonparametric rate when $\epsilon$ exceeds $1/2$. For noisy population recovery, the sharp sample complexity turns out to be dimension-dependent and scales as $\exp(\Theta(d^{1/3} \log^{2/3}(1/\delta)))$, except for the trivial cases of $\epsilon=0$, $1/2$, or $1$. For both models, our estimators simply compute the empirical mean of a certain function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam’s method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.
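The two observation models and the lossy rate described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's estimator (which relies on a pre-solved LP); it only simulates how a single pollee's answers might be corrupted and evaluates the exponent of $\delta$ in the stated lossy sample complexity, ignoring logarithmic factors. The function names and the `ERASED` marker are hypothetical, chosen here for illustration.

```python
import random

ERASED = None  # hypothetical marker for a skipped question


def lossy_sample(x, eps, rng):
    """Lossy model: each answer is independently skipped with probability eps."""
    return [ERASED if rng.random() < eps else b for b in x]


def noisy_sample(x, eps, rng):
    """Noisy model: each binary answer is independently flipped with probability eps."""
    return [b ^ 1 if rng.random() < eps else b for b in x]


def lossy_rate_exponent(eps):
    """Exponent a such that the lossy sample complexity scales as
    delta**(-a) up to logarithmic factors, per the abstract:
    a = 2 * max(eps / (1 - eps), 1)."""
    if not 0 <= eps < 1:
        raise ValueError("eps must lie in [0, 1)")
    return 2 * max(eps / (1 - eps), 1)


if __name__ == "__main__":
    rng = random.Random(0)
    answers = [1, 0, 1, 1, 0]  # one pollee's true answers to d = 5 questions
    print(lossy_sample(answers, 0.3, rng))
    print(noisy_sample(answers, 0.3, rng))
    # The exponent stays at 2 (parametric rate) until eps exceeds 1/2.
    for eps in (0.25, 0.5, 0.75):
        print(eps, lossy_rate_exponent(eps))
```

The exponent function makes the phase transition visible: it is constant for $\epsilon \le 1/2$ and grows as $2\epsilon/(1-\epsilon)$ beyond that point.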

Cite this Paper

BibTeX


@InProceedings{pmlr-v65-polyanskiy17a,
  title = 	 {Sample complexity of population recovery},
  author = 	 {Polyanskiy, Yury and Suresh, Ananda Theertha and Wu, Yihong},
  booktitle = 	 {Proceedings of the 2017 Conference on Learning Theory},
  pages = 	 {1589--1618},
  year = 	 {2017},
  editor = 	 {Kale, Satyen and Shamir, Ohad},
  volume = 	 {65},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {07--10 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v65/polyanskiy17a/polyanskiy17a.pdf},
  url = 	 {https://proceedings.mlr.press/v65/polyanskiy17a.html},
  abstract = 	 {The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size $n$ conducted on a population of individuals, where each pollee is asked to answer $d$ binary questions. We consider one of two polling impediments: \begin{itemize} \item in lossy population recovery, a pollee may skip each question with probability $\epsilon$; \item in noisy population recovery, a pollee may lie on each question with probability $\epsilon$. \end{itemize} Given $n$ lossy or noisy samples, the goal is to estimate the probabilities of all $2^d$ binary vectors simultaneously within accuracy $\delta$ with high probability. This paper settles the sample complexity of population recovery. For the lossy model, the optimal sample complexity is $\tilde{\Theta}(\delta^{-2\max\{\frac{\epsilon}{1-\epsilon},1\}})$, improving the state of the art by Moitra and Saks in several ways: a lower bound is established, the upper bound is improved, and the result is dimension-free. Surprisingly, the sample complexity undergoes a phase transition from parametric to nonparametric rate when $\epsilon$ exceeds $1/2$. For noisy population recovery, the sharp sample complexity turns out to be dimension-dependent and scales as $\exp(\Theta(d^{1/3} \log^{2/3}(1/\delta)))$, except for the trivial cases of $\epsilon=0$, $1/2$, or $1$. For both models, our estimators simply compute the empirical mean of a certain function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam’s method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.}
}

Endnote

%0 Conference Paper
%T Sample complexity of population recovery
%A Yury Polyanskiy
%A Ananda Theertha Suresh
%A Yihong Wu
%B Proceedings of the 2017 Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2017
%E Satyen Kale
%E Ohad Shamir
%F pmlr-v65-polyanskiy17a
%I PMLR
%P 1589--1618
%U https://proceedings.mlr.press/v65/polyanskiy17a.html
%V 65
%X The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size $n$ conducted on a population of individuals, where each pollee is asked to answer $d$ binary questions. We consider one of two polling impediments: \begin{itemize} \item in lossy population recovery, a pollee may skip each question with probability $\epsilon$; \item in noisy population recovery, a pollee may lie on each question with probability $\epsilon$. \end{itemize} Given $n$ lossy or noisy samples, the goal is to estimate the probabilities of all $2^d$ binary vectors simultaneously within accuracy $\delta$ with high probability. This paper settles the sample complexity of population recovery. For the lossy model, the optimal sample complexity is $\tilde{\Theta}(\delta^{-2\max\{\frac{\epsilon}{1-\epsilon},1\}})$, improving the state of the art by Moitra and Saks in several ways: a lower bound is established, the upper bound is improved, and the result is dimension-free. Surprisingly, the sample complexity undergoes a phase transition from parametric to nonparametric rate when $\epsilon$ exceeds $1/2$. For noisy population recovery, the sharp sample complexity turns out to be dimension-dependent and scales as $\exp(\Theta(d^{1/3} \log^{2/3}(1/\delta)))$, except for the trivial cases of $\epsilon=0$, $1/2$, or $1$. For both models, our estimators simply compute the empirical mean of a certain function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam’s method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.

APA


Polyanskiy, Y., Suresh, A.T. & Wu, Y. (2017). Sample complexity of population recovery. Proceedings of the 2017 Conference on Learning Theory, in Proceedings of Machine Learning Research 65:1589-1618. Available from https://proceedings.mlr.press/v65/polyanskiy17a.html.
