Sample complexity of population recovery
Proceedings of the 2017 Conference on Learning Theory, PMLR 65:1589-1618, 2017.
Abstract
The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size $n$ conducted on a population of individuals, where each pollee is asked to answer $d$ binary questions. We consider one of the two polling impediments:

- in lossy population recovery, a pollee may skip each question with probability $ε$;
- in noisy population recovery, a pollee may lie on each question with probability $ε$.

Given $n$ lossy or noisy samples, the goal is to estimate the probabilities of all $2^d$ binary vectors simultaneously within accuracy $δ$ with high probability. This paper settles the sample complexity of population recovery. For the lossy model, the optimal sample complexity is $\tilde{Θ}(δ^{-2\max\{\frac{ε}{1-ε},1\}})$, improving the state of the art by Moitra and Saks in several ways: a lower bound is established, the upper bound is improved, and the result is dimension-free. Surprisingly, the sample complexity undergoes a phase transition from the parametric to the nonparametric rate when $ε$ exceeds $1/2$. For noisy population recovery, the sharp sample complexity turns out to be dimension-dependent and scales as $\exp(Θ(d^{1/3} \log^{2/3}(1/δ)))$, except for the trivial cases of $ε=0$, $1/2$, or $1$. For both models, our estimators simply compute the empirical mean of a certain function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam's method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.
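To make the two corruption models concrete, here is a minimal simulation sketch (not from the paper; function names and the use of `None` to mark a skipped answer are illustrative choices). In the lossy model each coordinate is independently erased with probability $ε$; in the noisy model each coordinate is independently flipped with probability $ε$:

```python
import random

def lossy_sample(x, eps, rng):
    """Lossy model: each answer is independently skipped (erased) w.p. eps."""
    return tuple(None if rng.random() < eps else b for b in x)

def noisy_sample(x, eps, rng):
    """Noisy model: each answer is independently flipped (a lie) w.p. eps."""
    return tuple(b ^ 1 if rng.random() < eps else b for b in x)

rng = random.Random(0)
x = (1, 0, 1, 1)          # one pollee's true answers to d = 4 binary questions
print(lossy_sample(x, 0.3, rng))
print(noisy_sample(x, 0.3, rng))
```

An estimator of the kind described in the abstract would average a fixed function of such corrupted samples over all $n$ pollees; the paper's contribution is choosing that function via a linear program so that the empirical mean is minimax optimal.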