[edit]
Bessel Smoothing and Multi-Distribution Property Estimation
Proceedings of Thirty Third Conference on Learning Theory, PMLR 125:1817-1876, 2020.
Abstract
We consider a basic problem in statistical learning: estimating properties of multiple discrete distributions. Denoting by $\Delta_k$ the standard simplex over $[k]:=\{0,1,\ldots, k\}$, a property of $d$ distributions is a mapping from $\Delta_k^d$ to $\mathbb R$. These properties include well-known distribution characteristics such as Shannon entropy and support size ($d=1$), and many important divergence measures between distributions ($d=2$). The primary problem being considered is to learn the property value of an $\emph{unknown}$ $d$-tuple of distributions from its sample. The study of such problems dates back to the works of Efron and Thisted (1976b); Thisted and Efron (1987); Good (1953b); Carlton (1969), and has been pushed forward steadily during the past decades. Surprisingly, before our work, the general landscape of this fundamental learning problem was insufficiently understood, and nearly all the existing results are for the special case $d\le 2$. Our first main result provides a near-linear-time computable algorithm that, given independent samples from any collection of distributions and for a broad class of multi-distribution properties, learns the property as well as the empirical plug-in estimator that uses samples with logarithmic-factor larger sizes. As a corollary of this, for any $\varepsilon>0$ and fixed $d\in \mathbb Z^+$, a $d$-distribution property over $[k]$ that is Lipschitz and additively separable can be learned to an accuracy of $\varepsilon$ using a sample of size $\mathcal{O}(k/(\varepsilon^3\sqrt{\log k}))$, with high probability. Our second result addresses a closely related problem– tolerant independence testing: One receives samples from the unknown joint and marginal distributions, and attempts to infer the $\ell_1$ distance between the joint distribution and the product distribution of the marginals. We show that this testing problem also admits a sample complexity sub-linear in the alphabet sizes, demonstrating the broad applicability of our approach.