The Price of Tolerance in Distribution Testing
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:573-624, 2022.
Abstract
We revisit the problem of tolerant distribution testing. That is, given samples from an unknown distribution $p$ over $\{1, \dots, n\}$, is it $\varepsilon_1$-close to or $\varepsilon_2$-far from a reference distribution $q$ (in total variation distance)? Despite significant interest over the past decade, this problem is well understood only in the extreme cases. In the noiseless setting (i.e., $\varepsilon_1 = 0$) the sample complexity is $\Theta(\sqrt{n})$, strongly sublinear in the domain size. At the other end of the spectrum, when $\varepsilon_1 = \varepsilon_2/2$, the sample complexity jumps to the barely sublinear $\Theta(n/\log n)$. However, very little is known about the intermediate regime. We fully characterize the price of tolerance in distribution testing as a function of $n$, $\varepsilon_1$, $\varepsilon_2$, up to a single $\log n$ factor. Specifically, we show the sample complexity to be \[\tilde{\Theta}\left(\frac{\sqrt{n}}{\varepsilon_2^{2}} + \frac{n}{\log n} \cdot \max\left\{\frac{\varepsilon_1}{\varepsilon_2^{2}}, \left(\frac{\varepsilon_1}{\varepsilon_2^{2}}\right)^{2}\right\}\right),\] providing a smooth tradeoff between the two previously known cases. We also provide a similar characterization for the problem of tolerant equivalence testing, where both $p$ and $q$ are unknown. Surprisingly, in both cases, the main quantity dictating the sample complexity is the ratio $\varepsilon_1/\varepsilon_2^2$, and not the more intuitive $\varepsilon_1/\varepsilon_2$. Of particular technical interest is our lower bound framework, which involves novel approximation-theoretic tools required to handle the asymmetry between $\varepsilon_1$ and $\varepsilon_2$, a challenge absent from previous works.
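As a quick sanity check of how this expression interpolates between the two extremes (treating $\varepsilon_2$ as a constant in both specializations): setting $\varepsilon_1 = 0$ makes the maximum vanish, leaving \[\tilde{\Theta}\left(\frac{\sqrt{n}}{\varepsilon_2^{2}}\right),\] which matches the $\Theta(\sqrt{n})$ rate of the noiseless case; setting $\varepsilon_1 = \varepsilon_2/2$ makes the ratio $\varepsilon_1/\varepsilon_2^{2} = 1/(2\varepsilon_2)$ a constant, so the second term dominates and the bound becomes $\tilde{\Theta}(n/\log n)$, the barely sublinear rate.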