The Price of Tolerance in Distribution Testing
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:573-624, 2022.
Abstract
We revisit the problem of tolerant distribution testing. That is, given samples from an unknown distribution $p$ over $\{1, \dots, n\}$, is it $\varepsilon_1$-close to or $\varepsilon_2$-far from a reference distribution $q$ (in total variation distance)? Despite significant interest over the past decade, this problem is well understood only in the extreme cases. In the noiseless setting (i.e., $\varepsilon_1 = 0$) the sample complexity is $\Theta(\sqrt{n})$, strongly sublinear in the domain size. At the other end of the spectrum, when $\varepsilon_1 = \varepsilon_2/2$, the sample complexity jumps to the barely sublinear $\Theta(n/\log n)$. However, very little is known about the intermediate regime. We fully characterize the price of tolerance in distribution testing as a function of $n$, $\varepsilon_1$, $\varepsilon_2$, up to a single $\log n$ factor. Specifically, we show the sample complexity to be
$$\tilde{\Theta}\left(\frac{\sqrt{n}}{\varepsilon_2^2} + \frac{n}{\log n} \cdot \max\left\{\frac{\varepsilon_1}{\varepsilon_2^2}, \left(\frac{\varepsilon_1}{\varepsilon_2^2}\right)^{2}\right\}\right),$$
providing a smooth tradeoff between the two previously known cases. We also provide a similar characterization for the problem of tolerant equivalence testing, where both $p$ and $q$ are unknown. Surprisingly, in both cases, the main quantity dictating the sample complexity is the ratio $\varepsilon_1/\varepsilon_2^2$, and not the more intuitive $\varepsilon_1/\varepsilon_2$. Of particular technical interest is our lower bound framework, which involves novel approximation-theoretic tools required to handle the asymmetry between $\varepsilon_1$ and $\varepsilon_2$, a challenge absent from previous works.
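As a quick sanity check (not part of the abstract itself, just substituting the two endpoints and treating $\varepsilon_2$ as a constant), the stated bound does recover both previously known regimes:
$$\varepsilon_1 = 0:\quad \tilde{\Theta}\!\left(\frac{\sqrt{n}}{\varepsilon_2^2}\right) = \tilde{\Theta}(\sqrt{n}); \qquad \varepsilon_1 = \frac{\varepsilon_2}{2}:\quad \frac{\varepsilon_1}{\varepsilon_2^2} = \frac{1}{2\varepsilon_2}, \ \text{so}\ \frac{n}{\log n}\cdot\max\left\{\frac{1}{2\varepsilon_2}, \frac{1}{4\varepsilon_2^2}\right\} = \Theta\!\left(\frac{n}{\log n}\right).$$
In between, the $\max\{\varepsilon_1/\varepsilon_2^2, (\varepsilon_1/\varepsilon_2^2)^2\}$ term grows continuously with $\varepsilon_1$, which is the smooth tradeoff the abstract refers to.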