Probability–Revealing Samples

Krzysztof Onak, Xiaorui Sun
Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, PMLR 84:2018-2026, 2018.

Abstract

In the most popular distribution testing and parameter estimation model, one can obtain information about an underlying distribution D via independent samples from D. We introduce a model in which every sample comes with information about the probability of selecting it. In this setting, we give algorithms for problems such as testing whether two distributions are (approximately) identical, estimating the total variation distance between distributions, and estimating the support size. The sample complexity of all of our algorithms is optimal up to a constant factor for sufficiently large support size. The running times of our algorithms are near-linear in the number of samples collected. Additionally, our algorithms are robust to small multiplicative errors in the probability estimates. The complexity of our model lies strictly between that of the model where only independent samples are provided and that of the model where arbitrary probability queries are additionally allowed. Our model finds applications in settings where, once a given element is sampled, it becomes easier to estimate its probability. We describe two scenarios in which all occurrences of each element are easy to explore once at least one copy of the element is detected.
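
To make the model concrete, here is a minimal Python sketch (our illustration, not code from the paper): a probability-revealing oracle returns a sample x drawn from D together with D(x), and averaging 1/D(x) over such samples gives an unbiased estimate of the support size, since E[1/D(x)] = sum over x in supp(D) of D(x) * (1/D(x)) = |supp(D)|. The names reveal_sample and estimate_support_size are ours.

import random

def reveal_sample(dist):
    """Draw x ~ dist and reveal its probability dist[x].

    dist is a dict mapping elements to their probabilities; this simulates
    one probability-revealing sample from the underlying distribution.
    """
    elements = list(dist)
    weights = [dist[x] for x in elements]
    x = random.choices(elements, weights=weights)[0]
    return x, dist[x]  # the sample together with the probability of selecting it

def estimate_support_size(dist, num_samples):
    """Unbiased support-size estimate: E[1/D(x)] = |supp(D)| for x ~ D."""
    total = 0.0
    for _ in range(num_samples):
        _, p = reveal_sample(dist)
        total += 1.0 / p
    return total / num_samples

# Example: a distribution on 4 elements; the estimate concentrates around 4.
D = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
print(estimate_support_size(D, 10000))

Note that this naive estimator has large variance when some probabilities are tiny; it merely illustrates how revealed probabilities enable estimators that are unavailable from samples alone. The paper's actual algorithms additionally achieve optimal sample complexity and tolerate small multiplicative errors in the revealed probabilities.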

Cite this Paper


BibTeX
@InProceedings{pmlr-v84-onak18a,
  title     = {Probability–Revealing Samples},
  author    = {Onak, Krzysztof and Sun, Xiaorui},
  booktitle = {Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics},
  pages     = {2018--2026},
  year      = {2018},
  editor    = {Storkey, Amos and Perez-Cruz, Fernando},
  volume    = {84},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--11 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v84/onak18a/onak18a.pdf},
  url       = {https://proceedings.mlr.press/v84/onak18a.html}
}
Endnote
%0 Conference Paper
%T Probability–Revealing Samples
%A Krzysztof Onak
%A Xiaorui Sun
%B Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2018
%E Amos Storkey
%E Fernando Perez-Cruz
%F pmlr-v84-onak18a
%I PMLR
%P 2018--2026
%U https://proceedings.mlr.press/v84/onak18a.html
%V 84
APA
Onak, K. & Sun, X. (2018). Probability–Revealing Samples. Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 84:2018-2026. Available from https://proceedings.mlr.press/v84/onak18a.html.
