Estimating the unseen from multiple populations

Aditi Raghunathan, Gregory Valiant, James Zou
Proceedings of the 34th International Conference on Machine Learning, PMLR 70:2855-2863, 2017.

Abstract

Given samples from a distribution, how many new elements should we expect to find if we keep on sampling this distribution? This is an important and actively studied problem, with many applications ranging from species estimation to genomics. We generalize this extrapolation and related unseen estimation problems to the multiple population setting, where population $j$ has an unknown distribution $D_j$ from which we observe $n_j$ samples. We derive an optimal estimator for the total number of elements we expect to find among new samples across the populations. Surprisingly, we prove that our estimator’s accuracy is independent of the number of populations. We also develop an efficient optimization algorithm to solve the more general problem of estimating multi-population frequency distributions. We validate our methods and theory through extensive experiments. Finally, on a real dataset of human genomes across multiple ancestries, we demonstrate how our approach for unseen estimation can enable cohort designs that can discover interesting mutations with greater efficiency.
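
To make the extrapolation question concrete, here is a minimal sketch of the classical single-population Good-Toulmin estimator. This is not the estimator derived in the paper; it is the standard textbook baseline for the same question, and the function name `good_toulmin` and its parameters are chosen here for illustration only. Given $n$ observed samples and an extrapolation factor $t$ (so $t \cdot n$ further draws), it estimates the number of new distinct elements as $-\sum_{i \ge 1} (-t)^i \Phi_i$, where $\Phi_i$ is the number of elements seen exactly $i$ times. The estimate is generally considered reliable only for $t \le 1$.

```python
from collections import Counter


def good_toulmin(samples, t):
    """Classical Good-Toulmin estimate (illustrative, single population).

    Estimates how many previously unseen distinct elements would appear
    in t * len(samples) further draws from the same distribution.
    """
    # counts: element -> number of times it was observed
    counts = Counter(samples)
    # prevalence: i -> number of distinct elements observed exactly i times
    prevalence = Counter(counts.values())
    # Alternating-sign series: -sum_i (-t)^i * Phi_i
    return -sum((-t) ** i * phi for i, phi in prevalence.items())


# Toy data: "b" and "c" seen once, "a" twice, "d" three times.
observed = ["a", "a", "b", "c", "d", "d", "d"]
doubling_estimate = good_toulmin(observed, 1.0)   # as many new draws as old
half_estimate = good_toulmin(observed, 0.5)       # half as many new draws
```

On this toy input the prevalences are $\Phi_1 = 2$, $\Phi_2 = 1$, $\Phi_3 = 1$, so the estimate for $t = 1$ is $2 - 1 + 1 = 2$ new elements. The multi-population setting described in the abstract replaces these one-dimensional prevalences with counts indexed by how often an element appears in each population.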

Cite this Paper


BibTeX
@InProceedings{pmlr-v70-raghunathan17a,
  title = {Estimating the unseen from multiple populations},
  author = {Aditi Raghunathan and Gregory Valiant and James Zou},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  pages = {2855--2863},
  year = {2017},
  editor = {Precup, Doina and Teh, Yee Whye},
  volume = {70},
  series = {Proceedings of Machine Learning Research},
  month = {06--11 Aug},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v70/raghunathan17a/raghunathan17a.pdf},
  url = {https://proceedings.mlr.press/v70/raghunathan17a.html},
  abstract = {Given samples from a distribution, how many new elements should we expect to find if we keep on sampling this distribution? This is an important and actively studied problem, with many applications ranging from species estimation to genomics. We generalize this extrapolation and related unseen estimation problems to the multiple population setting, where population $j$ has an unknown distribution $D_j$ from which we observe $n_j$ samples. We derive an optimal estimator for the total number of elements we expect to find among new samples across the populations. Surprisingly, we prove that our estimator’s accuracy is independent of the number of populations. We also develop an efficient optimization algorithm to solve the more general problem of estimating multi-population frequency distributions. We validate our methods and theory through extensive experiments. Finally, on a real dataset of human genomes across multiple ancestries, we demonstrate how our approach for unseen estimation can enable cohort designs that can discover interesting mutations with greater efficiency.}
}
Endnote
%0 Conference Paper
%T Estimating the unseen from multiple populations
%A Aditi Raghunathan
%A Gregory Valiant
%A James Zou
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh
%F pmlr-v70-raghunathan17a
%I PMLR
%P 2855--2863
%U https://proceedings.mlr.press/v70/raghunathan17a.html
%V 70
%X Given samples from a distribution, how many new elements should we expect to find if we keep on sampling this distribution? This is an important and actively studied problem, with many applications ranging from species estimation to genomics. We generalize this extrapolation and related unseen estimation problems to the multiple population setting, where population $j$ has an unknown distribution $D_j$ from which we observe $n_j$ samples. We derive an optimal estimator for the total number of elements we expect to find among new samples across the populations. Surprisingly, we prove that our estimator’s accuracy is independent of the number of populations. We also develop an efficient optimization algorithm to solve the more general problem of estimating multi-population frequency distributions. We validate our methods and theory through extensive experiments. Finally, on a real dataset of human genomes across multiple ancestries, we demonstrate how our approach for unseen estimation can enable cohort designs that can discover interesting mutations with greater efficiency.
APA
Raghunathan, A., Valiant, G. & Zou, J. (2017). Estimating the unseen from multiple populations. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:2855-2863. Available from https://proceedings.mlr.press/v70/raghunathan17a.html.
