Scalable Simple Random Sampling and Stratified Sampling

Xiangrui Meng
Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):531-539, 2013.

Abstract

Analyzing data sets of billions of records has now become a regular task in many companies and institutions. In the statistical analysis of those massive data sets, sampling generally plays a very important role. In this work, we describe a scalable simple random sampling algorithm, named ScaSRS, which uses probabilistic thresholds to decide on the fly whether to accept, reject, or wait-list an item independently of others. We prove, with high probability, it succeeds and needs only O(\sqrtk) storage, where k is the sample size. ScaSRS extends naturally to a scalable stratified sampling algorithm, which is favorable for heterogeneous data sets. The proposed algorithms, when implemented in MapReduce, can effectively reduce the size of intermediate output and greatly improve load balancing. Empirical evaluation on large-scale data sets clearly demonstrates their superiority.

Cite this Paper


BibTeX
@InProceedings{pmlr-v28-meng13a, title = {Scalable Simple Random Sampling and Stratified Sampling}, author = {Meng, Xiangrui}, booktitle = {Proceedings of the 30th International Conference on Machine Learning}, pages = {531--539}, year = {2013}, editor = {Dasgupta, Sanjoy and McAllester, David}, volume = {28}, number = {3}, series = {Proceedings of Machine Learning Research}, address = {Atlanta, Georgia, USA}, month = {17--19 Jun}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v28/meng13a.pdf}, url = {https://proceedings.mlr.press/v28/meng13a.html}, abstract = {Analyzing data sets of billions of records has now become a regular task in many companies and institutions. In the statistical analysis of those massive data sets, sampling generally plays a very important role. In this work, we describe a scalable simple random sampling algorithm, named ScaSRS, which uses probabilistic thresholds to decide on the fly whether to accept, reject, or wait-list an item independently of others. We prove, with high probability, it succeeds and needs only O(\sqrtk) storage, where k is the sample size. ScaSRS extends naturally to a scalable stratified sampling algorithm, which is favorable for heterogeneous data sets. The proposed algorithms, when implemented in MapReduce, can effectively reduce the size of intermediate output and greatly improve load balancing. Empirical evaluation on large-scale data sets clearly demonstrates their superiority.} }
Endnote
%0 Conference Paper %T Scalable Simple Random Sampling and Stratified Sampling %A Xiangrui Meng %B Proceedings of the 30th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2013 %E Sanjoy Dasgupta %E David McAllester %F pmlr-v28-meng13a %I PMLR %P 531--539 %U https://proceedings.mlr.press/v28/meng13a.html %V 28 %N 3 %X Analyzing data sets of billions of records has now become a regular task in many companies and institutions. In the statistical analysis of those massive data sets, sampling generally plays a very important role. In this work, we describe a scalable simple random sampling algorithm, named ScaSRS, which uses probabilistic thresholds to decide on the fly whether to accept, reject, or wait-list an item independently of others. We prove, with high probability, it succeeds and needs only O(\sqrtk) storage, where k is the sample size. ScaSRS extends naturally to a scalable stratified sampling algorithm, which is favorable for heterogeneous data sets. The proposed algorithms, when implemented in MapReduce, can effectively reduce the size of intermediate output and greatly improve load balancing. Empirical evaluation on large-scale data sets clearly demonstrates their superiority.
RIS
TY - CPAPER TI - Scalable Simple Random Sampling and Stratified Sampling AU - Xiangrui Meng BT - Proceedings of the 30th International Conference on Machine Learning DA - 2013/05/26 ED - Sanjoy Dasgupta ED - David McAllester ID - pmlr-v28-meng13a PB - PMLR DP - Proceedings of Machine Learning Research VL - 28 IS - 3 SP - 531 EP - 539 L1 - http://proceedings.mlr.press/v28/meng13a.pdf UR - https://proceedings.mlr.press/v28/meng13a.html AB - Analyzing data sets of billions of records has now become a regular task in many companies and institutions. In the statistical analysis of those massive data sets, sampling generally plays a very important role. In this work, we describe a scalable simple random sampling algorithm, named ScaSRS, which uses probabilistic thresholds to decide on the fly whether to accept, reject, or wait-list an item independently of others. We prove, with high probability, it succeeds and needs only O(\sqrtk) storage, where k is the sample size. ScaSRS extends naturally to a scalable stratified sampling algorithm, which is favorable for heterogeneous data sets. The proposed algorithms, when implemented in MapReduce, can effectively reduce the size of intermediate output and greatly improve load balancing. Empirical evaluation on large-scale data sets clearly demonstrates their superiority. ER -
APA
Meng, X.. (2013). Scalable Simple Random Sampling and Stratified Sampling. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research 28(3):531-539 Available from https://proceedings.mlr.press/v28/meng13a.html.

Related Material