Scalable Simple Random Sampling and Stratified Sampling

Xiangrui Meng

Scalable Simple Random Sampling and Stratified Sampling

Xiangrui Meng

Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):531-539, 2013.

Abstract

Analyzing data sets of billions of records has now become a regular task in many companies and institutions. In the statistical analysis of those massive data sets, sampling generally plays a very important role. In this work, we describe a scalable simple random sampling algorithm, named ScaSRS, which uses probabilistic thresholds to decide on the fly whether to accept, reject, or wait-list an item independently of others. We prove, with high probability, it succeeds and needs only O(\sqrtk) storage, where k is the sample size. ScaSRS extends naturally to a scalable stratified sampling algorithm, which is favorable for heterogeneous data sets. The proposed algorithms, when implemented in MapReduce, can effectively reduce the size of intermediate output and greatly improve load balancing. Empirical evaluation on large-scale data sets clearly demonstrates their superiority.

Cite this Paper

BibTeX


@InProceedings{pmlr-v28-meng13a,
  title = 	 {Scalable Simple Random Sampling and Stratified Sampling},
  author = 	 {Meng, Xiangrui},
  booktitle = 	 {Proceedings of the 30th International Conference on Machine Learning},
  pages = 	 {531--539},
  year = 	 {2013},
  editor = 	 {Dasgupta, Sanjoy and McAllester, David},
  volume = 	 {28},
  number =       {3},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Atlanta, Georgia, USA},
  month = 	 {17--19 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v28/meng13a.pdf},
  url = 	 {https://proceedings.mlr.press/v28/meng13a.html},
  abstract = 	 {Analyzing data sets of billions of records has now become a regular task in    many companies and institutions.    In the statistical analysis of those massive data sets, sampling generally    plays a very important role.    In this work, we describe a scalable simple random sampling algorithm, named    ScaSRS, which uses probabilistic thresholds to decide on the fly whether to    accept, reject, or wait-list an item independently of others.    We prove, with high probability, it succeeds and needs only O(\sqrtk)    storage, where k is the sample size.    ScaSRS extends naturally to a scalable stratified sampling algorithm, which is    favorable for heterogeneous data sets.    The proposed algorithms, when implemented in MapReduce, can effectively reduce    the size of intermediate output and greatly improve load balancing.    Empirical evaluation on large-scale data sets clearly demonstrates their    superiority.}
}

Endnote

%0 Conference Paper
%T Scalable Simple Random Sampling and Stratified Sampling
%A Xiangrui Meng
%B Proceedings of the 30th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Sanjoy Dasgupta
%E David McAllester	
%F pmlr-v28-meng13a
%I PMLR
%P 531--539
%U https://proceedings.mlr.press/v28/meng13a.html
%V 28
%N 3
%X Analyzing data sets of billions of records has now become a regular task in    many companies and institutions.    In the statistical analysis of those massive data sets, sampling generally    plays a very important role.    In this work, we describe a scalable simple random sampling algorithm, named    ScaSRS, which uses probabilistic thresholds to decide on the fly whether to    accept, reject, or wait-list an item independently of others.    We prove, with high probability, it succeeds and needs only O(\sqrtk)    storage, where k is the sample size.    ScaSRS extends naturally to a scalable stratified sampling algorithm, which is    favorable for heterogeneous data sets.    The proposed algorithms, when implemented in MapReduce, can effectively reduce    the size of intermediate output and greatly improve load balancing.    Empirical evaluation on large-scale data sets clearly demonstrates their    superiority.

RIS


TY  - CPAPER
TI  - Scalable Simple Random Sampling and Stratified Sampling
AU  - Xiangrui Meng
BT  - Proceedings of the 30th International Conference on Machine Learning
DA  - 2013/05/26
ED  - Sanjoy Dasgupta
ED  - David McAllester	
ID  - pmlr-v28-meng13a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 28
IS  - 3
SP  - 531
EP  - 539
L1  - http://proceedings.mlr.press/v28/meng13a.pdf
UR  - https://proceedings.mlr.press/v28/meng13a.html
AB  - Analyzing data sets of billions of records has now become a regular task in    many companies and institutions.    In the statistical analysis of those massive data sets, sampling generally    plays a very important role.    In this work, we describe a scalable simple random sampling algorithm, named    ScaSRS, which uses probabilistic thresholds to decide on the fly whether to    accept, reject, or wait-list an item independently of others.    We prove, with high probability, it succeeds and needs only O(\sqrtk)    storage, where k is the sample size.    ScaSRS extends naturally to a scalable stratified sampling algorithm, which is    favorable for heterogeneous data sets.    The proposed algorithms, when implemented in MapReduce, can effectively reduce    the size of intermediate output and greatly improve load balancing.    Empirical evaluation on large-scale data sets clearly demonstrates their    superiority.
ER  -

APA


Meng, X.. (2013). Scalable Simple Random Sampling and Stratified Sampling. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research 28(3):531-539 Available from https://proceedings.mlr.press/v28/meng13a.html.

Related Material

Download PDF