Sequential Random Sampling Revisited: Hidden Shuffle Method

Michael Shekelyan; Graham Cormode

Sequential Random Sampling Revisited: Hidden Shuffle Method

Michael Shekelyan, Graham Cormode

Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:3628-3636, 2021.

Abstract

Random sampling (without replacement) is ubiquitously employed to obtain a representative subset of the data. Unlike common methods, sequential methods report samples in ascending order of index without keeping track of previous samples. This enables lightweight iterators that can jump directly from one sampled position to the next. Previously, sequential methods focused on drawing from the distribution of gap sizes, which requires intricate algorithms that are difficult to validate and can be slow in the worst-case. This can be avoided by a new method, the Hidden Shuffle. The name mirrors the fact that although the algorithm does not resemble shuffling, its correctness can be proven by conceptualising the sampling process as a random shuffle. The Hidden Shuffle algorithm stores just a handful of values, can be implemented in few lines of code, offers strong worst-case guarantees and is shown to be faster than state-of-the-art methods while using comparably few random variates.

Cite this Paper

BibTeX

@InProceedings{pmlr-v130-shekelyan21a,
  title = 	 { Sequential Random Sampling Revisited: Hidden Shuffle Method },
  author =       {Shekelyan, Michael and Cormode, Graham},
  booktitle = 	 {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {3628--3636},
  year = 	 {2021},
  editor = 	 {Banerjee, Arindam and Fukumizu, Kenji},
  volume = 	 {130},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--15 Apr},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v130/shekelyan21a/shekelyan21a.pdf},
  url = 	 {https://proceedings.mlr.press/v130/shekelyan21a.html},
  abstract = 	 { Random sampling (without replacement) is ubiquitously employed to obtain a representative subset of the data. Unlike common methods, sequential methods report samples in ascending order of index without keeping track of previous samples. This enables lightweight iterators that can jump directly from one sampled position to the next. Previously, sequential methods focused on drawing from the distribution of gap sizes, which requires intricate algorithms that are difficult to validate and can be slow in the worst-case. This can be avoided by a new method, the Hidden Shuffle. The name mirrors the fact that although the algorithm does not resemble shuffling, its correctness can be proven by conceptualising the sampling process as a random shuffle. The Hidden Shuffle algorithm stores just a handful of values, can be implemented in few lines of code, offers strong worst-case guarantees and is shown to be faster than state-of-the-art methods while using comparably few random variates. }
}

Endnote

%0 Conference Paper
%T  Sequential Random Sampling Revisited: Hidden Shuffle Method 
%A Michael Shekelyan
%A Graham Cormode
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu	
%F pmlr-v130-shekelyan21a
%I PMLR
%P 3628--3636
%U https://proceedings.mlr.press/v130/shekelyan21a.html
%V 130
%X  Random sampling (without replacement) is ubiquitously employed to obtain a representative subset of the data. Unlike common methods, sequential methods report samples in ascending order of index without keeping track of previous samples. This enables lightweight iterators that can jump directly from one sampled position to the next. Previously, sequential methods focused on drawing from the distribution of gap sizes, which requires intricate algorithms that are difficult to validate and can be slow in the worst-case. This can be avoided by a new method, the Hidden Shuffle. The name mirrors the fact that although the algorithm does not resemble shuffling, its correctness can be proven by conceptualising the sampling process as a random shuffle. The Hidden Shuffle algorithm stores just a handful of values, can be implemented in few lines of code, offers strong worst-case guarantees and is shown to be faster than state-of-the-art methods while using comparably few random variates.

APA

Shekelyan, M. & Cormode, G.. (2021).  Sequential Random Sampling Revisited: Hidden Shuffle Method . Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:3628-3636 Available from https://proceedings.mlr.press/v130/shekelyan21a.html.

Related Material

Download PDF