Stratified Sampling Meets Machine Learning

Edo Liberty; Kevin Lang; Konstantin Shmakov

Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:2320-2329, 2016.

Abstract

This paper solves a specialized regression problem to obtain sampling probabilities for records in databases. The goal is to sample a small set of records over which aggregate queries can be evaluated both efficiently and accurately. We provide a principled and provable solution for this problem; it is parameterless and requires no data insights. Unlike standard regression problems, the loss is inversely proportional to the regressed-to values. Moreover, a zero-cost solution always exists and can only be excluded by hard budget constraints. A unique form of regularization is also needed. We provide an efficient and simple regularized Empirical Risk Minimization (ERM) algorithm along with a theoretical generalization result. Our extensive experimental results show significant improvements over both uniform sampling and standard stratified sampling, which are de facto the industry standards.
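
To make the abstract's setting concrete, here is a minimal sketch of estimating a sum query from a sample drawn with per-record probabilities. The abstract does not specify the estimator, so the sketch assumes independent sampling with the standard Horvitz-Thompson inverse-probability weights; it is not the paper's ERM algorithm, and the function names, toy values, and budget are all hypothetical. Under this assumption the variance contains a 1/p term for each record, which illustrates why the loss grows as probabilities shrink and why setting every probability to 1 gives zero error unless a budget rules it out.

```python
import random


def sample_records(values, probs, rng):
    """Keep record i independently with probability probs[i] (assumed scheme)."""
    return [(v, p) for v, p in zip(values, probs) if rng.random() < p]


def estimate_sum(sample):
    """Horvitz-Thompson estimate of the total: weight each kept value by 1/p."""
    return sum(v / p for v, p in sample)


def estimator_variance(values, probs):
    """Variance of the estimate under independent sampling: sum v^2 * (1/p - 1)."""
    return sum(v * v * (1.0 / p - 1.0) for v, p in zip(values, probs))


# Hypothetical toy table: one heavy record and three light ones.
values = [100.0, 1.0, 1.0, 1.0]
budget = 2.0  # expected number of sampled records

uniform = [budget / len(values)] * len(values)
skewed = [1.0, 1.0 / 3, 1.0 / 3, 1.0 / 3]  # same expected sample size
keep_all = [1.0] * len(values)  # zero-error solution, but it ignores the budget

if __name__ == "__main__":
    rng = random.Random(0)
    print("uniform variance:", estimator_variance(values, uniform))
    print("skewed variance:", estimator_variance(values, skewed))
    print("one skewed estimate:", estimate_sum(sample_records(values, skewed, rng)))
```

Both probability vectors spend the same expected budget, yet the one that always keeps the heavy record has far lower variance. Choosing such probabilities without hand-tuning is the problem the paper addresses.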

Cite this Paper

BibTeX


@InProceedings{pmlr-v48-liberty16,
  title = 	 {Stratified Sampling Meets Machine Learning},
  author = 	 {Liberty, Edo and Lang, Kevin and Shmakov, Konstantin},
  booktitle = 	 {Proceedings of The 33rd International Conference on Machine Learning},
  pages = 	 {2320--2329},
  year = 	 {2016},
  editor = 	 {Balcan, Maria Florina and Weinberger, Kilian Q.},
  volume = 	 {48},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {New York, New York, USA},
  month = 	 {20--22 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v48/liberty16.pdf},
  url = 	 {https://proceedings.mlr.press/v48/liberty16.html},
  abstract = 	 {This paper solves a specialized regression problem to obtain sampling probabilities for records in databases. The goal is to sample a small set of records over which evaluating aggregate queries can be done both efficiently and accurately. We provide a principled and provable solution for this problem; it is parameterless and requires no data insights. Unlike standard regression problems, the loss is inversely proportional to the regressed-to values. Moreover, a cost zero solution always exists and can only be excluded by hard budget constraints. A unique form of regularization is also needed. We provide an efficient and simple regularized Empirical Risk Minimization (ERM) algorithm along with a theoretical generalization result. Our extensive experimental results significantly improve over both uniform sampling and standard stratified sampling which are de-facto the industry standards.}
}

Endnote

%0 Conference Paper
%T Stratified Sampling Meets Machine Learning
%A Edo Liberty
%A Kevin Lang
%A Konstantin Shmakov
%B Proceedings of The 33rd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2016
%E Maria Florina Balcan
%E Kilian Q. Weinberger
%F pmlr-v48-liberty16
%I PMLR
%P 2320--2329
%U https://proceedings.mlr.press/v48/liberty16.html
%V 48
%X This paper solves a specialized regression problem to obtain sampling probabilities for records in databases. The goal is to sample a small set of records over which evaluating aggregate queries can be done both efficiently and accurately. We provide a principled and provable solution for this problem; it is parameterless and requires no data insights. Unlike standard regression problems, the loss is inversely proportional to the regressed-to values. Moreover, a cost zero solution always exists and can only be excluded by hard budget constraints. A unique form of regularization is also needed. We provide an efficient and simple regularized Empirical Risk Minimization (ERM) algorithm along with a theoretical generalization result. Our extensive experimental results significantly improve over both uniform sampling and standard stratified sampling which are de-facto the industry standards.

RIS


TY  - CPAPER
TI  - Stratified Sampling Meets Machine Learning
AU  - Edo Liberty
AU  - Kevin Lang
AU  - Konstantin Shmakov
BT  - Proceedings of The 33rd International Conference on Machine Learning
DA  - 2016/06/11
ED  - Maria Florina Balcan
ED  - Kilian Q. Weinberger
ID  - pmlr-v48-liberty16
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 48
SP  - 2320
EP  - 2329
L1  - http://proceedings.mlr.press/v48/liberty16.pdf
UR  - https://proceedings.mlr.press/v48/liberty16.html
AB  - This paper solves a specialized regression problem to obtain sampling probabilities for records in databases. The goal is to sample a small set of records over which evaluating aggregate queries can be done both efficiently and accurately. We provide a principled and provable solution for this problem; it is parameterless and requires no data insights. Unlike standard regression problems, the loss is inversely proportional to the regressed-to values. Moreover, a cost zero solution always exists and can only be excluded by hard budget constraints. A unique form of regularization is also needed. We provide an efficient and simple regularized Empirical Risk Minimization (ERM) algorithm along with a theoretical generalization result. Our extensive experimental results significantly improve over both uniform sampling and standard stratified sampling which are de-facto the industry standards.
ER  -

APA


Liberty, E., Lang, K. & Shmakov, K. (2016). Stratified Sampling Meets Machine Learning. Proceedings of The 33rd International Conference on Machine Learning, in Proceedings of Machine Learning Research 48:2320-2329. Available from https://proceedings.mlr.press/v48/liberty16.html.
