Coresets for Nonparametric Estimation - the Case of DP-Means

Olivier Bachem; Mario Lucic; Andreas Krause

Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:209-217, 2015.

Abstract

Scalable training of Bayesian nonparametric models is a notoriously difficult challenge. We explore the use of coresets - a data summarization technique originating from computational geometry - for this task. Coresets are weighted subsets of the data such that models trained on these coresets are provably competitive with models trained on the full dataset. Coresets sublinear in the dataset size allow for fast approximate inference with provable guarantees. Existing constructions, however, are limited to parametric problems. Using novel techniques in coreset construction we show the existence of coresets for DP-Means - a prototypical nonparametric clustering problem - and provide a practical construction algorithm. We empirically demonstrate that our algorithm allows us to efficiently trade off computation time and approximation error and thus scale DP-Means to large datasets. For instance, with coresets we can obtain a computational speedup of 45x at an approximation error of only 2.4% compared to solving on the full data set. In contrast, for the same subsample size, the “naive” approach of uniformly subsampling the data incurs an approximation error of 22.5%.
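As a rough illustration of the comparison in the abstract, the sketch below evaluates a clustering cost on the full data and on a uniformly subsampled, reweighted subset (the "naive" baseline). It assumes the standard DP-Means objective, the sum of squared distances to the nearest center plus a penalty lam per center, which the abstract itself does not spell out. The function names, the synthetic data, and the parameter values are hypothetical, and the paper's actual coreset construction is not reproduced here.

```python
import random


def dp_means_cost(points, weights, centers, lam):
    """Weighted DP-Means cost (assumed standard form):
    sum_i w_i * min_c ||x_i - c||^2 + lam * (number of centers)."""
    total = 0.0
    for x, w in zip(points, weights):
        nearest = min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
        total += w * nearest
    return total + lam * len(centers)


def uniform_subsample(points, m, rng):
    """Naive baseline: m points drawn uniformly without replacement,
    each weighted n/m so the weights sum to the full dataset size."""
    sample = rng.sample(points, m)
    return sample, [len(points) / m] * m


rng = random.Random(0)
# Hypothetical data: two Gaussian blobs in 2-D.
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
data += [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(500)]
centers = [(0.0, 0.0), (8.0, 8.0)]  # a fixed candidate solution
lam = 10.0                          # illustrative penalty per cluster

full_cost = dp_means_cost(data, [1.0] * len(data), centers, lam)
sub, w = uniform_subsample(data, 100, rng)
sub_cost = dp_means_cost(sub, w, centers, lam)
rel_err = abs(sub_cost - full_cost) / full_cost
print(f"full: {full_cost:.1f}  subsample: {sub_cost:.1f}  rel. error: {rel_err:.3f}")
```

A coreset has the same shape as the pair returned by `uniform_subsample` (points plus weights), so `dp_means_cost` could score one unchanged. The difference lies in how the points and weights are chosen so that the weighted cost provably stays close to the full cost for every candidate solution.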

Cite this Paper

BibTeX


@InProceedings{pmlr-v37-bachem15,
  title     = {Coresets for Nonparametric Estimation - the Case of DP-Means},
  author    = {Bachem, Olivier and Lucic, Mario and Krause, Andreas},
  booktitle = {Proceedings of the 32nd International Conference on Machine Learning},
  pages     = {209--217},
  year      = {2015},
  editor    = {Bach, Francis and Blei, David},
  volume    = {37},
  series    = {Proceedings of Machine Learning Research},
  address   = {Lille, France},
  month     = {07--09 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v37/bachem15.pdf},
  url       = {https://proceedings.mlr.press/v37/bachem15.html},
  abstract  = {Scalable training of Bayesian nonparametric models is a notoriously difficult challenge. We explore the use of coresets - a data summarization technique originating from computational geometry - for this task. Coresets are weighted subsets of the data such that models trained on these coresets are provably competitive with models trained on the full dataset. Coresets sublinear in the dataset size allow for fast approximate inference with provable guarantees. Existing constructions, however, are limited to parametric problems. Using novel techniques in coreset construction we show the existence of coresets for DP-Means - a prototypical nonparametric clustering problem - and provide a practical construction algorithm. We empirically demonstrate that our algorithm allows us to efficiently trade off computation time and approximation error and thus scale DP-Means to large datasets. For instance, with coresets we can obtain a computational speedup of 45x at an approximation error of only 2.4\% compared to solving on the full data set. In contrast, for the same subsample size, the “naive” approach of uniformly subsampling the data incurs an approximation error of 22.5\%.}
}

Endnote

%0 Conference Paper
%T Coresets for Nonparametric Estimation - the Case of DP-Means
%A Olivier Bachem
%A Mario Lucic
%A Andreas Krause
%B Proceedings of the 32nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2015
%E Francis Bach
%E David Blei
%F pmlr-v37-bachem15
%I PMLR
%P 209--217
%U https://proceedings.mlr.press/v37/bachem15.html
%V 37
%X Scalable training of Bayesian nonparametric models is a notoriously difficult challenge. We explore the use of coresets - a data summarization technique originating from computational geometry - for this task. Coresets are weighted subsets of the data such that models trained on these coresets are provably competitive with models trained on the full dataset. Coresets sublinear in the dataset size allow for fast approximate inference with provable guarantees. Existing constructions, however, are limited to parametric problems. Using novel techniques in coreset construction we show the existence of coresets for DP-Means - a prototypical nonparametric clustering problem - and provide a practical construction algorithm. We empirically demonstrate that our algorithm allows us to efficiently trade off computation time and approximation error and thus scale DP-Means to large datasets. For instance, with coresets we can obtain a computational speedup of 45x at an approximation error of only 2.4% compared to solving on the full data set. In contrast, for the same subsample size, the “naive” approach of uniformly subsampling the data incurs an approximation error of 22.5%.

RIS


TY  - CPAPER
TI  - Coresets for Nonparametric Estimation - the Case of DP-Means
AU  - Olivier Bachem
AU  - Mario Lucic
AU  - Andreas Krause
BT  - Proceedings of the 32nd International Conference on Machine Learning
DA  - 2015/06/01
ED  - Francis Bach
ED  - David Blei
ID  - pmlr-v37-bachem15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 37
SP  - 209
EP  - 217
L1  - http://proceedings.mlr.press/v37/bachem15.pdf
UR  - https://proceedings.mlr.press/v37/bachem15.html
AB  - Scalable training of Bayesian nonparametric models is a notoriously difficult challenge. We explore the use of coresets - a data summarization technique originating from computational geometry - for this task. Coresets are weighted subsets of the data such that models trained on these coresets are provably competitive with models trained on the full dataset. Coresets sublinear in the dataset size allow for fast approximate inference with provable guarantees. Existing constructions, however, are limited to parametric problems. Using novel techniques in coreset construction we show the existence of coresets for DP-Means - a prototypical nonparametric clustering problem - and provide a practical construction algorithm. We empirically demonstrate that our algorithm allows us to efficiently trade off computation time and approximation error and thus scale DP-Means to large datasets. For instance, with coresets we can obtain a computational speedup of 45x at an approximation error of only 2.4% compared to solving on the full data set. In contrast, for the same subsample size, the “naive” approach of uniformly subsampling the data incurs an approximation error of 22.5%.
ER  -

APA


Bachem, O., Lucic, M., & Krause, A. (2015). Coresets for Nonparametric Estimation - the Case of DP-Means. Proceedings of the 32nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 37:209-217. Available from https://proceedings.mlr.press/v37/bachem15.html.
