Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures

Mario Lucic; Olivier Bachem; Andreas Krause

Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures

Mario Lucic, Olivier Bachem, Andreas Krause

Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, PMLR 51:1-9, 2016.

Abstract

Coresets are efficient representations of data sets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clustering problem. We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. This class includes hard clustering with popular distortion measures such as the Squared Euclidean distance, the Mahalanobis distance, KL-divergence and Itakura-Saito distance. The corresponding soft clustering problems are directly related to popular mixture models due to a dual relationship between Bregman divergences and Exponential family distributions. Our theoretical results further imply a randomized polynomial-time approximation scheme for hard clustering. We demonstrate the practicality of the proposed algorithm in an empirical evaluation.

Cite this Paper

BibTeX


@InProceedings{pmlr-v51-lucic16,
  title = 	 {Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures},
  author = 	 {Lucic, Mario and Bachem, Olivier and Krause, Andreas},
  booktitle = 	 {Proceedings of the 19th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {1--9},
  year = 	 {2016},
  editor = 	 {Gretton, Arthur and Robert, Christian C.},
  volume = 	 {51},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Cadiz, Spain},
  month = 	 {09--11 May},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v51/lucic16.pdf},
  url = 	 {https://proceedings.mlr.press/v51/lucic16.html},
  abstract = 	 {Coresets are efficient representations of data sets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clustering problem.  We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. This class includes hard clustering with popular distortion measures such as the Squared Euclidean distance, the Mahalanobis distance, KL-divergence and Itakura-Saito distance. The corresponding soft clustering problems are directly related to popular mixture models due to a dual relationship between Bregman divergences and Exponential family distributions. Our theoretical results further imply a randomized polynomial-time approximation scheme for hard clustering. We demonstrate the practicality of the proposed algorithm in an empirical evaluation.}
}

Endnote

%0 Conference Paper
%T Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures
%A Mario Lucic
%A Olivier Bachem
%A Andreas Krause
%B Proceedings of the 19th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2016
%E Arthur Gretton
%E Christian C. Robert	
%F pmlr-v51-lucic16
%I PMLR
%P 1--9
%U https://proceedings.mlr.press/v51/lucic16.html
%V 51
%X Coresets are efficient representations of data sets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clustering problem.  We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. This class includes hard clustering with popular distortion measures such as the Squared Euclidean distance, the Mahalanobis distance, KL-divergence and Itakura-Saito distance. The corresponding soft clustering problems are directly related to popular mixture models due to a dual relationship between Bregman divergences and Exponential family distributions. Our theoretical results further imply a randomized polynomial-time approximation scheme for hard clustering. We demonstrate the practicality of the proposed algorithm in an empirical evaluation.

RIS


TY  - CPAPER
TI  - Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures
AU  - Mario Lucic
AU  - Olivier Bachem
AU  - Andreas Krause
BT  - Proceedings of the 19th International Conference on Artificial Intelligence and Statistics
DA  - 2016/05/02
ED  - Arthur Gretton
ED  - Christian C. Robert	
ID  - pmlr-v51-lucic16
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 51
SP  - 1
EP  - 9
L1  - http://proceedings.mlr.press/v51/lucic16.pdf
UR  - https://proceedings.mlr.press/v51/lucic16.html
AB  - Coresets are efficient representations of data sets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clustering problem.  We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. This class includes hard clustering with popular distortion measures such as the Squared Euclidean distance, the Mahalanobis distance, KL-divergence and Itakura-Saito distance. The corresponding soft clustering problems are directly related to popular mixture models due to a dual relationship between Bregman divergences and Exponential family distributions. Our theoretical results further imply a randomized polynomial-time approximation scheme for hard clustering. We demonstrate the practicality of the proposed algorithm in an empirical evaluation.
ER  -

APA


Lucic, M., Bachem, O. & Krause, A.. (2016). Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 51:1-9 Available from https://proceedings.mlr.press/v51/lucic16.html.

Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures

Abstract

Cite this Paper

Related Material