Discrepancy, Coresets, and Sketches in Machine Learning

Zohar Karnin, Edo Liberty
Proceedings of the Thirty-Second Conference on Learning Theory, PMLR 99:1975-1993, 2019.

Abstract

This paper defines the notion of class discrepancy for families of functions. It shows that low discrepancy classes admit small offline and streaming coresets. We provide general techniques for bounding the class discrepancy of machine learning problems. As corollaries of the general technique we bound the discrepancy of logistic regression, sigmoid activation loss, matrix covariance, kernel density and any analytic function of the dot product or the squared distance. Our result resolves a long-standing open problem regarding the coreset complexity of Gaussian kernel density estimation. We provide two more related but independent results. First, an exponential improvement of the widely used merge-and-reduce trick which gives improved streaming sketches for any low discrepancy problem. Second, an extremely simple deterministic algorithm for finding low discrepancy sequences (and therefore coresets) for any positive semi-definite kernel. This paper establishes some explicit connections between class discrepancy, coreset complexity, learnability, and streaming algorithms.

Cite this Paper


BibTeX
@InProceedings{pmlr-v99-karnin19a, title = {Discrepancy, Coresets, and Sketches in Machine Learning}, author = {Karnin, Zohar and Liberty, Edo}, booktitle = {Proceedings of the Thirty-Second Conference on Learning Theory}, pages = {1975--1993}, year = {2019}, editor = {Beygelzimer, Alina and Hsu, Daniel}, volume = {99}, series = {Proceedings of Machine Learning Research}, month = {25--28 Jun}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v99/karnin19a/karnin19a.pdf}, url = {https://proceedings.mlr.press/v99/karnin19a.html}, abstract = {This paper defines the notion of class discrepancy for families of functions. It shows that low discrepancy classes admit small offline and streaming coresets. We provide general techniques for bounding the class discrepancy of machine learning problems. As corollaries of the general technique we bound the discrepancy of logistic regression, sigmoid activation loss, matrix covariance, kernel density and any analytic function of the dot product or the squared distance. Our result resolves a long-standing open problem regarding the coreset complexity of Gaussian kernel density estimation. We provide two more related but independent results. First, an exponential improvement of the widely used merge-and-reduce trick which gives improved streaming sketches for any low discrepancy problem. Second, an extremely simple deterministic algorithm for finding low discrepancy sequences (and therefore coresets) for any positive semi-definite kernel. This paper establishes some explicit connections between class discrepancy, coreset complexity, learnability, and streaming algorithms.} }
Endnote
%0 Conference Paper %T Discrepancy, Coresets, and Sketches in Machine Learning %A Zohar Karnin %A Edo Liberty %B Proceedings of the Thirty-Second Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2019 %E Alina Beygelzimer %E Daniel Hsu %F pmlr-v99-karnin19a %I PMLR %P 1975--1993 %U https://proceedings.mlr.press/v99/karnin19a.html %V 99 %X This paper defines the notion of class discrepancy for families of functions. It shows that low discrepancy classes admit small offline and streaming coresets. We provide general techniques for bounding the class discrepancy of machine learning problems. As corollaries of the general technique we bound the discrepancy of logistic regression, sigmoid activation loss, matrix covariance, kernel density and any analytic function of the dot product or the squared distance. Our result resolves a long-standing open problem regarding the coreset complexity of Gaussian kernel density estimation. We provide two more related but independent results. First, an exponential improvement of the widely used merge-and-reduce trick which gives improved streaming sketches for any low discrepancy problem. Second, an extremely simple deterministic algorithm for finding low discrepancy sequences (and therefore coresets) for any positive semi-definite kernel. This paper establishes some explicit connections between class discrepancy, coreset complexity, learnability, and streaming algorithms.
APA
Karnin, Z. & Liberty, E.. (2019). Discrepancy, Coresets, and Sketches in Machine Learning. Proceedings of the Thirty-Second Conference on Learning Theory, in Proceedings of Machine Learning Research 99:1975-1993 Available from https://proceedings.mlr.press/v99/karnin19a.html.

Related Material