Learning and Data Selection in Big Datasets

Hossein Shokri Ghadikolaei, Hadi Ghauch, Carlo Fischione, Mikael Skoglund
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:2191-2200, 2019.

Abstract

Finding a dataset of minimal cardinality to characterize the optimal parameters of a model is of paramount importance in machine learning and distributed optimization over a network. This paper investigates the compressibility of large datasets. More specifically, we propose a framework that jointly learns the input-output mapping as well as the most representative samples of the dataset (sufficient dataset). Our analytical results show that the cardinality of the sufficient dataset increases sub-linearly with respect to the original dataset size. Numerical evaluations of real datasets reveal a large compressibility, up to 95%, without a noticeable drop in the learnability performance, measured by the generalization error.
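The paper's own framework is not reproduced on this page, but the core idea, jointly fitting a model and selecting a small representative subset of the data, can be illustrated with a simple alternating scheme. The sketch below is an illustrative assumption, not the authors' algorithm: the function name select_sufficient_subset, the greedy residual-based selection rule, and all parameters are made up for demonstration. It alternates between fitting a ridge regressor on the current subset and adding the samples the model currently explains worst:

import numpy as np
from sklearn.linear_model import Ridge

def select_sufficient_subset(X, y, budget, step=50, seed=0):
    """Greedy alternating sketch (illustrative only): fit on the current
    subset, then add the `step` samples with the largest residuals,
    repeating until `budget` samples have been selected."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Start from a small random seed subset.
    selected = set(rng.choice(n, size=min(step, budget), replace=False).tolist())
    model = Ridge(alpha=1.0)
    while len(selected) < budget:
        sub = np.fromiter(selected, dtype=int)
        model.fit(X[sub], y[sub])
        residuals = np.abs(model.predict(X) - y)   # loss over the full dataset
        residuals[sub] = -np.inf                   # never re-pick a chosen sample
        k = min(step, budget - len(selected))
        selected.update(np.argpartition(residuals, -k)[-k:].tolist())
    sub = np.fromiter(selected, dtype=int)
    model.fit(X[sub], y[sub])                      # final fit on the subset
    return sub, model

# Toy usage on synthetic data: keep 500 of 10,000 samples (95% compression)
# and report the subset-trained model's error over the full dataset.
X = np.random.randn(10_000, 20)
y = X @ np.random.randn(20) + 0.1 * np.random.randn(10_000)
subset, model = select_sufficient_subset(X, y, budget=500)
print(len(subset), np.mean((model.predict(X) - y) ** 2))

The residual-based rule is only one possible selection criterion; the compression ratios achievable in practice depend entirely on the dataset and on the selection rule, as the abstract's sub-linear growth result is specific to the paper's framework.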

Cite this Paper

BibTeX
@InProceedings{pmlr-v97-ghadikolaei19a,
  title     = {Learning and Data Selection in Big Datasets},
  author    = {Ghadikolaei, Hossein Shokri and Ghauch, Hadi and Fischione, Carlo and Skoglund, Mikael},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {2191--2200},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/ghadikolaei19a/ghadikolaei19a.pdf},
  url       = {https://proceedings.mlr.press/v97/ghadikolaei19a.html},
  abstract  = {Finding a dataset of minimal cardinality to characterize the optimal parameters of a model is of paramount importance in machine learning and distributed optimization over a network. This paper investigates the compressibility of large datasets. More specifically, we propose a framework that jointly learns the input-output mapping as well as the most representative samples of the dataset (sufficient dataset). Our analytical results show that the cardinality of the sufficient dataset increases sub-linearly with respect to the original dataset size. Numerical evaluations of real datasets reveal a large compressibility, up to 95%, without a noticeable drop in the learnability performance, measured by the generalization error.}
}
Endnote
%0 Conference Paper
%T Learning and Data Selection in Big Datasets
%A Hossein Shokri Ghadikolaei
%A Hadi Ghauch
%A Carlo Fischione
%A Mikael Skoglund
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-ghadikolaei19a
%I PMLR
%P 2191--2200
%U https://proceedings.mlr.press/v97/ghadikolaei19a.html
%V 97
%X Finding a dataset of minimal cardinality to characterize the optimal parameters of a model is of paramount importance in machine learning and distributed optimization over a network. This paper investigates the compressibility of large datasets. More specifically, we propose a framework that jointly learns the input-output mapping as well as the most representative samples of the dataset (sufficient dataset). Our analytical results show that the cardinality of the sufficient dataset increases sub-linearly with respect to the original dataset size. Numerical evaluations of real datasets reveal a large compressibility, up to 95%, without a noticeable drop in the learnability performance, measured by the generalization error.
APA
Ghadikolaei, H.S., Ghauch, H., Fischione, C. & Skoglund, M. (2019). Learning and Data Selection in Big Datasets. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:2191-2200. Available from https://proceedings.mlr.press/v97/ghadikolaei19a.html.