Learning and Data Selection in Big Datasets

Hossein Shokri Ghadikolaei; Hadi Ghauch; Carlo Fischione; Mikael Skoglund

Proceedings of the 36th International Conference on Machine Learning, PMLR 97:2191-2200, 2019.

Abstract

Finding a dataset of minimal cardinality to characterize the optimal parameters of a model is of paramount importance in machine learning and distributed optimization over a network. This paper investigates the compressibility of large datasets. More specifically, we propose a framework that jointly learns the input-output mapping as well as the most representative samples of the dataset (sufficient dataset). Our analytical results show that the cardinality of the sufficient dataset increases sub-linearly with respect to the original dataset size. Numerical evaluations of real datasets reveal a large compressibility, up to 95%, without a noticeable drop in the learnability performance, measured by the generalization error.

Cite this Paper

BibTeX


@InProceedings{pmlr-v97-ghadikolaei19a,
  title     = {Learning and Data Selection in Big Datasets},
  author    = {Ghadikolaei, Hossein Shokri and Ghauch, Hadi and Fischione, Carlo and Skoglund, Mikael},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {2191--2200},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/ghadikolaei19a/ghadikolaei19a.pdf},
  url       = {https://proceedings.mlr.press/v97/ghadikolaei19a.html},
  abstract  = {Finding a dataset of minimal cardinality to characterize the optimal parameters of a model is of paramount importance in machine learning and distributed optimization over a network. This paper investigates the compressibility of large datasets. More specifically, we propose a framework that jointly learns the input-output mapping as well as the most representative samples of the dataset (sufficient dataset). Our analytical results show that the cardinality of the sufficient dataset increases sub-linearly with respect to the original dataset size. Numerical evaluations of real datasets reveal a large compressibility, up to 95\%, without a noticeable drop in the learnability performance, measured by the generalization error.}
}

Endnote

%0 Conference Paper
%T Learning and Data Selection in Big Datasets
%A Hossein Shokri Ghadikolaei
%A Hadi Ghauch
%A Carlo Fischione
%A Mikael Skoglund
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-ghadikolaei19a
%I PMLR
%P 2191--2200
%U https://proceedings.mlr.press/v97/ghadikolaei19a.html
%V 97
%X Finding a dataset of minimal cardinality to characterize the optimal parameters of a model is of paramount importance in machine learning and distributed optimization over a network. This paper investigates the compressibility of large datasets. More specifically, we propose a framework that jointly learns the input-output mapping as well as the most representative samples of the dataset (sufficient dataset). Our analytical results show that the cardinality of the sufficient dataset increases sub-linearly with respect to the original dataset size. Numerical evaluations of real datasets reveal a large compressibility, up to 95%, without a noticeable drop in the learnability performance, measured by the generalization error.

APA


Ghadikolaei, H.S., Ghauch, H., Fischione, C. & Skoglund, M. (2019). Learning and Data Selection in Big Datasets. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:2191-2200. Available from https://proceedings.mlr.press/v97/ghadikolaei19a.html.
