Automatic Discovery of the Statistical Types of Variables in a Dataset

Isabel Valera; Zoubin Ghahramani

Proceedings of the 34th International Conference on Machine Learning, PMLR 70:3521-3529, 2017.

Abstract

A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, are known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.

Cite this Paper

BibTeX

@InProceedings{pmlr-v70-valera17a,
  title     = {Automatic Discovery of the Statistical Types of Variables in a Dataset},
  author    = {Isabel Valera and Zoubin Ghahramani},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  pages     = {3521--3529},
  year      = {2017},
  editor    = {Precup, Doina and Teh, Yee Whye},
  volume    = {70},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--11 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v70/valera17a/valera17a.pdf},
  url       = {https://proceedings.mlr.press/v70/valera17a.html},
  abstract = 	 {A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, is known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.}
}

Endnote

%0 Conference Paper
%T Automatic Discovery of the Statistical Types of Variables in a Dataset
%A Isabel Valera
%A Zoubin Ghahramani
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh
%F pmlr-v70-valera17a
%I PMLR
%P 3521--3529
%U https://proceedings.mlr.press/v70/valera17a.html
%V 70
%X A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, is known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.

APA

Valera, I. & Ghahramani, Z. (2017). Automatic Discovery of the Statistical Types of Variables in a Dataset. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:3521-3529. Available from https://proceedings.mlr.press/v70/valera17a.html.
