Automatic Discovery of the Statistical Types of Variables in a Dataset

Isabel Valera, Zoubin Ghahramani
Proceedings of the 34th International Conference on Machine Learning, PMLR 70:3521-3529, 2017.

Abstract

A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, are known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.

Cite this Paper


BibTeX
@InProceedings{pmlr-v70-valera17a,
  title = {Automatic Discovery of the Statistical Types of Variables in a Dataset},
  author = {Isabel Valera and Zoubin Ghahramani},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  pages = {3521--3529},
  year = {2017},
  editor = {Precup, Doina and Teh, Yee Whye},
  volume = {70},
  series = {Proceedings of Machine Learning Research},
  month = {06--11 Aug},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v70/valera17a/valera17a.pdf},
  url = {https://proceedings.mlr.press/v70/valera17a.html},
  abstract = {A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, is known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.}
}
Endnote
%0 Conference Paper
%T Automatic Discovery of the Statistical Types of Variables in a Dataset
%A Isabel Valera
%A Zoubin Ghahramani
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh
%F pmlr-v70-valera17a
%I PMLR
%P 3521--3529
%U https://proceedings.mlr.press/v70/valera17a.html
%V 70
%X A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, is known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.
APA
Valera, I. & Ghahramani, Z. (2017). Automatic Discovery of the Statistical Types of Variables in a Dataset. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:3521-3529. Available from https://proceedings.mlr.press/v70/valera17a.html.

Related Material