Statistical Matching of Discrete Data by Bayesian Networks

Eva Endres, Thomas Augustin
Proceedings of the Eighth International Conference on Probabilistic Graphical Models, PMLR 52:159-170, 2016.

Abstract

Statistical matching (also known as data fusion, data merging, or data integration) is the umbrella term for a collection of methods which serve to combine different data sources. The objective is to obtain joint information about variables which have not jointly been collected in one survey, but on two (or more) surveys with disjoint sets of observation units. Besides specific variables for the different data files, it is indispensable to have common variables which are observed in both data sets and on basis of which the matching can be performed. Several existing statistical matching approaches are based on the assumption of conditional independence of the specific variables given the common variables. Relying on the well-known fact that d-separation is related to conditional independence for a probability distribution which factorizes along a directed acyclic graph, we suggest to use probabilistic graphical models as a powerful tool for statistical matching. In this paper, we describe and discuss first attempts for statistical matching of discrete data by Bayesian networks. The approach is exemplarily applied to data collected within the scope of the German General Social Survey.

Cite this Paper


BibTeX
@InProceedings{pmlr-v52-endres16, title = {Statistical Matching of Discrete Data by {B}ayesian Networks}, author = {Endres, Eva and Augustin, Thomas}, booktitle = {Proceedings of the Eighth International Conference on Probabilistic Graphical Models}, pages = {159--170}, year = {2016}, editor = {Antonucci, Alessandro and Corani, Giorgio and Campos}, Cassio Polpo}, volume = {52}, series = {Proceedings of Machine Learning Research}, address = {Lugano, Switzerland}, month = {06--09 Sep}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v52/endres16.pdf}, url = {https://proceedings.mlr.press/v52/endres16.html}, abstract = {Statistical matching (also known as data fusion, data merging, or data integration) is the umbrella term for a collection of methods which serve to combine different data sources. The objective is to obtain joint information about variables which have not jointly been collected in one survey, but on two (or more) surveys with disjoint sets of observation units. Besides specific variables for the different data files, it is indispensable to have common variables which are observed in both data sets and on basis of which the matching can be performed. Several existing statistical matching approaches are based on the assumption of conditional independence of the specific variables given the common variables. Relying on the well-known fact that d-separation is related to conditional independence for a probability distribution which factorizes along a directed acyclic graph, we suggest to use probabilistic graphical models as a powerful tool for statistical matching. In this paper, we describe and discuss first attempts for statistical matching of discrete data by Bayesian networks. The approach is exemplarily applied to data collected within the scope of the German General Social Survey.} }
Endnote
%0 Conference Paper %T Statistical Matching of Discrete Data by Bayesian Networks %A Eva Endres %A Thomas Augustin %B Proceedings of the Eighth International Conference on Probabilistic Graphical Models %C Proceedings of Machine Learning Research %D 2016 %E Alessandro Antonucci %E Giorgio Corani %E Cassio Polpo Campos} %F pmlr-v52-endres16 %I PMLR %P 159--170 %U https://proceedings.mlr.press/v52/endres16.html %V 52 %X Statistical matching (also known as data fusion, data merging, or data integration) is the umbrella term for a collection of methods which serve to combine different data sources. The objective is to obtain joint information about variables which have not jointly been collected in one survey, but on two (or more) surveys with disjoint sets of observation units. Besides specific variables for the different data files, it is indispensable to have common variables which are observed in both data sets and on basis of which the matching can be performed. Several existing statistical matching approaches are based on the assumption of conditional independence of the specific variables given the common variables. Relying on the well-known fact that d-separation is related to conditional independence for a probability distribution which factorizes along a directed acyclic graph, we suggest to use probabilistic graphical models as a powerful tool for statistical matching. In this paper, we describe and discuss first attempts for statistical matching of discrete data by Bayesian networks. The approach is exemplarily applied to data collected within the scope of the German General Social Survey.
RIS
TY - CPAPER TI - Statistical Matching of Discrete Data by Bayesian Networks AU - Eva Endres AU - Thomas Augustin BT - Proceedings of the Eighth International Conference on Probabilistic Graphical Models DA - 2016/08/15 ED - Alessandro Antonucci ED - Giorgio Corani ED - Cassio Polpo Campos} ID - pmlr-v52-endres16 PB - PMLR DP - Proceedings of Machine Learning Research VL - 52 SP - 159 EP - 170 L1 - http://proceedings.mlr.press/v52/endres16.pdf UR - https://proceedings.mlr.press/v52/endres16.html AB - Statistical matching (also known as data fusion, data merging, or data integration) is the umbrella term for a collection of methods which serve to combine different data sources. The objective is to obtain joint information about variables which have not jointly been collected in one survey, but on two (or more) surveys with disjoint sets of observation units. Besides specific variables for the different data files, it is indispensable to have common variables which are observed in both data sets and on basis of which the matching can be performed. Several existing statistical matching approaches are based on the assumption of conditional independence of the specific variables given the common variables. Relying on the well-known fact that d-separation is related to conditional independence for a probability distribution which factorizes along a directed acyclic graph, we suggest to use probabilistic graphical models as a powerful tool for statistical matching. In this paper, we describe and discuss first attempts for statistical matching of discrete data by Bayesian networks. The approach is exemplarily applied to data collected within the scope of the German General Social Survey. ER -
APA
Endres, E. & Augustin, T.. (2016). Statistical Matching of Discrete Data by Bayesian Networks. Proceedings of the Eighth International Conference on Probabilistic Graphical Models, in Proceedings of Machine Learning Research 52:159-170 Available from https://proceedings.mlr.press/v52/endres16.html.

Related Material