SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication

Rebecca Steorts, Rob Hall, Stephen Fienberg
Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, PMLR 33:922-930, 2014.

Abstract

We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high dimensional parameter space. We assess our results on real and simulated data.

Cite this Paper


BibTeX
@InProceedings{pmlr-v33-steorts14,
  title = {{SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication}},
  author = {Steorts, Rebecca and Hall, Rob and Fienberg, Stephen},
  booktitle = {Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics},
  pages = {922--930},
  year = {2014},
  editor = {Kaski, Samuel and Corander, Jukka},
  volume = {33},
  series = {Proceedings of Machine Learning Research},
  address = {Reykjavik, Iceland},
  month = {22--25 Apr},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v33/steorts14.pdf},
  url = {https://proceedings.mlr.press/v33/steorts14.html},
  abstract = {We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high dimensional parameter space. We assess our results on real and simulated data.}
}
Endnote
%0 Conference Paper
%T SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication
%A Rebecca Steorts
%A Rob Hall
%A Stephen Fienberg
%B Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2014
%E Samuel Kaski
%E Jukka Corander
%F pmlr-v33-steorts14
%I PMLR
%P 922--930
%U https://proceedings.mlr.press/v33/steorts14.html
%V 33
%X We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high dimensional parameter space. We assess our results on real and simulated data.
RIS
TY  - CPAPER
TI  - SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication
AU  - Rebecca Steorts
AU  - Rob Hall
AU  - Stephen Fienberg
BT  - Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics
DA  - 2014/04/02
ED  - Samuel Kaski
ED  - Jukka Corander
ID  - pmlr-v33-steorts14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 33
SP  - 922
EP  - 930
L1  - http://proceedings.mlr.press/v33/steorts14.pdf
UR  - https://proceedings.mlr.press/v33/steorts14.html
AB  - We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high dimensional parameter space. We assess our results on real and simulated data.
ER  -
APA
Steorts, R., Hall, R. & Fienberg, S. (2014). SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 33:922-930. Available from https://proceedings.mlr.press/v33/steorts14.html.
