Performance Bounds for Graphical Record Linkage

Rebecca C. Steorts, Mattew Barnes, Willie Neiswanger
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:298-306, 2017.

Abstract

Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional linkage methods directly linking records to one another are computationally infeasible as the number of records grows. As a result, it is increasingly common for researchers to treat linkage as a clustering task, in which each latent entity is associated with one or more noisy database records. We critically assess performance bounds using the Kullback-Leibler (KL) divergence under a Bayesian record linkage framework, making connections to Kolchin partition models. We provide an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity. We give insights for when our bounds hold using simulated data and provide practical user guidance.

Cite this Paper


BibTeX
@InProceedings{pmlr-v54-steorts17a, title = {{Performance Bounds for Graphical Record Linkage}}, author = {Steorts, Rebecca C. and Barnes, Mattew and Neiswanger, Willie}, booktitle = {Proceedings of the 20th International Conference on Artificial Intelligence and Statistics}, pages = {298--306}, year = {2017}, editor = {Singh, Aarti and Zhu, Jerry}, volume = {54}, series = {Proceedings of Machine Learning Research}, month = {20--22 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v54/steorts17a/steorts17a.pdf}, url = {https://proceedings.mlr.press/v54/steorts17a.html}, abstract = {Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional linkage methods directly linking records to one another are computationally infeasible as the number of records grows. As a result, it is increasingly common for researchers to treat linkage as a clustering task, in which each latent entity is associated with one or more noisy database records. We critically assess performance bounds using the Kullback-Leibler (KL) divergence under a Bayesian record linkage framework, making connections to Kolchin partition models. We provide an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity. We give insights for when our bounds hold using simulated data and provide practical user guidance. } }
Endnote
%0 Conference Paper %T Performance Bounds for Graphical Record Linkage %A Rebecca C. Steorts %A Mattew Barnes %A Willie Neiswanger %B Proceedings of the 20th International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2017 %E Aarti Singh %E Jerry Zhu %F pmlr-v54-steorts17a %I PMLR %P 298--306 %U https://proceedings.mlr.press/v54/steorts17a.html %V 54 %X Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional linkage methods directly linking records to one another are computationally infeasible as the number of records grows. As a result, it is increasingly common for researchers to treat linkage as a clustering task, in which each latent entity is associated with one or more noisy database records. We critically assess performance bounds using the Kullback-Leibler (KL) divergence under a Bayesian record linkage framework, making connections to Kolchin partition models. We provide an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity. We give insights for when our bounds hold using simulated data and provide practical user guidance.
APA
Steorts, R.C., Barnes, M. & Neiswanger, W.. (2017). Performance Bounds for Graphical Record Linkage. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 54:298-306 Available from https://proceedings.mlr.press/v54/steorts17a.html.

Related Material