The Impact of Record Linkage on Learning from Feature Partitioned Data

Richard Nock; Stephen Hardy; Wilko Henecka; Hamish Ivey-Law; Jakub Nabaglo; Giorgio Patrini; Guillaume Smith; Brian Thorne

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:8216-8226, 2021.

Abstract

There has recently been a significant boost to machine learning with distributed data, in particular with the success of federated learning. A common and very challenging setting is that of vertical or feature partitioned data, where multiple data providers hold different features about common entities. In general, training needs to be preceded by record linkage (RL), a step that finds the correspondence between the observations of the datasets. RL is prone to mistakes in the real world. Despite the importance of the problem, there has so far been no formal assessment of the way in which RL errors impact learning models. Work in the area either uses heuristics or assumes that the optimal RL is known in advance. In this paper, we provide the first assessment of the problem for supervised learning. For wide sets of losses, we provide technical conditions under which the classifier learned after noisy RL converges (with the data size) to the best classifier that would be learned from mistake-free RL. This yields new insights on the way the pipeline RL + ML operates, from the role of large margin classification in dampening the impact of RL mistakes to clues on how to further optimize RL as a preprocessing step to ML. Experiments on a large UCI benchmark validate those formal observations.

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-nock21a,
  title = 	 {The Impact of Record Linkage on Learning from Feature Partitioned Data},
  author =       {Nock, Richard and Hardy, Stephen and Henecka, Wilko and Ivey-Law, Hamish and Nabaglo, Jakub and Patrini, Giorgio and Smith, Guillaume and Thorne, Brian},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {8216--8226},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/nock21a/nock21a.pdf},
  url = 	 {https://proceedings.mlr.press/v139/nock21a.html},
  abstract = 	 {There has recently been a significant boost to machine learning with distributed data, in particular with the success of federated learning. A common and very challenging setting is that of vertical or feature partitioned data, where multiple data providers hold different features about common entities. In general, training needs to be preceded by record linkage (RL), a step that finds the correspondence between the observations of the datasets. RL is prone to mistakes in the real world. Despite the importance of the problem, there has so far been no formal assessment of the way in which RL errors impact learning models. Work in the area either uses heuristics or assumes that the optimal RL is known in advance. In this paper, we provide the first assessment of the problem for supervised learning. For wide sets of losses, we provide technical conditions under which the classifier learned after noisy RL converges (with the data size) to the best classifier that would be learned from mistake-free RL. This yields new insights on the way the pipeline RL + ML operates, from the role of large margin classification in dampening the impact of RL mistakes to clues on how to further optimize RL as a preprocessing step to ML. Experiments on a large UCI benchmark validate those formal observations.}
}

Endnote

%0 Conference Paper
%T The Impact of Record Linkage on Learning from Feature Partitioned Data
%A Richard Nock
%A Stephen Hardy
%A Wilko Henecka
%A Hamish Ivey-Law
%A Jakub Nabaglo
%A Giorgio Patrini
%A Guillaume Smith
%A Brian Thorne
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-nock21a
%I PMLR
%P 8216--8226
%U https://proceedings.mlr.press/v139/nock21a.html
%V 139
%X There has recently been a significant boost to machine learning with distributed data, in particular with the success of federated learning. A common and very challenging setting is that of vertical or feature partitioned data, where multiple data providers hold different features about common entities. In general, training needs to be preceded by record linkage (RL), a step that finds the correspondence between the observations of the datasets. RL is prone to mistakes in the real world. Despite the importance of the problem, there has so far been no formal assessment of the way in which RL errors impact learning models. Work in the area either uses heuristics or assumes that the optimal RL is known in advance. In this paper, we provide the first assessment of the problem for supervised learning. For wide sets of losses, we provide technical conditions under which the classifier learned after noisy RL converges (with the data size) to the best classifier that would be learned from mistake-free RL. This yields new insights on the way the pipeline RL + ML operates, from the role of large margin classification in dampening the impact of RL mistakes to clues on how to further optimize RL as a preprocessing step to ML. Experiments on a large UCI benchmark validate those formal observations.

APA

Nock, R., Hardy, S., Henecka, W., Ivey-Law, H., Nabaglo, J., Patrini, G., Smith, G., & Thorne, B. (2021). The Impact of Record Linkage on Learning from Feature Partitioned Data. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:8216-8226. Available from https://proceedings.mlr.press/v139/nock21a.html.
