BlockBoost: Scalable and Efficient Blocking through Boosting

Thiago Ramos, Rodrigo Loro Schuller, Alex Akira Okuno, Lucas Nissenbaum, Roberto I Oliveira, Paulo Orenstein
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:2575-2583, 2024.

Abstract

As datasets grow larger, matching and merging entries from different databases has become a costly task in modern data pipelines. To avoid expensive comparisons between entries, blocking similar items is a popular preprocessing step. In this paper, we introduce BlockBoost, a novel boosting-based method that generates compact binary hash codes for database entries, through which blocking can be performed efficiently. The algorithm is fast and scalable, resulting in computational costs that are orders of magnitude lower than current benchmarks. Unlike existing alternatives, BlockBoost comes with associated feature importance measures for interpretability, and possesses strong theoretical guarantees, including lower bounds on critical performance metrics like recall and reduction ratio. Finally, we show that BlockBoost delivers great empirical results, outperforming state-of-the-art blocking benchmarks in terms of both performance metrics and computational cost.

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-ramos24a, title = { BlockBoost: Scalable and Efficient Blocking through Boosting }, author = {Ramos, Thiago and Loro Schuller, Rodrigo and Akira Okuno, Alex and Nissenbaum, Lucas and I Oliveira, Roberto and Orenstein, Paulo}, booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics}, pages = {2575--2583}, year = {2024}, editor = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen}, volume = {238}, series = {Proceedings of Machine Learning Research}, month = {02--04 May}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v238/ramos24a/ramos24a.pdf}, url = {https://proceedings.mlr.press/v238/ramos24a.html}, abstract = { As datasets grow larger, matching and merging entries from different databases has become a costly task in modern data pipelines. To avoid expensive comparisons between entries, blocking similar items is a popular preprocessing step. In this paper, we introduce BlockBoost, a novel boosting-based method that generates compact binary hash codes for database entries, through which blocking can be performed efficiently. The algorithm is fast and scalable, resulting in computational costs that are orders of magnitude lower than current benchmarks. Unlike existing alternatives, BlockBoost comes with associated feature importance measures for interpretability, and possesses strong theoretical guarantees, including lower bounds on critical performance metrics like recall and reduction ratio. Finally, we show that BlockBoost delivers great empirical results, outperforming state-of-the-art blocking benchmarks in terms of both performance metrics and computational cost. } }
Endnote
%0 Conference Paper %T BlockBoost: Scalable and Efficient Blocking through Boosting %A Thiago Ramos %A Rodrigo Loro Schuller %A Alex Akira Okuno %A Lucas Nissenbaum %A Roberto I Oliveira %A Paulo Orenstein %B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2024 %E Sanjoy Dasgupta %E Stephan Mandt %E Yingzhen Li %F pmlr-v238-ramos24a %I PMLR %P 2575--2583 %U https://proceedings.mlr.press/v238/ramos24a.html %V 238 %X As datasets grow larger, matching and merging entries from different databases has become a costly task in modern data pipelines. To avoid expensive comparisons between entries, blocking similar items is a popular preprocessing step. In this paper, we introduce BlockBoost, a novel boosting-based method that generates compact binary hash codes for database entries, through which blocking can be performed efficiently. The algorithm is fast and scalable, resulting in computational costs that are orders of magnitude lower than current benchmarks. Unlike existing alternatives, BlockBoost comes with associated feature importance measures for interpretability, and possesses strong theoretical guarantees, including lower bounds on critical performance metrics like recall and reduction ratio. Finally, we show that BlockBoost delivers great empirical results, outperforming state-of-the-art blocking benchmarks in terms of both performance metrics and computational cost.
APA
Ramos, T., Loro Schuller, R., Akira Okuno, A., Nissenbaum, L., I Oliveira, R. & Orenstein, P.. (2024). BlockBoost: Scalable and Efficient Blocking through Boosting . Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:2575-2583 Available from https://proceedings.mlr.press/v238/ramos24a.html.

Related Material