CARTE: Pretraining and Transfer for Tabular Learning

Myung Jun Kim, Leo Grinsztajn, Gael Varoquaux
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:23843-23866, 2024.

Abstract

Pretrained deep-learning models are the go-to solution for images or text. However, for tabular data the standard is still to train tree-based models. Indeed, transfer learning on tables hits the challenge of data integration: finding correspondences in the entries (entity matching), where different words may denote the same entity, and correspondences across columns (schema matching), which may come in different orders, under different names... We propose a neural architecture that does not need such correspondences. As a result, we can pretrain it on background data that has not been matched. The architecture (CARTE, for Context Aware Representation of Table Entries) uses a graph representation of tabular (or relational) data to process tables with different columns, string embeddings of entries and column names to model an open vocabulary, and a graph-attentional network to contextualize entries with column names and neighboring entries. An extensive benchmark shows that CARTE facilitates learning, outperforming a solid set of baselines including the best tree-based models. CARTE also enables joint learning across tables with unmatched columns, enhancing a small table with bigger ones. CARTE opens the door to large pretrained models for tabular data.
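As a rough illustration of the graph representation the abstract describes, here is a minimal sketch in Python. The star-graph layout, the hash-based embed() stand-in, and all names and dimensions are illustrative assumptions, not the authors' implementation; in the paper, entries and column names are mapped through pretrained string embeddings and the resulting graphs are processed by a graph-attentional network.

import hashlib
import numpy as np

DIM = 8  # toy embedding width; a stand-in for the paper's pretrained string embeddings

def embed(text, dim=DIM):
    # Deterministic stand-in for a pretrained string embedder: hash the
    # string to seed a random vector. In CARTE, strings are embedded with
    # pretrained language embeddings so that an open vocabulary (unseen
    # columns, unseen entities) lands in a shared vector space.
    seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).normal(size=dim)

def row_to_graphlet(row):
    # One table row becomes a small star graph: node 0 is a readout node
    # for the whole row, each cell is a leaf node, and each edge carries
    # the embedding of its column name, so attention can contextualize
    # every entry with its column name and neighboring entries.
    nodes = [np.zeros(DIM)]                       # center / readout node
    edges, edge_feats = [], []
    for j, (col, val) in enumerate(row.items(), start=1):
        if isinstance(val, (int, float)):         # numerical cell
            nodes.append(float(val) * np.ones(DIM) / np.sqrt(DIM))
        else:                                     # string cell
            nodes.append(embed(str(val)))
        edges.append((0, j))
        edge_feats.append(embed(col))
    return {"x": np.stack(nodes),
            "edge_index": np.array(edges).T,
            "edge_attr": np.stack(edge_feats)}

# Rows from tables with different schemas map to the same kind of
# graphlet, with no entity or schema matching needed beforehand.
g = row_to_graphlet({"Name": "Chateau Margaux", "Region": "Bordeaux", "Year": 2015})
print(g["x"].shape, g["edge_index"].shape, g["edge_attr"].shape)  # (4, 8) (2, 3) (3, 8)

Because column names live on the edges rather than in a fixed schema, two tables with different, unmatched columns produce structurally identical graphlets, which is what makes pretraining on unmatched background data possible.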

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-kim24d,
  title     = {{CARTE}: Pretraining and Transfer for Tabular Learning},
  author    = {Kim, Myung Jun and Grinsztajn, Leo and Varoquaux, Gael},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {23843--23866},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/kim24d/kim24d.pdf},
  url       = {https://proceedings.mlr.press/v235/kim24d.html}
}
EndNote
%0 Conference Paper
%T CARTE: Pretraining and Transfer for Tabular Learning
%A Myung Jun Kim
%A Leo Grinsztajn
%A Gael Varoquaux
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-kim24d
%I PMLR
%P 23843--23866
%U https://proceedings.mlr.press/v235/kim24d.html
%V 235
APA
Kim, M. J., Grinsztajn, L., & Varoquaux, G. (2024). CARTE: Pretraining and Transfer for Tabular Learning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:23843-23866. Available from https://proceedings.mlr.press/v235/kim24d.html.