CARTE: Pretraining and Transfer for Tabular Learning

Myung Jun Kim, Leo Grinsztajn, Gael Varoquaux
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:23843-23866, 2024.

Abstract

Pretrained deep-learning models are the go-to solution for images or text. However, for tabular data the standard is still to train tree-based models. Indeed, transfer learning on tables hits the challenge of data integration: finding correspondences in the entries (entity matching), where different words may denote the same entity, and correspondences across columns (schema matching), which may come in different orders, under different names... We propose a neural architecture that does not need such correspondences. As a result, we can pretrain it on background data that has not been matched. The architecture (CARTE, for Context Aware Representation of Table Entries) uses a graph representation of tabular (or relational) data to process tables with different columns, string embeddings of entries and column names to model an open vocabulary, and a graph-attentional network to contextualize entries with column names and neighboring entries. An extensive benchmark shows that CARTE facilitates learning, outperforming a solid set of baselines including the best tree-based models. CARTE also enables joint learning across tables with unmatched columns, enhancing a small table with bigger ones. CARTE opens the door to large pretrained models for tabular data.
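As a rough illustration of the graph representation the abstract describes, here is a minimal sketch in Python. The star-graph layout, the hash-based embed() stand-in, and all names and dimensions are illustrative assumptions, not the authors' implementation; in the paper, entries and column names are mapped through pretrained string embeddings and the resulting graphs are processed by a graph-attentional network.

import hashlib
import numpy as np

DIM = 8  # toy embedding width; a stand-in for the paper's pretrained string embeddings

def embed(text, dim=DIM):
    # Deterministic stand-in for a pretrained string embedder: hash the
    # string to seed a random vector. In CARTE, strings are embedded with
    # pretrained language embeddings so that an open vocabulary (unseen
    # columns, unseen entities) lands in a shared vector space.
    seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).normal(size=dim)

def row_to_graphlet(row):
    # One table row becomes a small star graph: node 0 is a readout node
    # for the whole row, each cell is a leaf node, and each edge carries
    # the embedding of its column name, so attention can contextualize
    # every entry with its column name and neighboring entries.
    nodes = [np.zeros(DIM)]                       # center / readout node
    edges, edge_feats = [], []
    for j, (col, val) in enumerate(row.items(), start=1):
        if isinstance(val, (int, float)):         # numerical cell
            nodes.append(float(val) * np.ones(DIM) / np.sqrt(DIM))
        else:                                     # string cell
            nodes.append(embed(str(val)))
        edges.append((0, j))
        edge_feats.append(embed(col))
    return {"x": np.stack(nodes),
            "edge_index": np.array(edges).T,
            "edge_attr": np.stack(edge_feats)}

# Rows from tables with different schemas map to the same kind of
# graphlet, with no entity or schema matching needed beforehand.
g = row_to_graphlet({"Name": "Chateau Margaux", "Region": "Bordeaux", "Year": 2015})
print(g["x"].shape, g["edge_index"].shape, g["edge_attr"].shape)  # (4, 8) (2, 3) (3, 8)

Because column names live on the edges rather than in a fixed schema, two tables with different, unmatched columns produce structurally identical graphlets, which is what makes pretraining on unmatched background data possible.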

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-kim24d,
  title     = {{CARTE}: Pretraining and Transfer for Tabular Learning},
  author    = {Kim, Myung Jun and Grinsztajn, Leo and Varoquaux, Gael},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {23843--23866},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/kim24d/kim24d.pdf},
  url       = {https://proceedings.mlr.press/v235/kim24d.html}
}
EndNote
%0 Conference Paper
%T CARTE: Pretraining and Transfer for Tabular Learning
%A Myung Jun Kim
%A Leo Grinsztajn
%A Gael Varoquaux
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-kim24d
%I PMLR
%P 23843--23866
%U https://proceedings.mlr.press/v235/kim24d.html
%V 235
APA
Kim, M. J., Grinsztajn, L., & Varoquaux, G. (2024). CARTE: Pretraining and Transfer for Tabular Learning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:23843-23866. Available from https://proceedings.mlr.press/v235/kim24d.html.