A Multimodal Data Extraction Pipeline with Table Layout Correction

Kecen Yao, Anton Shesternev, Ahmad Pesaranghader, Erin Li
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:64-76, 2026.

Abstract

Financial documents such as paystubs, invoices, and financial statements contain heterogeneous layouts and visually complex tables, making reliable information extraction challenging for both optical character recognition (OCR)-based pipelines and end-to-end vision–language models (VLMs). In this paper, we present a pipeline that unifies layout analysis, one-shot multimodal table correction, and downstream extraction and reasoning without any model fine-tuning. The pipeline converts document images into a hybrid Markdown-HTML representation and applies a multimodal correction module to rectify layout-level errors in tables, yielding demonstrable improvements in Tree-Edit-Distance-based Similarity (TEDS) scores. Additionally, using this corrected representation, the system performs robust schema-based extraction and document-level question answering. Experimental results across paystub field extraction and Finance question-answering (QA) tasks show that our approach consistently outperforms both OCR-only pipelines and direct VLM baselines. These results demonstrate that incorporating explicit table layout and multimodal table correction provides a scalable and generalizable path toward robust financial document understanding.

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-yao26a,
  title = {A Multimodal Data Extraction Pipeline with Table Layout Correction},
  author = {Yao, Kecen and Shesternev, Anton and Pesaranghader, Ahmad and Li, Erin},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {64--76},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/yao26a/yao26a.pdf},
  url = {https://proceedings.mlr.press/v318/yao26a.html},
  abstract = {Financial documents such as paystubs, invoices, and financial statements contain heterogeneous layouts and visually complex tables, making reliable information extraction challenging for both optical character recognition (OCR)-based pipelines and end-to-end vision–language models (VLMs). In this paper, we present a pipeline that unifies layout analysis, one-shot multimodal table correction, and downstream extraction and reasoning without any model fine-tuning. The pipeline converts document images into a hybrid Markdown-HTML representation and applies a multimodal correction module to rectify layout-level errors in tables, yielding demonstrable improvements in Tree-Edit-Distance-based Similarity (TEDS) scores. Additionally, using this corrected representation, the system performs robust schema-based extraction and document-level question answering. Experimental results across paystub field extraction and Finance question-answering (QA) tasks show that our approach consistently outperforms both OCR-only pipelines and direct VLM baselines. These results demonstrate that incorporating explicit table layout and multimodal table correction provides a scalable and generalizable path toward robust financial document understanding.}
}
Endnote
%0 Conference Paper
%T A Multimodal Data Extraction Pipeline with Table Layout Correction
%A Kecen Yao
%A Anton Shesternev
%A Ahmad Pesaranghader
%A Erin Li
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-yao26a
%I PMLR
%P 64--76
%U https://proceedings.mlr.press/v318/yao26a.html
%V 318
%X Financial documents such as paystubs, invoices, and financial statements contain heterogeneous layouts and visually complex tables, making reliable information extraction challenging for both optical character recognition (OCR)-based pipelines and end-to-end vision–language models (VLMs). In this paper, we present a pipeline that unifies layout analysis, one-shot multimodal table correction, and downstream extraction and reasoning without any model fine-tuning. The pipeline converts document images into a hybrid Markdown-HTML representation and applies a multimodal correction module to rectify layout-level errors in tables, yielding demonstrable improvements in Tree-Edit-Distance-based Similarity (TEDS) scores. Additionally, using this corrected representation, the system performs robust schema-based extraction and document-level question answering. Experimental results across paystub field extraction and Finance question-answering (QA) tasks show that our approach consistently outperforms both OCR-only pipelines and direct VLM baselines. These results demonstrate that incorporating explicit table layout and multimodal table correction provides a scalable and generalizable path toward robust financial document understanding.
APA
Yao, K., Shesternev, A., Pesaranghader, A., & Li, E. (2026). A Multimodal Data Extraction Pipeline with Table Layout Correction. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:64-76. Available from https://proceedings.mlr.press/v318/yao26a.html.
