Compositional Condition Question Answering in Tabular Understanding

Jun-Peng Jiang, Tao Zhou, De-Chuan Zhan, Han-Jia Ye
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:27831-27850, 2025.

Abstract

Multimodal Large Language Models (MLLMs) for tabular understanding have made significant progress in tasks such as financial report analysis and public data tests. However, our comprehensive analysis shows that these models are still limited in certain simple scenarios, particularly when handling compositional conditions in QA. Further investigation reveals that the poor performance can be attributed to two main challenges: the visual encoder’s inability to accurately recognize the content of a row, and the model’s tendency to overlook conditions in the question. To address these challenges, we introduce a new Compositional Condition Tabular Understanding method, called CoCoTab. Specifically, to capture the structural relationships within tables, we enhance the visual encoder with additional row and column patches. Moreover, we introduce conditional tokens between the visual patches and query embeddings, ensuring the model focuses on the parts of the table relevant to the conditions specified in the query. Additionally, we introduce the Massive Multimodal Tabular Understanding (MMTU) benchmark, which comprehensively assesses the full capabilities of MLLMs in tabular understanding. Our proposed method achieves state-of-the-art performance on both existing tabular understanding benchmarks and MMTU. Our code is available at https://github.com/LAMDA-Tabular/MMTU.
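To make the two architectural ideas in the abstract concrete, below is a minimal PyTorch sketch of (a) a patch-based visual encoder augmented with pooled row and column patches and (b) learned conditional tokens bridging the visual patches and the query embeddings. This is an illustration of the mechanism as described above, not the authors' CoCoTab implementation: all module names (RowColPatchEncoder, ConditionalBridge), dimensions, and the mean-pooling and cross-attention choices are assumptions made for exposition.

# Illustrative sketch only; module names, dimensions, and wiring are
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn


class RowColPatchEncoder(nn.Module):
    """Standard grid patches plus one pooled patch per row and per column."""

    def __init__(self, dim=256, patch=16):
        super().__init__()
        # Grid patches: one embedding per (patch x patch) cell of the image.
        self.patch_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Row / column patches: project the mean of each grid row / column.
        self.row_proj = nn.Linear(dim, dim)
        self.col_proj = nn.Linear(dim, dim)

    def forward(self, image):                       # image: (B, 3, H, W)
        g = self.patch_proj(image)                  # (B, dim, Hp, Wp)
        grid_tokens = g.flatten(2).transpose(1, 2)  # (B, Hp*Wp, dim)
        row_tokens = self.row_proj(g.mean(dim=3).transpose(1, 2))  # (B, Hp, dim)
        col_tokens = self.col_proj(g.mean(dim=2).transpose(1, 2))  # (B, Wp, dim)
        # Visual sequence = grid patches + row patches + column patches.
        return torch.cat([grid_tokens, row_tokens, col_tokens], dim=1)


class ConditionalBridge(nn.Module):
    """Learned conditional tokens that attend to the query, then to the table."""

    def __init__(self, dim=256, num_cond=8, heads=4):
        super().__init__()
        self.cond = nn.Parameter(torch.randn(num_cond, dim) * 0.02)
        self.read_query = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.read_table = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens, query_embeds):
        B = visual_tokens.size(0)
        cond = self.cond.unsqueeze(0).expand(B, -1, -1)
        # 1) Conditional tokens gather the conditions stated in the question.
        cond, _ = self.read_query(cond, query_embeds, query_embeds)
        # 2) They then select the table regions matching those conditions.
        cond, _ = self.read_table(cond, visual_tokens, visual_tokens)
        # Sequence handed to the language model: vision, conditions, question.
        return torch.cat([visual_tokens, cond, query_embeds], dim=1)


if __name__ == "__main__":
    enc, bridge = RowColPatchEncoder(), ConditionalBridge()
    img = torch.randn(2, 3, 256, 256)               # a rendered table image
    qry = torch.randn(2, 12, 256)                   # embedded question tokens
    out = bridge(enc(img), qry)
    print(out.shape)                                # torch.Size([2, 308, 256])

Pooling each grid row and column into a dedicated patch gives the encoder tokens that align with table rows and columns, while the conditional tokens first read the conditions from the question and then use them to attend over the table, which is one plausible way to keep the model from overlooking conditions in the query.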

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-jiang25o,
  title     = {Compositional Condition Question Answering in Tabular Understanding},
  author    = {Jiang, Jun-Peng and Zhou, Tao and Zhan, De-Chuan and Ye, Han-Jia},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {27831--27850},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/jiang25o/jiang25o.pdf},
  url       = {https://proceedings.mlr.press/v267/jiang25o.html},
  abstract  = {Multimodal Large Language Models (MLLMs) for tabular understanding have made significant progress in tasks such as financial report analysis and public data tests. However, our comprehensive analysis shows that these models are still limited in certain simple scenarios, particularly when handling compositional conditions in QA. Further investigation reveals that the poor performance can be attributed to two main challenges: the visual encoder’s inability to accurately recognize the content of a row, and the model’s tendency to overlook conditions in the question. To address these challenges, we introduce a new Compositional Condition Tabular Understanding method, called CoCoTab. Specifically, to capture the structural relationships within tables, we enhance the visual encoder with additional row and column patches. Moreover, we introduce conditional tokens between the visual patches and query embeddings, ensuring the model focuses on the parts of the table relevant to the conditions specified in the query. Additionally, we introduce the Massive Multimodal Tabular Understanding (MMTU) benchmark, which comprehensively assesses the full capabilities of MLLMs in tabular understanding. Our proposed method achieves state-of-the-art performance on both existing tabular understanding benchmarks and MMTU. Our code is available at https://github.com/LAMDA-Tabular/MMTU.}
}
Endnote
%0 Conference Paper
%T Compositional Condition Question Answering in Tabular Understanding
%A Jun-Peng Jiang
%A Tao Zhou
%A De-Chuan Zhan
%A Han-Jia Ye
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-jiang25o
%I PMLR
%P 27831--27850
%U https://proceedings.mlr.press/v267/jiang25o.html
%V 267
%X Multimodal Large Language Models (MLLMs) for tabular understanding have made significant progress in tasks such as financial report analysis and public data tests. However, our comprehensive analysis shows that these models are still limited in certain simple scenarios, particularly when handling compositional conditions in QA. Further investigation reveals that the poor performance can be attributed to two main challenges: the visual encoder’s inability to accurately recognize the content of a row, and the model’s tendency to overlook conditions in the question. To address these challenges, we introduce a new Compositional Condition Tabular Understanding method, called CoCoTab. Specifically, to capture the structural relationships within tables, we enhance the visual encoder with additional row and column patches. Moreover, we introduce conditional tokens between the visual patches and query embeddings, ensuring the model focuses on the parts of the table relevant to the conditions specified in the query. Additionally, we introduce the Massive Multimodal Tabular Understanding (MMTU) benchmark, which comprehensively assesses the full capabilities of MLLMs in tabular understanding. Our proposed method achieves state-of-the-art performance on both existing tabular understanding benchmarks and MMTU. Our code is available at https://github.com/LAMDA-Tabular/MMTU.
APA
Jiang, J., Zhou, T., Zhan, D. & Ye, H. (2025). Compositional Condition Question Answering in Tabular Understanding. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:27831-27850. Available from https://proceedings.mlr.press/v267/jiang25o.html.