Eigen: Expert-Informed Joint Learning Aggregation for High-Fidelity Information Extraction from Document Images

Abhishek Singh, Venkatapathy Subramanian, Ayush Maheshwari, Pradeep Narayan, Devi Prasad Shetty, Ganesh Ramakrishnan
Proceedings of the 3rd Machine Learning for Health Symposium, PMLR 225:559-573, 2023.

Abstract

Information Extraction (IE) from document images is challenging due to the high variability of layout formats. Deep models such as etc . In this work, we propose a novel approach, EIGEN (Expert-Informed Joint Learning aGgrEgatioN), which combines rule-based methods with deep learning models using data programming approaches to avoid the need to annotate large amounts of training data. Specifically, EIGEN consolidates weak labels induced from multiple heuristics through generative models and uses them along with a small number of annotated labels to jointly train a deep model. In our framework, we propose the use of labeling functions that incorporate contextual information, thus capturing the visual and language context of a word for accurate categorization. We empirically show that our framework can significantly improve the performance of state-of-the-art deep models when only very few labeled data instances are available. Source code is available at https://github.com/ayushayush591/EIGEN-High-Fidelity-Extraction-Document-Images.
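
To make the idea of labeling functions concrete, the following is a minimal sketch and not the authors' implementation. The `Word` fields, the three heuristics, and the label names are hypothetical, chosen only for illustration. The abstract describes aggregating weak labels with generative models; this sketch swaps in a plain majority vote to stay short.

```python
import re
from collections import Counter
from dataclasses import dataclass

ABSTAIN = None  # a labeling function returns this when it has no opinion


@dataclass
class Word:
    """Hypothetical OCR token: text, normalised position, and left neighbour."""
    text: str
    x: float  # left edge, normalised to [0, 1]
    y: float  # top edge, normalised to [0, 1]
    left_neighbor: str = ""


def lf_looks_like_date(word):
    # Language cue: the token itself matches a simple date pattern.
    pattern = r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
    return "DATE" if re.fullmatch(pattern, word.text) else ABSTAIN


def lf_after_date_keyword(word):
    # Contextual cue: the token to the left is the keyword "Date".
    key = word.left_neighbor.lower().rstrip(":")
    return "DATE" if key == "date" else ABSTAIN


def lf_top_of_page_caps(word):
    # Visual cue: upper-case text near the top of the page.
    return "HEADER" if word.y < 0.1 and word.text.isupper() else ABSTAIN


LABELING_FUNCTIONS = [lf_looks_like_date, lf_after_date_keyword, lf_top_of_page_caps]


def weak_label(word):
    """Aggregate labeling-function votes by majority (stand-in for a generative model)."""
    votes = [v for lf in LABELING_FUNCTIONS if (v := lf(word)) is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]
```

In a data programming setup, weak labels of this kind would be combined with a small annotated set to train the deep model; that joint training step is not shown here.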

Cite this Paper


BibTeX
@InProceedings{pmlr-v225-singh23a,
  title = {Eigen: Expert-Informed Joint Learning Aggregation for High-Fidelity Information Extraction from Document Images},
  author = {Singh, Abhishek and Subramanian, Venkatapathy and Maheshwari, Ayush and Narayan, Pradeep and Shetty, Devi Prasad and Ramakrishnan, Ganesh},
  booktitle = {Proceedings of the 3rd Machine Learning for Health Symposium},
  pages = {559--573},
  year = {2023},
  editor = {Hegselmann, Stefan and Parziale, Antonio and Shanmugam, Divya and Tang, Shengpu and Asiedu, Mercy Nyamewaa and Chang, Serina and Hartvigsen, Tom and Singh, Harvineet},
  volume = {225},
  series = {Proceedings of Machine Learning Research},
  month = {10 Dec},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v225/singh23a/singh23a.pdf},
  url = {https://proceedings.mlr.press/v225/singh23a.html},
  abstract = {Information Extraction (IE) from document images is challenging due to the high variability of layout formats. Deep models such as etc . In this work, we propose a novel approach, EIGEN (Expert-Informed Joint Learning aGgrEgatioN), which combines rule-based methods with deep learning models using data programming approaches to avoid the need to annotate large amounts of training data. Specifically, EIGEN consolidates weak labels induced from multiple heuristics through generative models and uses them along with a small number of annotated labels to jointly train a deep model. In our framework, we propose the use of labeling functions that incorporate contextual information, thus capturing the visual and language context of a word for accurate categorization. We empirically show that our framework can significantly improve the performance of state-of-the-art deep models when only very few labeled data instances are available. Source code is available at https://github.com/ayushayush591/EIGEN-High-Fidelity-Extraction-Document-Images.}
}
Endnote
%0 Conference Paper
%T Eigen: Expert-Informed Joint Learning Aggregation for High-Fidelity Information Extraction from Document Images
%A Abhishek Singh
%A Venkatapathy Subramanian
%A Ayush Maheshwari
%A Pradeep Narayan
%A Devi Prasad Shetty
%A Ganesh Ramakrishnan
%B Proceedings of the 3rd Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2023
%E Stefan Hegselmann
%E Antonio Parziale
%E Divya Shanmugam
%E Shengpu Tang
%E Mercy Nyamewaa Asiedu
%E Serina Chang
%E Tom Hartvigsen
%E Harvineet Singh
%F pmlr-v225-singh23a
%I PMLR
%P 559--573
%U https://proceedings.mlr.press/v225/singh23a.html
%V 225
%X Information Extraction (IE) from document images is challenging due to the high variability of layout formats. Deep models such as etc . In this work, we propose a novel approach, EIGEN (Expert-Informed Joint Learning aGgrEgatioN), which combines rule-based methods with deep learning models using data programming approaches to avoid the need to annotate large amounts of training data. Specifically, EIGEN consolidates weak labels induced from multiple heuristics through generative models and uses them along with a small number of annotated labels to jointly train a deep model. In our framework, we propose the use of labeling functions that incorporate contextual information, thus capturing the visual and language context of a word for accurate categorization. We empirically show that our framework can significantly improve the performance of state-of-the-art deep models when only very few labeled data instances are available. Source code is available at https://github.com/ayushayush591/EIGEN-High-Fidelity-Extraction-Document-Images.
APA
Singh, A., Subramanian, V., Maheshwari, A., Narayan, P., Shetty, D.P. & Ramakrishnan, G. (2023). Eigen: Expert-Informed Joint Learning Aggregation for High-Fidelity Information Extraction from Document Images. Proceedings of the 3rd Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 225:559-573. Available from https://proceedings.mlr.press/v225/singh23a.html.
