Eigen: Expert-Informed Joint Learning Aggregation for High-Fidelity Information Extraction from Document Images
Proceedings of the 3rd Machine Learning for Health Symposium, PMLR 225:559-573, 2023.
Information Extraction (IE) from document images is challenging due to the high variability of layout formats. Deep models such as etc . In this work, we propose a novel approach, EIGEN (Expert-Informed Joint Learning aGgrEgatioN), which combines rule-based methods with deep learning models using data programming approaches to avoid the need to annotate large amounts of training data. Specifically, EIGEN consolidates weak labels induced from multiple heuristics through generative models and uses them, along with a small number of annotated labels, to jointly train a deep model. In our framework, we propose labeling functions that incorporate contextual information, thus capturing the visual and language context of a word for accurate categorization. We empirically show that our framework can significantly improve the performance of state-of-the-art deep models when only very few labeled data instances are available. Source code is available at https://github.com/ayushayush591/EIGEN-High-Fidelity-Extraction-Document-Images.
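To make the data programming idea concrete, the following is a minimal sketch of labeling functions that use a word's neighbours as context and of how their votes can be consolidated into one weak label per word. It is an illustration only, not the EIGEN implementation: the function names, the label set (DATE, TOTAL, OTHER) and the example tokens are hypothetical, a simple majority vote stands in for the generative label model described above, and visual (layout) context is left out.

```python
import re
from collections import Counter

ABSTAIN = None  # returned when a labeling function has no opinion

# Hypothetical labeling functions. Each receives the full token list and an
# index, so it can inspect neighbouring words as language context.
def lf_date_format(tokens, i):
    """Vote DATE if the token looks like dd/mm/yyyy."""
    return "DATE" if re.fullmatch(r"\d{2}/\d{2}/\d{4}", tokens[i]) else ABSTAIN

def lf_after_date_keyword(tokens, i):
    """Contextual rule: vote DATE if the previous token is 'Date:'."""
    return "DATE" if i > 0 and tokens[i - 1].lower() == "date:" else ABSTAIN

def lf_amount_after_total(tokens, i):
    """Contextual rule: vote TOTAL for an amount that follows 'Total'."""
    follows_total = i > 0 and tokens[i - 1].lower() in {"total", "total:"}
    is_amount = re.fullmatch(r"\$?\d+(\.\d{2})?", tokens[i]) is not None
    return "TOTAL" if follows_total and is_amount else ABSTAIN

LABELING_FUNCTIONS = [lf_date_format, lf_after_date_keyword, lf_amount_after_total]

def aggregate_weak_labels(tokens, default="OTHER"):
    """Consolidate votes per token by majority vote.

    Assumption: majority vote replaces the generative model here purely
    to keep the sketch short and dependency-free.
    """
    labels = []
    for i in range(len(tokens)):
        votes = [lf(tokens, i) for lf in LABELING_FUNCTIONS]
        votes = [v for v in votes if v is not ABSTAIN]
        labels.append(Counter(votes).most_common(1)[0][0] if votes else default)
    return labels

tokens = ["Date:", "12/03/2021", "Total", "$45.00"]  # made-up example
weak_labels = aggregate_weak_labels(tokens)
print(list(zip(tokens, weak_labels)))
```

In a joint-learning setup of the kind the abstract describes, weak labels like these would be combined with the small annotated set when training the deep model; that training step is not shown here.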