DocQIR-Emb: Document Image Retrieval with Multi-lingual Question Query

Chih-Hui Ho, Giovanna Carreira Marinho, Felipe Viana, Varad Pimpalkhute, Rodolfo Luis Tonoli, Andre Von Zuben
Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:183-198, 2026.

Abstract

Document image retrieval is a fundamental task in document understanding: given a user's question, the goal is to retrieve the relevant images from the document that answer it. Unlike other text-to-image tasks, which mainly focus on aligning natural images with their captions, document image retrieval requires the model to understand the user's question and return the related table image or scientific image. The significant domain gap between image captions and user questions, and between natural and scientific images, prevents off-the-shelf retrieval models from being directly applicable. To systematically study this degradation, we curate a novel multi-lingual Document Question-Image Retrieval benchmark, DocQIR, that covers questions in 5 different languages. Our preliminary study shows that off-the-shelf retrieval models fail to retrieve document images when questions in various languages are presented. To address this issue, we propose a novel architecture, DocQIR-Emb, that leverages a multi-lingual text embedder and a VLM to encode a question and an image into a shared feature space. Since the multi-lingual embedder is trained to align text across languages, the text embedder is frozen and only the VLM is optimized. Experiments show that DocQIR-Emb outperforms the baseline by at least 40% on the proposed DocQIR dataset, and the gain is consistent across table images and scientific images. Different architecture designs are also ablated to demonstrate the effectiveness of DocQIR-Emb.
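The retrieval scheme the abstract describes can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `frozen_text_embed` and `ImageEncoder` are hypothetical stand-ins for the frozen multi-lingual text embedder and the trainable VLM image branch, which in DocQIR-Emb are neural models. Retrieval scores images by cosine similarity to the question in the shared feature space.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # shared embedding dimension (illustrative only)


def frozen_text_embed(question_id: int) -> np.ndarray:
    """Stand-in for the frozen multi-lingual text embedder: a question
    (in any language) maps to a fixed, unit-norm vector; no parameters
    here are trained, mirroring the frozen embedder in the paper."""
    rng_q = np.random.default_rng(question_id)
    v = rng_q.normal(size=DIM)
    return v / np.linalg.norm(v)


class ImageEncoder:
    """Stand-in for the VLM image branch. Only this projection would be
    optimized, to pull image embeddings toward the frozen text space."""

    def __init__(self) -> None:
        self.W = np.eye(DIM)  # trainable projection (identity in this toy)

    def encode(self, raw_feat: np.ndarray) -> np.ndarray:
        z = self.W @ raw_feat
        return z / np.linalg.norm(z)


def retrieve(question_id: int, image_feats: list, enc: ImageEncoder) -> int:
    """Return the index of the gallery image whose embedding has the
    highest cosine similarity with the question embedding."""
    q = frozen_text_embed(question_id)
    sims = [float(q @ enc.encode(f)) for f in image_feats]
    return int(np.argmax(sims))


# Toy gallery: image 0's raw features sit near question 42's embedding,
# so it should be retrieved; the other two are random distractors.
q_vec = frozen_text_embed(42)
gallery = [q_vec + 0.05 * rng.normal(size=DIM),
           rng.normal(size=DIM),
           rng.normal(size=DIM)]
enc = ImageEncoder()
print(retrieve(42, gallery, enc))  # image 0 is the closest match
```

In the actual system, the projection (here `W`) would be trained with a contrastive-style objective over question-image pairs while the text side stays frozen, which is what lets one image index serve questions in all five languages.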

Cite this Paper


BibTeX
@InProceedings{pmlr-v322-ho26a, title = {Doc{QIR}-Emb: Document Image Retrieval with Multi-lingual Question Query}, author = {Ho, Chih-Hui and Marinho, Giovanna Carreira and Viana, Felipe and Pimpalkhute, Varad and Tonoli, Rodolfo Luis and Zuben, Andre Von}, booktitle = {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models}, pages = {183--198}, year = {2026}, editor = {Fumero, Marco and Domine, Clementine and L\"ahner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex}, volume = {322}, series = {Proceedings of Machine Learning Research}, month = {06 Dec}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v322/main/assets/ho26a/ho26a.pdf}, url = {https://proceedings.mlr.press/v322/ho26a.html}, abstract = {Document image retrieval is a fundamental task for improving document understanding, where the goal is to retrieve relevant images in the document and to answer the question from the user. Unlike other text-to-image tasks, which mainly focus on the alignment between image caption and natural image, document image retrieval requires the model to understand the question from user and return related table image or scientific image. The significant domain difference between image caption and user question, as well as natural image and scientific images, prevents the off-the-shelf retrieval model from becoming applicable. To systematically study the degradation, we curate a novel multi-lingual Document Question-Image Retrieval benchmark, DocQIR, that covers questions in 5 different languages. Our preliminary study shows that off-the-shelf retrieval models fail to retrieve documents images when questions in various languages are presented. To address this issue, we proposed a novel architecture, DocQIR-Emb, that leverages a multi-lingual text embedder and a VLM to encode a question and an image into a shared feature space. 
Since the multi-lingual embedder is trained to align text in different languages, the text embedder is frozen and only the VLM is optimized. Experiments show that DocQIR-Emb outperforms the baseline by at least 40% on the proposed DocQIR dataset and the gain is consistent across table image and scientific image. Different architecture designs are also ablated to demonstrate the effectiveness of DocQIR-Emb.} }
Endnote
%0 Conference Paper %T DocQIR-Emb: Document Image Retrieval with Multi-lingual Question Query %A Chih-Hui Ho %A Giovanna Carreira Marinho %A Felipe Viana %A Varad Pimpalkhute %A Rodolfo Luis Tonoli %A Andre Von Zuben %B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models %C Proceedings of Machine Learning Research %D 2026 %E Marco Fumero %E Clementine Domine %E Zorah Lähner %E Irene Cannistraci %E Bo Zhao %E Alex Williams %F pmlr-v322-ho26a %I PMLR %P 183--198 %U https://proceedings.mlr.press/v322/ho26a.html %V 322 %X Document image retrieval is a fundamental task for improving document understanding, where the goal is to retrieve relevant images in the document and to answer the question from the user. Unlike other text-to-image tasks, which mainly focus on the alignment between image caption and natural image, document image retrieval requires the model to understand the question from user and return related table image or scientific image. The significant domain difference between image caption and user question, as well as natural image and scientific images, prevents the off-the-shelf retrieval model from becoming applicable. To systematically study the degradation, we curate a novel multi-lingual Document Question-Image Retrieval benchmark, DocQIR, that covers questions in 5 different languages. Our preliminary study shows that off-the-shelf retrieval models fail to retrieve documents images when questions in various languages are presented. To address this issue, we proposed a novel architecture, DocQIR-Emb, that leverages a multi-lingual text embedder and a VLM to encode a question and an image into a shared feature space. Since the multi-lingual embedder is trained to align text in different languages, the text embedder is frozen and only the VLM is optimized. 
Experiments show that DocQIR-Emb outperforms the baseline by at least 40% on the proposed DocQIR dataset and the gain is consistent across table image and scientific image. Different architecture designs are also ablated to demonstrate the effectiveness of DocQIR-Emb.
APA
Ho, C., Marinho, G.C., Viana, F., Pimpalkhute, V., Tonoli, R.L. & Zuben, A.V. (2026). DocQIR-Emb: Document Image Retrieval with Multi-lingual Question Query. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:183-198. Available from https://proceedings.mlr.press/v322/ho26a.html.
