Enhancing Multimodal Product Retrieval in E-Commerce by Reversing Typographic Attacks

Janet Jenq, Hongda Shen
Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:10-29, 2026.

Abstract

Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by overlaying relevant textual content (e.g., titles, descriptions) directly onto product images to perform additional vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using five state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.
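As a rough illustration of the idea described in the abstract, the sketch below renders a product title onto the product image before encoding it with a CLIP-style model. It is a minimal sketch only, not the authors' implementation: the overlay layout (a white band appended below the image), the default font, the checkpoint name, and the file/variable names are illustrative assumptions.

    from PIL import Image, ImageDraw
    import torch
    from transformers import CLIPModel, CLIPProcessor

    def overlay_metadata(image, text, band_height=40):
        # Append a white band below the product image and render the metadata text into it.
        canvas = Image.new("RGB", (image.width, image.height + band_height), "white")
        canvas.paste(image, (0, 0))
        draw = ImageDraw.Draw(canvas)
        draw.text((5, image.height + 5), text, fill="black")  # default PIL font
        return canvas

    # Hypothetical checkpoint and inputs, chosen for illustration.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("sneaker.jpg").convert("RGB")   # placeholder product image
    title = "Air Jordan 1 Retro High OG"               # placeholder product title

    augmented = overlay_metadata(image, title)
    inputs = processor(images=augmented, return_tensors="pt")
    with torch.no_grad():
        image_emb = model.get_image_features(**inputs)
    # L2-normalize so dot products give cosine similarity for retrieval.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

In a retrieval setting, the normalized embedding of the metadata-augmented image would replace the plain image embedding when indexing or querying the product catalog.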

Cite this Paper


BibTeX
@InProceedings{pmlr-v322-jenq26a,
  title = {Enhancing Multimodal Product Retrieval in E-Commerce by Reversing Typographic Attacks},
  author = {Jenq, Janet and Shen, Hongda},
  booktitle = {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models},
  pages = {10--29},
  year = {2026},
  editor = {Fumero, Marco and Domine, Clementine and L{\"a}hner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex},
  volume = {322},
  series = {Proceedings of Machine Learning Research},
  month = {06 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v322/main/assets/jenq26a/jenq26a.pdf},
  url = {https://proceedings.mlr.press/v322/jenq26a.html},
  abstract = {Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by overlaying relevant textual content (e.g., titles, descriptions) directly onto product images to perform additional vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using five state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.}
}
Endnote
%0 Conference Paper
%T Enhancing Multimodal Product Retrieval in E-Commerce by Reversing Typographic Attacks
%A Janet Jenq
%A Hongda Shen
%B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2026
%E Marco Fumero
%E Clementine Domine
%E Zorah Lähner
%E Irene Cannistraci
%E Bo Zhao
%E Alex Williams
%F pmlr-v322-jenq26a
%I PMLR
%P 10--29
%U https://proceedings.mlr.press/v322/jenq26a.html
%V 322
%X Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by overlaying relevant textual content (e.g., titles, descriptions) directly onto product images to perform additional vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using five state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.
APA
Jenq, J. & Shen, H. (2026). Enhancing Multimodal Product Retrieval in E-Commerce by Reversing Typographic Attacks. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:10-29. Available from https://proceedings.mlr.press/v322/jenq26a.html.