Enhancing Multimodal Product Retrieval in E-Commerce by Reversing Typographic Attacks

Janet Jenq, Hongda Shen
Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:10-29, 2026.

Abstract

Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by overlaying relevant textual content (e.g., titles, descriptions) directly onto product images to perform additional vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using five state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.
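As a rough illustration of the idea described in the abstract, the sketch below renders a product title onto the product image before encoding it with a CLIP-style model. It is a minimal sketch only, not the authors' implementation: the overlay layout (a white band appended below the image), the default font, the checkpoint name, and the file/variable names are illustrative assumptions.

    from PIL import Image, ImageDraw
    import torch
    from transformers import CLIPModel, CLIPProcessor

    def overlay_metadata(image, text, band_height=40):
        # Append a white band below the product image and render the metadata text into it.
        canvas = Image.new("RGB", (image.width, image.height + band_height), "white")
        canvas.paste(image, (0, 0))
        draw = ImageDraw.Draw(canvas)
        draw.text((5, image.height + 5), text, fill="black")  # default PIL font
        return canvas

    # Hypothetical checkpoint and inputs, chosen for illustration.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("sneaker.jpg").convert("RGB")   # placeholder product image
    title = "Air Jordan 1 Retro High OG"               # placeholder product title

    augmented = overlay_metadata(image, title)
    inputs = processor(images=augmented, return_tensors="pt")
    with torch.no_grad():
        image_emb = model.get_image_features(**inputs)
    # L2-normalize so dot products give cosine similarity for retrieval.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

In a retrieval setting, the normalized embedding of the metadata-augmented image would replace the plain image embedding when indexing or querying the product catalog.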

Cite this Paper


BibTeX
@InProceedings{pmlr-v322-jenq26a,
  title = {Enhancing Multimodal Product Retrieval in E-Commerce by Reversing Typographic Attacks},
  author = {Jenq, Janet and Shen, Hongda},
  booktitle = {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models},
  pages = {10--29},
  year = {2026},
  editor = {Fumero, Marco and Domine, Clementine and L{\"a}hner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex},
  volume = {322},
  series = {Proceedings of Machine Learning Research},
  month = {06 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v322/main/assets/jenq26a/jenq26a.pdf},
  url = {https://proceedings.mlr.press/v322/jenq26a.html},
  abstract = {Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by overlaying relevant textual content (e.g., titles, descriptions) directly onto product images to perform additional vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using five state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.}
}
Endnote
%0 Conference Paper
%T Enhancing Multimodal Product Retrieval in E-Commerce by Reversing Typographic Attacks
%A Janet Jenq
%A Hongda Shen
%B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2026
%E Marco Fumero
%E Clementine Domine
%E Zorah Lähner
%E Irene Cannistraci
%E Bo Zhao
%E Alex Williams
%F pmlr-v322-jenq26a
%I PMLR
%P 10--29
%U https://proceedings.mlr.press/v322/jenq26a.html
%V 322
%X Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by overlaying relevant textual content (e.g., titles, descriptions) directly onto product images to perform additional vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using five state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.
APA
Jenq, J. & Shen, H. (2026). Enhancing Multimodal Product Retrieval in E-Commerce by Reversing Typographic Attacks. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:10-29. Available from https://proceedings.mlr.press/v322/jenq26a.html.