Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models

Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas
Proceedings of the 2025 Conference on Applied Machine Learning for Information Security, PMLR 299:28-41, 2025.

Abstract

The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets lean heavily towards text-only prompts, leaving visual vulnerabilities under-evaluated. To address this gap, we propose Text2VLM, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in current models’ alignment, as well as a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, confirming that its extracted salient concepts, text summaries, and output classifications align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM advances the safe deployment of VLMs in diverse, real-world applications. We have made Text2VLM available for others to use, along with the code to replicate the results in this paper.

Cite this Paper


BibTeX
@InProceedings{pmlr-v299-downer25a,
  title     = {Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models},
  author    = {Downer, Gabriel and Craven, Sean and Ruck, Damian and Thomas, Jake},
  booktitle = {Proceedings of the 2025 Conference on Applied Machine Learning for Information Security},
  pages     = {28--41},
  year      = {2025},
  editor    = {Raff, Edward and Rudd, Ethan M.},
  volume    = {299},
  series    = {Proceedings of Machine Learning Research},
  month     = {22--24 Oct},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v299/main/assets/downer25a/downer25a.pdf},
  url       = {https://proceedings.mlr.press/v299/downer25a.html},
  abstract  = {The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets lean heavily towards text-only prompts, leaving visual vulnerabilities under-evaluated. To address this gap, we propose Text2VLM, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in current models’ alignment, as well as a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, confirming that its extracted salient concepts, text summaries, and output classifications align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM advances the safe deployment of VLMs in diverse, real-world applications. We have made Text2VLM available for others to use, along with the code to replicate the results in this paper.}
}
Endnote
%0 Conference Paper
%T Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models
%A Gabriel Downer
%A Sean Craven
%A Damian Ruck
%A Jake Thomas
%B Proceedings of the 2025 Conference on Applied Machine Learning for Information Security
%C Proceedings of Machine Learning Research
%D 2025
%E Edward Raff
%E Ethan M. Rudd
%F pmlr-v299-downer25a
%I PMLR
%P 28--41
%U https://proceedings.mlr.press/v299/downer25a.html
%V 299
%X The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets lean heavily towards text-only prompts, leaving visual vulnerabilities under-evaluated. To address this gap, we propose Text2VLM, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in current models’ alignment, as well as a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, confirming that its extracted salient concepts, text summaries, and output classifications align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM advances the safe deployment of VLMs in diverse, real-world applications. We have made Text2VLM available for others to use, along with the code to replicate the results in this paper.
APA
Downer, G., Craven, S., Ruck, D. & Thomas, J. (2025). Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models. Proceedings of the 2025 Conference on Applied Machine Learning for Information Security, in Proceedings of Machine Learning Research 299:28-41. Available from https://proceedings.mlr.press/v299/downer25a.html.
