TextIT: Inference-Time Representation Alignment for Improved Visual Text Generation in Diffusion Models

Abhikhya Tripathy, Aishwarya Agarwal, Srikrishna Karanam, Balaji Vasan Srinivasan
Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:418-430, 2026.

Abstract

Recent advances in text-to-image diffusion models have shown remarkable performance in generating realistic images from text descriptions. However, high-quality visual text generation in generated images remains a major challenge. Gibberish text generation is particularly problematic when the model has to generate proper nouns and text that is not commonly present in training data. Unlike existing methods to improve visual text generation which are based on data-intensive and time-consuming fine-tuning approaches, we propose an inference-time representation alignment algorithm, TextIT, that does not need additional data or training. First, we propose an inference-time self-attention manipulation loss that exposes and aligns latent intermediate self-attention (SA) representations governing visual text generation with those of correctly-rendered text. Next, we impose fine-grained control over the generated text by aligning character-wise control points, obtained through self-attention map vectorization, with ground truth character control points. We provide evidence that inference-time representational manipulation enables controllable and interpretable improvements in text-to-image generation, validating our method with character and word-level visual text generation results that retain the overall generative diversity of diffusion models.
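The page does not reproduce the paper's actual losses, which operate on internal diffusion states. Purely as an illustration of the generic pattern the abstract describes (align generated self-attention maps with reference maps by updating the latent at inference time), here is a toy numpy sketch; `toy_sa`, the latent dimensions, and the finite-difference guidance step are all hypothetical stand-ins, not the paper's method:

```python
import numpy as np

def sa_alignment_loss(sa_gen, sa_ref):
    """MSE between a generated self-attention map and a reference map
    recorded from correctly rendered text (both flattened to vectors)."""
    return float(np.mean((sa_gen - sa_ref) ** 2))

def guidance_step(latent, sa_from_latent, sa_ref, lr=0.5, eps=1e-4):
    """One inference-time latent update that lowers the alignment loss.
    Finite differences stand in for autograd in this toy sketch."""
    base = sa_alignment_loss(sa_from_latent(latent), sa_ref)
    grad = np.zeros_like(latent)
    for i in range(latent.size):
        bumped = latent.copy()
        bumped.flat[i] += eps
        grad.flat[i] = (sa_alignment_loss(sa_from_latent(bumped), sa_ref) - base) / eps
    return latent - lr * grad

# Toy "self-attention": softmax over a fixed linear projection of the latent.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))

def toy_sa(latent):
    z = W @ latent
    e = np.exp(z - z.max())
    return e / e.sum()

latent = rng.normal(size=4)
target = toy_sa(rng.normal(size=4))  # plays the role of correct-text SA
before = sa_alignment_loss(toy_sa(latent), target)
for _ in range(20):
    latent = guidance_step(latent, toy_sa, target)
after = sa_alignment_loss(toy_sa(latent), target)
print(after < before)
```

The second loss in the abstract (matching character-wise control points obtained by vectorizing the SA maps against ground-truth glyph control points) would follow the same update pattern, with the MSE replaced by a point-set distance over the extracted control points.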

Cite this Paper

BibTeX
@InProceedings{pmlr-v322-tripathy26a,
  title     = {Text{IT}: Inference-Time Representation Alignment for Improved Visual Text Generation in Diffusion Models},
  author    = {Tripathy, Abhikhya and Agarwal, Aishwarya and Karanam, Srikrishna and Srinivasan, Balaji Vasan},
  booktitle = {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models},
  pages     = {418--430},
  year      = {2026},
  editor    = {Fumero, Marco and Domine, Clementine and L{\"a}hner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex},
  volume    = {322},
  series    = {Proceedings of Machine Learning Research},
  month     = {06 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v322/main/assets/tripathy26a/tripathy26a.pdf},
  url       = {https://proceedings.mlr.press/v322/tripathy26a.html},
  abstract  = {Recent advances in text-to-image diffusion models have shown remarkable performance in generating realistic images from text descriptions. However, high-quality visual text generation in generated images remains a major challenge. Gibberish text generation is particularly problematic when the model has to generate proper nouns and text that is not commonly present in training data. Unlike existing methods to improve visual text generation which are based on data-intensive and time-consuming fine-tuning approaches, we propose an inference-time representation alignment algorithm, TextIT, that does not need additional data or training. First, we propose an inference-time self-attention manipulation loss that exposes and aligns latent intermediate self-attention (SA) representations governing visual text generation with those of correctly-rendered text. Next, we impose fine-grained control over the generated text by aligning character-wise control points, obtained through self-attention map vectorization, with ground truth character control points. We provide evidence that inference-time representational manipulation enables controllable and interpretable improvements in text-to-image generation, validating our method with character and word-level visual text generation results that retain the overall generative diversity of diffusion models.}
}
Endnote
%0 Conference Paper
%T TextIT: Inference-Time Representation Alignment for Improved Visual Text Generation in Diffusion Models
%A Abhikhya Tripathy
%A Aishwarya Agarwal
%A Srikrishna Karanam
%A Balaji Vasan Srinivasan
%B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2026
%E Marco Fumero
%E Clementine Domine
%E Zorah Lähner
%E Irene Cannistraci
%E Bo Zhao
%E Alex Williams
%F pmlr-v322-tripathy26a
%I PMLR
%P 418--430
%U https://proceedings.mlr.press/v322/tripathy26a.html
%V 322
%X Recent advances in text-to-image diffusion models have shown remarkable performance in generating realistic images from text descriptions. However, high-quality visual text generation in generated images remains a major challenge. Gibberish text generation is particularly problematic when the model has to generate proper nouns and text that is not commonly present in training data. Unlike existing methods to improve visual text generation which are based on data-intensive and time-consuming fine-tuning approaches, we propose an inference-time representation alignment algorithm, TextIT, that does not need additional data or training. First, we propose an inference-time self-attention manipulation loss that exposes and aligns latent intermediate self-attention (SA) representations governing visual text generation with those of correctly-rendered text. Next, we impose fine-grained control over the generated text by aligning character-wise control points, obtained through self-attention map vectorization, with ground truth character control points. We provide evidence that inference-time representational manipulation enables controllable and interpretable improvements in text-to-image generation, validating our method with character and word-level visual text generation results that retain the overall generative diversity of diffusion models.
APA
Tripathy, A., Agarwal, A., Karanam, S. & Srinivasan, B.V. (2026). TextIT: Inference-Time Representation Alignment for Improved Visual Text Generation in Diffusion Models. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:418-430. Available from https://proceedings.mlr.press/v322/tripathy26a.html.
