Cross-Modal Representational Alignment with LLM Priors for Image Generation

Mykola Vysotskyi, Zahar Kohut, Anna-Alina Bondarets, Taras Rumezhak, Volodymyr Karpiv
Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:235-247, 2026.

Abstract

Prior works have investigated the integration of large language models (LLMs) with rectified flow for image synthesis, but systematic studies of this integration remain scarce. In this study, we examine how controlling the interaction between stochastic and semantic inputs during encoding, while integrating them during decoding, influences the alignment between noised latents and LLM hidden states. Our investigation shows that architectural refinements, such as dual-stream encoding and single-stream decoding, can accelerate training and improve image quality relative to LLM-adapted rectified flow baselines by enhancing representational similarity between text and visual domains. We evaluate our approach on standard image benchmarks and observe gains in both training speed and output detail preservation, indicating that structural choices in the integration of LLM features matter for cross-modal representational alignment in generative modeling. Beyond empirical improvements, our findings contribute to understanding how foundation models trained on text can develop representations that align with visual domains, revealing insights into the emergence of similar representational structures across distinct modalities. These results highlight a promising direction at the intersection of LLMs, rectified flow, and cross-modal representational analysis and motivate further exploration of unified representation learning.
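The "dual-stream encoding, single-stream decoding" pattern mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's architecture: all names (`W_img`, `W_txt`, `W_joint`), shapes, and the use of plain linear maps are assumptions made purely for illustration; the point is only that each modality is first encoded with its own parameters and the fused sequence is then decoded jointly.

```python
import numpy as np

# Toy sketch (hypothetical, not the paper's model):
# dual-stream encoding = separate weights per modality,
# single-stream decoding = one joint transform over the fused sequence.
rng = np.random.default_rng(0)
d = 8                                   # shared model width (illustrative)
img_tokens = rng.normal(size=(4, d))    # stand-in for noised image latents
txt_tokens = rng.normal(size=(6, d))    # stand-in for LLM hidden states

# Dual-stream encoding: each stream has its own projection.
W_img = rng.normal(size=(d, d))
W_txt = rng.normal(size=(d, d))
h_img = img_tokens @ W_img
h_txt = txt_tokens @ W_txt

# Single-stream decoding: concatenate streams, process them jointly.
fused = np.concatenate([h_img, h_txt], axis=0)   # shape (10, d)
W_joint = rng.normal(size=(d, d))
decoded = fused @ W_joint

print(decoded.shape)  # → (10, 8)
```

The design choice being sketched is simply that modality-specific parameters handle encoding while decoding operates on a single fused token sequence; any attention, normalization, or flow components of the actual model are omitted.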

Cite this Paper


BibTeX
@InProceedings{pmlr-v322-vysotskyi26a,
  title     = {Cross-Modal Representational Alignment with {LLM} Priors for Image Generation},
  author    = {Vysotskyi, Mykola and Kohut, Zahar and Bondarets, Anna-Alina and Rumezhak, Taras and Karpiv, Volodymyr},
  booktitle = {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models},
  pages     = {235--247},
  year      = {2026},
  editor    = {Fumero, Marco and Domine, Clementine and L{\"a}hner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex},
  volume    = {322},
  series    = {Proceedings of Machine Learning Research},
  month     = {06 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v322/main/assets/vysotskyi26a/vysotskyi26a.pdf},
  url       = {https://proceedings.mlr.press/v322/vysotskyi26a.html},
  abstract  = {Prior works have investigated the integration of large language models (LLMs) with rectified flow for image synthesis, but systematic studies of this integration remain scarce. In this study, we examine how controlling the interaction between stochastic and semantic inputs during encoding, while integrating them during decoding, influences the alignment between noised latents and LLM hidden states. Our investigation shows that architectural refinements, such as dual-stream encoding and single-stream decoding, can accelerate training and improve image quality relative to LLM-adapted rectified flow baselines by enhancing representational similarity between text and visual domains. We evaluate our approach on standard image benchmarks and observe gains in both training speed and output detail preservation, indicating that structural choices in the integration of LLM features matter for cross-modal representational alignment in generative modeling. Beyond empirical improvements, our findings contribute to understanding how foundation models trained on text can develop representations that align with visual domains, revealing insights into the emergence of similar representational structures across distinct modalities. These results highlight a promising direction at the intersection of LLMs, rectified flow, and cross-modal representational analysis and motivate further exploration of unified representation learning.}
}
Endnote
%0 Conference Paper
%T Cross-Modal Representational Alignment with LLM Priors for Image Generation
%A Mykola Vysotskyi
%A Zahar Kohut
%A Anna-Alina Bondarets
%A Taras Rumezhak
%A Volodymyr Karpiv
%B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2026
%E Marco Fumero
%E Clementine Domine
%E Zorah Lähner
%E Irene Cannistraci
%E Bo Zhao
%E Alex Williams
%F pmlr-v322-vysotskyi26a
%I PMLR
%P 235--247
%U https://proceedings.mlr.press/v322/vysotskyi26a.html
%V 322
%X Prior works have investigated the integration of large language models (LLMs) with rectified flow for image synthesis, but systematic studies of this integration remain scarce. In this study, we examine how controlling the interaction between stochastic and semantic inputs during encoding, while integrating them during decoding, influences the alignment between noised latents and LLM hidden states. Our investigation shows that architectural refinements, such as dual-stream encoding and single-stream decoding, can accelerate training and improve image quality relative to LLM-adapted rectified flow baselines by enhancing representational similarity between text and visual domains. We evaluate our approach on standard image benchmarks and observe gains in both training speed and output detail preservation, indicating that structural choices in the integration of LLM features matter for cross-modal representational alignment in generative modeling. Beyond empirical improvements, our findings contribute to understanding how foundation models trained on text can develop representations that align with visual domains, revealing insights into the emergence of similar representational structures across distinct modalities. These results highlight a promising direction at the intersection of LLMs, rectified flow, and cross-modal representational analysis and motivate further exploration of unified representation learning.
APA
Vysotskyi, M., Kohut, Z., Bondarets, A., Rumezhak, T. & Karpiv, V. (2026). Cross-Modal Representational Alignment with LLM Priors for Image Generation. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:235-247. Available from https://proceedings.mlr.press/v322/vysotskyi26a.html.