DS-VLM: Diffusion Supervision Vision Language Model

Zhen Sun, Yunhang Shen, Jie Li, Xing Sun, Pingyang Dai, Liujuan Cao, Rongrong Ji
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:57667-57679, 2025.

Abstract

Vision-Language Models (VLMs) face two critical limitations in visual representation learning: degraded supervision due to information loss during gradient propagation, and the inherent semantic sparsity of textual supervision compared to visual data. We propose the Diffusion Supervision Vision-Language Model (DS-VLM), a plug-and-play framework that introduces diffusion-based direct supervision for vision-language alignment. By reconstructing input images through a diffusion model conditioned on outputs of the visual encoder and the connector, our method establishes a short-path gradient propagation channel from pixel space to visual features. This approach simultaneously preserves high-level semantic alignment through conventional text supervision while enhancing visual feature quality via pixel-level reconstruction constraints. Extensive experiments conducted across various visual encoders and LLMs of different scales demonstrate the effectiveness of our approach.
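
To make the dual-supervision idea concrete, below is a minimal sketch of how such a training objective could be wired up in PyTorch. It is an illustrative assumption, not the authors' released code: the module names (vision_encoder, connector, llm, diffusion_decoder), the llm(...).loss interface, the scheduler-style add_noise helper, and the weighting term lambda_ds are all hypothetical.

    # Hypothetical sketch of a diffusion-supervised VLM objective.
    # All module names and interfaces are illustrative assumptions;
    # the paper's exact architecture and losses may differ.
    import torch
    import torch.nn.functional as F

    def ds_vlm_loss(images, input_ids, labels,
                    vision_encoder, connector, llm, diffusion_decoder,
                    lambda_ds=1.0):
        # Standard VLM path: encode the image, project into the LLM space.
        vis_feats = vision_encoder(images)      # (B, N, D_vis)
        vis_tokens = connector(vis_feats)       # (B, N, D_llm)

        # Long-path supervision: conventional next-token text loss,
        # with gradients reaching the encoder only through the LLM.
        text_loss = llm(vis_tokens, input_ids, labels=labels).loss

        # Short-path supervision: reconstruct the input image with a
        # diffusion model conditioned on the visual features, so
        # pixel-level gradients reach the encoder and connector
        # directly, bypassing the LLM.
        t = torch.randint(0, diffusion_decoder.num_timesteps,
                          (images.size(0),), device=images.device)
        noise = torch.randn_like(images)
        noisy = diffusion_decoder.add_noise(images, noise, t)
        pred_noise = diffusion_decoder(noisy, t, cond=vis_tokens)
        diff_loss = F.mse_loss(pred_noise, noise)  # epsilon-prediction loss

        return text_loss + lambda_ds * diff_loss

The point the sketch captures is the short gradient path: diff_loss backpropagates from pixel space into vis_tokens without traversing the LLM, while text_loss continues to enforce high-level semantic alignment.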

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-sun25p,
  title     = {{DS}-{VLM}: Diffusion Supervision Vision Language Model},
  author    = {Sun, Zhen and Shen, Yunhang and Li, Jie and Sun, Xing and Dai, Pingyang and Cao, Liujuan and Ji, Rongrong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {57667--57679},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/sun25p/sun25p.pdf},
  url       = {https://proceedings.mlr.press/v267/sun25p.html}
}
Endnote
%0 Conference Paper
%T DS-VLM: Diffusion Supervision Vision Language Model
%A Zhen Sun
%A Yunhang Shen
%A Jie Li
%A Xing Sun
%A Pingyang Dai
%A Liujuan Cao
%A Rongrong Ji
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-sun25p
%I PMLR
%P 57667--57679
%U https://proceedings.mlr.press/v267/sun25p.html
%V 267
APA
Sun, Z., Shen, Y., Li, J., Sun, X., Dai, P., Cao, L. & Ji, R. (2025). DS-VLM: Diffusion Supervision Vision Language Model. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:57667-57679. Available from https://proceedings.mlr.press/v267/sun25p.html.