An Empirical Study on Unifying JEPA and Language Supervision for Visual Representation Learning

Shixuan Liu, Daniel A Li, Yiwei Lyu, Akhil Kondepudi, Honglak Lee, Todd C Hollon
Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:45-56, 2026.

Abstract

Unified visual representations from language supervision and self-supervision offer the potential to advance general-purpose vision models. In this work, we present an empirical study on unifying joint-embedding predictive architecture (I-JEPA) with language supervision from CLIP for visual representation learning. I-JEPA is unique among self-supervised learning methods in that it is predictive rather than contrastive or generative, enabling faster convergence with less compute while still producing strong representations. Existing works have shown that joint training with language supervision and other visual self-supervision methods yields improved model performance, but combining language supervision with I-JEPA remains unexplored. We introduce CLIPred, a framework that jointly optimizes the two objectives, and systematically evaluate it across zero-shot classification, retrieval, and probing tasks. CLIPred outperforms CLIP-only, I-JEPA-only, and sequentially applying the two, and offers better zero-shot transfer than DINOv2+CLIP with lower training cost, though with trade-offs in probing performance. Our experiments further examine the effects of loss weighting, the amount of data used by each objective, and batch size on our framework. We conduct further analysis on design choices of the architecture and the semantics of the patch embeddings generated by CLIPred. This work provides the first comprehensive assessment of combining I-JEPA and CLIP, highlighting both the benefits and limitations of the framework as well as recommendations on when and how to apply the framework.
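The abstract describes CLIPred as jointly optimizing a CLIP contrastive objective and an I-JEPA predictive objective, with loss weighting as one of the ablated knobs. As a minimal sketch of what such a joint objective looks like, the toy code below combines a CLIP-style loss value with an I-JEPA-style predictive (mean-squared-error) term via a weighting coefficient. All names (`joint_loss`, `l2_pred_loss`, `lam`) and the specific weighting scheme are illustrative assumptions, not the paper's implementation.

```python
def l2_pred_loss(pred, target):
    """I-JEPA-style predictive term: mean squared error between
    predicted and target patch embeddings (toy 1-D vectors here)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def joint_loss(l_clip, l_jepa, lam=0.5):
    """CLIPred-style combination (hypothetical): weighted sum of the
    CLIP contrastive loss and the I-JEPA predictive loss. `lam` plays
    the role of the loss-weighting knob the paper's ablations vary."""
    return lam * l_clip + (1.0 - lam) * l_jepa

# Toy usage: a small predictive error plus a CLIP loss of 2.0,
# weighted equally.
l_jepa = l2_pred_loss([0.2, 0.4], [0.0, 0.0])   # (0.04 + 0.16) / 2 = 0.1
total = joint_loss(l_clip=2.0, l_jepa=l_jepa, lam=0.5)
print(total)  # -> 1.05
```

In practice both terms would be computed on batched tensors by the respective encoders; the sketch only shows how a single scalar coefficient trades off language supervision against self-supervision.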

Cite this Paper


BibTeX
@InProceedings{pmlr-v322-liu26a,
  title = {An Empirical Study on Unifying {JEPA} and Language Supervision for Visual Representation Learning},
  author = {Liu, Shixuan and Li, Daniel A and Lyu, Yiwei and Kondepudi, Akhil and Lee, Honglak and Hollon, Todd C},
  booktitle = {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models},
  pages = {45--56},
  year = {2026},
  editor = {Fumero, Marco and Domine, Clementine and L{\"a}hner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex},
  volume = {322},
  series = {Proceedings of Machine Learning Research},
  month = {06 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v322/main/assets/liu26a/liu26a.pdf},
  url = {https://proceedings.mlr.press/v322/liu26a.html},
  abstract = {Unified visual representations from language supervision and self-supervision offer the potential to advance general-purpose vision models. In this work, we present an empirical study on unifying joint-embedding predictive architecture (I-JEPA) with language supervision from CLIP for visual representation learning. I-JEPA is unique among self-supervised learning methods in that it is predictive rather than contrastive or generative, enabling faster convergence with less compute while still producing strong representations. Existing works have shown that joint training with language supervision and other visual self-supervision methods yields improved model performance, but combining language supervision with I-JEPA remains unexplored. We introduce CLIPred, a framework that jointly optimizes the two objectives, and systematically evaluate it across zero-shot classification, retrieval, and probing tasks. CLIPred outperforms CLIP-only, I-JEPA-only, and sequentially applying the two, and offers better zero-shot transfer than DINOv2+CLIP with lower training cost, though with trade-offs in probing performance. Our experiments further examine the effects of loss weighting, the amount of data used by each objective, and batch size on our framework. We conduct further analysis on design choices of the architecture and the semantics of the patch embeddings generated by CLIPred. This work provides the first comprehensive assessment of combining I-JEPA and CLIP, highlighting both the benefits and limitations of the framework as well as recommendations on when and how to apply the framework.}
}
Endnote
%0 Conference Paper %T An Empirical Study on Unifying JEPA and Language Supervision for Visual Representation Learning %A Shixuan Liu %A Daniel A Li %A Yiwei Lyu %A Akhil Kondepudi %A Honglak Lee %A Todd C Hollon %B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models %C Proceedings of Machine Learning Research %D 2026 %E Marco Fumero %E Clementine Domine %E Zorah Lähner %E Irene Cannistraci %E Bo Zhao %E Alex Williams %F pmlr-v322-liu26a %I PMLR %P 45--56 %U https://proceedings.mlr.press/v322/liu26a.html %V 322 %X Unified visual representations from language supervision and self-supervision offer the potential to advance general-purpose vision models. In this work, we present an empirical study on unifying joint-embedding predictive architecture (I-JEPA) with language supervision from CLIP for visual representation learning. I-JEPA is unique among self-supervised learning methods in that it is predictive rather than contrastive or generative, enabling faster convergence with less compute while still producing strong representations. Existing works have shown that joint training with language supervision and other visual self-supervision methods yields improved model performance, but combining language supervision with I-JEPA remains unexplored. We introduce CLIPred, a framework that jointly optimizes the two objectives, and systematically evaluate it across zero-shot classification, retrieval, and probing tasks. CLIPred outperforms CLIP-only, I-JEPA-only, and sequentially applying the two, and offers better zero-shot transfer than DINOv2+CLIP with lower training cost, though with trade-offs in probing performance. Our experiments further examine the effects of loss weighting, the amount of data used by each objective, and batch size on our framework. We conduct further analysis on design choices of the architecture and the semantics of the patch embeddings generated by CLIPred. This work provides the first comprehensive assessment of combining I-JEPA and CLIP, highlighting both the benefits and limitations of the framework as well as recommendations on when and how to apply the framework.
APA
Liu, S., Li, D.A., Lyu, Y., Kondepudi, A., Lee, H. & Hollon, T.C. (2026). An Empirical Study on Unifying JEPA and Language Supervision for Visual Representation Learning. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:45-56. Available from https://proceedings.mlr.press/v322/liu26a.html.
