SemCLIP: A Semantic Memory-Aligned Vision Language Model

Tanveer F Syeda-Mahmood, Niharika S. D’Souza, Ken C. L. Wong, Raziuddin Mahmood, Luyao Shi, Ashutosh Jadhav, Satyananda Kashyap, David Beymer
Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:330-341, 2026.

Abstract

Vision-language models (VLMs) bring image and textual representations close together in a joint embedding space, which is useful for tagging and retrieval from content stores. However, such associations are not very stable: a synonymous textual query does not retrieve the same set of images, nor even a set with a high degree of overlap. This instability is due to the absence of linkages between semantically related concepts in vision-language models. In contrast, the episodic memory store in the brain has linkages to the semantic conceptual memory subsystem, which help in both the formation and recall of memories. In this paper, we exploit this paradigm to link a VLM to a semantic memory, thereby producing a new semantic vision-language model called SemCLIP. Specifically, we develop a semantic memory model for the language of object-naming nouns that reflects their semantic similarity. We then link the vision-language model to the semantic memory model through a semantic alignment transform. This yields a richer and more stable understanding of concepts by bringing synonymous visual concepts and their associated images closer together. Both the semantic memory model and the alignment transform can be learned from word-knowledge sources, thus avoiding large-scale retraining of VLMs on real-world image-text pairs. The resulting model is shown to outperform existing embedding models on semantic similarity and downstream retrieval tasks across multiple datasets.
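To make the alignment idea concrete, below is a minimal sketch of one way such a semantic alignment transform could work. The paper's actual formulation is not given on this page, so the linear ridge-regression fit, the embedding dimensions, and all variable names are illustrative assumptions, with random arrays standing in for real VLM and semantic-memory embeddings.

# Hypothetical sketch (not the paper's implementation): learn a linear
# "semantic alignment transform" W that maps a VLM's text embeddings of
# object-naming nouns onto a semantic-memory space whose geometry reflects
# word-knowledge similarity (e.g., derived from a lexical resource).
import numpy as np

rng = np.random.default_rng(0)
n_nouns, d_vlm, d_sem = 1000, 512, 300  # assumed sizes

# Stand-ins for real data: VLM text embeddings of the nouns, and their
# coordinates in a semantic memory space built from a knowledge source.
X = rng.standard_normal((n_nouns, d_vlm))  # e.g., VLM text-encoder outputs
S = rng.standard_normal((n_nouns, d_sem))  # e.g., knowledge-derived embeddings

# Ridge-regularized least squares: W minimizes ||X W - S||_F^2 + lam ||W||_F^2.
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(d_vlm), X.T @ S)

def semantically_align(vlm_embedding: np.ndarray) -> np.ndarray:
    """Project a VLM embedding (text or image) into the semantic space,
    so synonymous concepts land near each other before retrieval."""
    z = vlm_embedding @ W
    return z / np.linalg.norm(z)

# Retrieval example: rank images by cosine similarity in the aligned space.
query = semantically_align(X[0])
image_embs = rng.standard_normal((50, d_vlm))  # stand-in image embeddings
aligned = image_embs @ W
aligned /= np.linalg.norm(aligned, axis=1, keepdims=True)
print("top-5 image indices:", np.argsort(-(aligned @ query))[:5])

Because the transform is fit from word-knowledge pairs alone, retraining the underlying VLM on image-text data is unnecessary, which is the property the abstract emphasizes.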

Cite this Paper

BibTeX
@InProceedings{pmlr-v322-syeda-mahmood26a,
  title     = {Sem{CLIP}: A Semantic Memory-Aligned Vision Language Model},
  author    = {Syeda-Mahmood, Tanveer F and D'Souza, Niharika S. and Wong, Ken C. L. and Mahmood, Raziuddin and Shi, Luyao and Jadhav, Ashutosh and Kashyap, Satyananda and Beymer, David},
  booktitle = {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models},
  pages     = {330--341},
  year      = {2026},
  editor    = {Fumero, Marco and Domine, Clementine and L{\"a}hner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex},
  volume    = {322},
  series    = {Proceedings of Machine Learning Research},
  month     = {06 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v322/main/assets/syeda-mahmood26a/syeda-mahmood26a.pdf},
  url       = {https://proceedings.mlr.press/v322/syeda-mahmood26a.html}
}
Endnote
%0 Conference Paper
%T SemCLIP: A Semantic Memory-Aligned Vision Language Model
%A Tanveer F Syeda-Mahmood
%A Niharika S. D’Souza
%A Ken C. L. Wong
%A Raziuddin Mahmood
%A Luyao Shi
%A Ashutosh Jadhav
%A Satyananda Kashyap
%A David Beymer
%B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2026
%E Marco Fumero
%E Clementine Domine
%E Zorah Lähner
%E Irene Cannistraci
%E Bo Zhao
%E Alex Williams
%F pmlr-v322-syeda-mahmood26a
%I PMLR
%P 330--341
%U https://proceedings.mlr.press/v322/syeda-mahmood26a.html
%V 322
APA
Syeda-Mahmood, T.F., D’Souza, N.S., Wong, K.C.L., Mahmood, R., Shi, L., Jadhav, A., Kashyap, S. & Beymer, D. (2026). SemCLIP: A Semantic Memory-Aligned Vision Language Model. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:330-341. Available from https://proceedings.mlr.press/v322/syeda-mahmood26a.html.
