Continual Vision-Language Representation Learning with Off-Diagonal Information

Zixuan Ni; Longhui Wei; Siliang Tang; Yueting Zhuang; Qi Tian

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:26129-26149, 2023.

Abstract

Large-scale multi-modal contrastive learning frameworks like CLIP typically require a large number of image-text samples for training. However, these samples are always collected continuously in real scenarios. This paper discusses the feasibility of continual CLIP training using streaming data. Unlike continual learning based on self-supervised learning methods for pure images, which is empirically robust against catastrophic forgetting, CLIP’s performance degeneration in the continual setting is significant and non-negligible. By analyzing the changes in the model’s representation space during continual CLIP training from a spatial geometry perspective, we explore and summarize these spatial variations as Spatial Disorder (SD), which can be divided into Intra-modal Rotation and Inter-modal Deviation. Moreover, we empirically and theoretically demonstrate how SD leads to a performance decline for CLIP on cross-modal retrieval tasks. To alleviate SD, we propose a new continual vision-language representation learning framework, Mod-X: Maintain off-diagonal information-matriX. By selectively aligning the off-diagonal information distribution of contrastive matrices, Mod-X improves the capability of the multi-modal model by maintaining the multi-modal representation space alignment on the old data domain while continuously fitting the new training data domain. Experiments on commonly used datasets with different scales and scopes have demonstrated the effectiveness of our method.
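
The idea of aligning the distributions held in contrastive matrices can be illustrated with a small sketch. This is not the paper's implementation: the function names, the temperature value, and the use of a plain row-wise KL divergence are all assumptions made for illustration, and the "selective" part of the alignment described in the abstract is omitted. The sketch builds an image-text similarity matrix for a batch under an old and a new model, turns each row into a distribution (which includes the off-diagonal entries), and measures how far the new model's rows have drifted from the old model's.

```python
import math


def contrastive_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text in a batch."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def row_softmax(matrix, temperature=0.07):
    """Turn each row of similarities into a probability distribution.

    The temperature is an assumed value, not one taken from the paper.
    """
    out = []
    for row in matrix:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp((x - m) / temperature) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def alignment_loss(old_matrix, new_matrix, temperature=0.07):
    """Mean row-wise KL(old || new) between two contrastive matrices.

    Hypothetical stand-in for an alignment term: it penalizes the new
    model when its similarity distribution over a batch drifts away
    from the old model's distribution.
    """
    p_old = row_softmax(old_matrix, temperature)
    p_new = row_softmax(new_matrix, temperature)
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p_row, q_row in zip(p_old, p_new):
        total += sum(
            p * math.log((p + eps) / (q + eps)) for p, q in zip(p_row, q_row)
        )
    return total / len(p_old)
```

In a continual-training loop, a term like `alignment_loss` would typically be added to the usual contrastive objective on the new data, with the old model frozen; how the two terms are weighted is left open here.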

Cite this Paper

BibTeX

@InProceedings{pmlr-v202-ni23c,
  title = 	 {Continual Vision-Language Representation Learning with Off-Diagonal Information},
  author =       {Ni, Zixuan and Wei, Longhui and Tang, Siliang and Zhuang, Yueting and Tian, Qi},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {26129--26149},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/ni23c/ni23c.pdf},
  url = 	 {https://proceedings.mlr.press/v202/ni23c.html},
  abstract = 	 {Large-scale multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training. However, these samples are always collected continuously in real scenarios. This paper discusses the feasibility of continual CLIP training using streaming data. Unlike continual learning based on self-supervised learning methods for pure images, which is empirically robust against catastrophic forgetting, CLIP’s performance degeneration in the continual setting is significant and non-neglectable. By analyzing the changes in the model’s representation space during continual CLIP training from a spatial geometry perspective, we explore and summarize these spatial variations as Spatial Disorder (SD), which can be divided into Intra-modal Rotation and Inter-modal Deviation. Moreover, we empirically and theoretically demonstrate how SD leads to a performance decline for CLIP on cross-modal retrieval tasks. To alleviate SD, we propose a new continual vision-language representation learning framework Mod-X: Maintain off-diagonal information-matriX. By selectively aligning the off-diagonal information distribution of contrastive matrices, the Mod-X improves the capability of the multi-modal model by maintaining the multi-modal representation space alignment on the old data domain during continuously fitting the new training data domain. Experiments on commonly used datasets with different scales and scopes have demonstrated the effectiveness of our method.}
}

Endnote

%0 Conference Paper
%T Continual Vision-Language Representation Learning with Off-Diagonal Information
%A Zixuan Ni
%A Longhui Wei
%A Siliang Tang
%A Yueting Zhuang
%A Qi Tian
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-ni23c
%I PMLR
%P 26129--26149
%U https://proceedings.mlr.press/v202/ni23c.html
%V 202
%X Large-scale multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training. However, these samples are always collected continuously in real scenarios. This paper discusses the feasibility of continual CLIP training using streaming data. Unlike continual learning based on self-supervised learning methods for pure images, which is empirically robust against catastrophic forgetting, CLIP’s performance degeneration in the continual setting is significant and non-neglectable. By analyzing the changes in the model’s representation space during continual CLIP training from a spatial geometry perspective, we explore and summarize these spatial variations as Spatial Disorder (SD), which can be divided into Intra-modal Rotation and Inter-modal Deviation. Moreover, we empirically and theoretically demonstrate how SD leads to a performance decline for CLIP on cross-modal retrieval tasks. To alleviate SD, we propose a new continual vision-language representation learning framework Mod-X: Maintain off-diagonal information-matriX. By selectively aligning the off-diagonal information distribution of contrastive matrices, the Mod-X improves the capability of the multi-modal model by maintaining the multi-modal representation space alignment on the old data domain during continuously fitting the new training data domain. Experiments on commonly used datasets with different scales and scopes have demonstrated the effectiveness of our method.

APA

Ni, Z., Wei, L., Tang, S., Zhuang, Y. & Tian, Q. (2023). Continual Vision-Language Representation Learning with Off-Diagonal Information. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:26129-26149. Available from https://proceedings.mlr.press/v202/ni23c.html.
