MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization

Yu Zhang, Qi Zhang, Zixuan Gong, Yiwei Shi, Yepeng Liu, Duoqian Miao, Yang Liu, Ke Liu, Kun Yi, Wei Fan, Liang Hu, Changwei Wang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:60288-60304, 2024.

Abstract

Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, leading to rapid advancements in multimodal studies. However, CLIP faces a notable challenge in terms of inefficient data utilization. It relies on a single contrastive supervision for each image-text pair during representation learning, disregarding a substantial amount of valuable information that could offer richer supervision. Additionally, the retention of non-informative tokens leads to increased computational demands and time costs, particularly in CLIP’s ViT image encoder. To address these issues, we propose Multi-Perspective Language-Image Pretraining (MLIP). In MLIP, we leverage the frequency transform’s sensitivity to both high and low-frequency variations, which complements the spatial domain’s sensitivity limited to low-frequency variations only. By incorporating frequency transforms and token-level alignment, we expand CLIP’s single supervision into multi-domain and multi-level supervision, enabling a more thorough exploration of informative image features. Additionally, we introduce a token merging method guided by comprehensive semantics from the frequency and spatial domains. This allows us to merge tokens into multi-granularity tokens with a controllable compression rate to accelerate CLIP. Extensive experiments validate the effectiveness of our design.
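The abstract's two mechanisms can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example, not MLIP's actual implementation: `freq_tokens` produces a frequency-domain view of patch tokens via an FFT (a simple stand-in for the paper's frequency transform, which operates on image features), and `merge_tokens` greedily averages the most similar token pairs until a controllable keep ratio is reached (a naive similarity-guided merge, not the paper's exact frequency-and-spatial-guided procedure). All function names are illustrative assumptions.

```python
# Hypothetical sketch of MLIP's two ideas: a frequency-domain view of patch
# tokens for extra supervision, and similarity-guided token merging with a
# controllable compression rate. Not the authors' code.
import torch


def freq_tokens(patches: torch.Tensor) -> torch.Tensor:
    """Map spatial patch tokens (B, N, D) to frequency-domain tokens.

    A 1D FFT over the embedding dimension stands in for the paper's
    frequency transform; the magnitude spectrum is kept as a real-valued
    feature that reflects both high- and low-frequency variation.
    """
    spectrum = torch.fft.fft(patches, dim=-1)
    return spectrum.abs()


def merge_tokens(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Greedily merge the most similar token pair until only
    round(keep_ratio * N) tokens remain. O(N^2) per step; fine for a sketch."""
    b, n, d = tokens.shape
    n_keep = max(1, int(round(keep_ratio * n)))
    out = []
    for x in tokens:  # process each batch element, shape (N, D)
        while x.shape[0] > n_keep:
            # Pairwise cosine similarity, (N, N); ignore self-similarity.
            sim = torch.nn.functional.cosine_similarity(
                x.unsqueeze(1), x.unsqueeze(0), dim=-1
            )
            sim.fill_diagonal_(-float("inf"))
            idx = torch.argmax(sim).item()
            i, j = divmod(idx, x.shape[0])
            merged = (x[i] + x[j]) / 2  # average the closest pair
            keep = [k for k in range(x.shape[0]) if k not in (i, j)]
            x = torch.cat([x[keep], merged.unsqueeze(0)], dim=0)
        out.append(x)
    return torch.stack(out)


if __name__ == "__main__":
    patches = torch.randn(2, 16, 64)      # (batch, tokens, dim)
    freq = freq_tokens(patches)           # frequency-domain supervision view
    compact = merge_tokens(patches, 0.5)  # 16 tokens -> 8 tokens
    print(freq.shape, compact.shape)      # (2, 16, 64), (2, 8, 64)
```

A halved token count roughly quarters the cost of self-attention in the ViT encoder, which is the source of the speedup the abstract refers to; the keep ratio is the controllable compression rate.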

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-zhang24cb,
  title     = {{MLIP}: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization},
  author    = {Zhang, Yu and Zhang, Qi and Gong, Zixuan and Shi, Yiwei and Liu, Yepeng and Miao, Duoqian and Liu, Yang and Liu, Ke and Yi, Kun and Fan, Wei and Hu, Liang and Wang, Changwei},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {60288--60304},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/zhang24cb/zhang24cb.pdf},
  url       = {https://proceedings.mlr.press/v235/zhang24cb.html},
  abstract  = {Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, leading to rapid advancements in multimodal studies. However, CLIP faces a notable challenge in terms of inefficient data utilization. It relies on a single contrastive supervision for each image-text pair during representation learning, disregarding a substantial amount of valuable information that could offer richer supervision. Additionally, the retention of non-informative tokens leads to increased computational demands and time costs, particularly in CLIP’s ViT image encoder. To address these issues, we propose Multi-Perspective Language-Image Pretraining (MLIP). In MLIP, we leverage the frequency transform’s sensitivity to both high and low-frequency variations, which complements the spatial domain’s sensitivity limited to low-frequency variations only. By incorporating frequency transforms and token-level alignment, we expand CLIP’s single supervision into multi-domain and multi-level supervision, enabling a more thorough exploration of informative image features. Additionally, we introduce a token merging method guided by comprehensive semantics from the frequency and spatial domains. This allows us to merge tokens into multi-granularity tokens with a controllable compression rate to accelerate CLIP. Extensive experiments validate the effectiveness of our design.}
}
Endnote
%0 Conference Paper
%T MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization
%A Yu Zhang
%A Qi Zhang
%A Zixuan Gong
%A Yiwei Shi
%A Yepeng Liu
%A Duoqian Miao
%A Yang Liu
%A Ke Liu
%A Kun Yi
%A Wei Fan
%A Liang Hu
%A Changwei Wang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-zhang24cb
%I PMLR
%P 60288--60304
%U https://proceedings.mlr.press/v235/zhang24cb.html
%V 235
%X Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, leading to rapid advancements in multimodal studies. However, CLIP faces a notable challenge in terms of inefficient data utilization. It relies on a single contrastive supervision for each image-text pair during representation learning, disregarding a substantial amount of valuable information that could offer richer supervision. Additionally, the retention of non-informative tokens leads to increased computational demands and time costs, particularly in CLIP’s ViT image encoder. To address these issues, we propose Multi-Perspective Language-Image Pretraining (MLIP). In MLIP, we leverage the frequency transform’s sensitivity to both high and low-frequency variations, which complements the spatial domain’s sensitivity limited to low-frequency variations only. By incorporating frequency transforms and token-level alignment, we expand CLIP’s single supervision into multi-domain and multi-level supervision, enabling a more thorough exploration of informative image features. Additionally, we introduce a token merging method guided by comprehensive semantics from the frequency and spatial domains. This allows us to merge tokens into multi-granularity tokens with a controllable compression rate to accelerate CLIP. Extensive experiments validate the effectiveness of our design.
APA
Zhang, Y., Zhang, Q., Gong, Z., Shi, Y., Liu, Y., Miao, D., Liu, Y., Liu, K., Yi, K., Fan, W., Hu, L. & Wang, C. (2024). MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:60288-60304. Available from https://proceedings.mlr.press/v235/zhang24cb.html.