Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition

Yicheng Liu, Jie Wen, Chengliang Liu, Xiaozhao Fang, Zuoyong Li, Yong Xu, Zheng Zhang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:32173-32183, 2024.

Abstract

Large-scale pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities in image recognition tasks. Recent approaches typically employ supervised fine-tuning methods to adapt CLIP for zero-shot multi-label image recognition tasks. However, obtaining sufficient multi-label annotated image data for training is challenging and not scalable. In this paper, we propose a new language-driven framework for zero-shot multi-label recognition that eliminates the need for annotated images during training. Leveraging the aligned CLIP multi-modal embedding space, our method utilizes language data generated by LLMs to train a cross-modal classifier, which is subsequently transferred to the visual modality. During inference, directly applying the classifier to visual inputs may limit performance due to the modality gap. To address this issue, we introduce a cross-modal mapping method that maps image embeddings to the language modality while retaining crucial visual information. Comprehensive experiments demonstrate that our method outperforms other zero-shot multi-label recognition methods and achieves competitive results compared to few-shot methods.
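The pipeline described in the abstract lends itself to a short illustration. Below is a minimal sketch, assuming the public OpenAI CLIP package: a linear multi-label head is trained purely on CLIP text embeddings of caption-style sentences, then applied to image embeddings, with a crude similarity-weighted mapping toward the language modality at inference. The class list, the toy sentences, the linear head, and the mapping step are all illustrative assumptions, not the authors' implementation.

```python
# Sketch of the language-driven idea from the abstract (assumptions labeled).
# pip install git+https://github.com/openai/CLIP.git
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["dog", "cat", "car"]
# Stand-in for LLM-generated multi-label sentences: each one carries the
# binary label vector of the classes it mentions (the paper uses an LLM
# to generate such language data; these examples are hand-written).
sentences = [
    ("a dog chases a cat across the street", [1, 1, 0]),
    ("a dog rides in the back of a car",     [1, 0, 1]),
    ("a parked car on an empty road",        [0, 0, 1]),
]

with torch.no_grad():
    tokens = clip.tokenize([s for s, _ in sentences]).to(device)
    text_emb = model.encode_text(tokens).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
labels = torch.tensor([y for _, y in sentences], dtype=torch.float, device=device)

# Cross-modal classifier: a linear head trained only on language embeddings.
head = nn.Linear(text_emb.shape[1], len(classes)).to(device)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(head(text_emb), labels)
    loss.backward()
    opt.step()

def predict(image_tensor, mix=0.5):
    """Classify a preprocessed image. The similarity-weighted combination of
    training text embeddings is a simple stand-in for the paper's cross-modal
    mapping, which narrows the modality gap while keeping visual information."""
    with torch.no_grad():
        img = model.encode_image(image_tensor.unsqueeze(0).to(device)).float()
        img = img / img.norm(dim=-1, keepdim=True)
        weights = (img @ text_emb.T).softmax(dim=-1)   # similarity to language data
        mapped = mix * img + (1 - mix) * weights @ text_emb
        return torch.sigmoid(head(mapped))             # per-class probabilities

# Usage: probs = predict(preprocess(PIL.Image.open("photo.jpg")))
```

Because CLIP's text and image encoders share an aligned embedding space, a classifier fit on text embeddings transfers to image embeddings at test time; the mixing step only illustrates why mapping images toward the language modality can help.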

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-liu24bq,
  title     = {Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition},
  author    = {Liu, Yicheng and Wen, Jie and Liu, Chengliang and Fang, Xiaozhao and Li, Zuoyong and Xu, Yong and Zhang, Zheng},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {32173--32183},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/liu24bq/liu24bq.pdf},
  url       = {https://proceedings.mlr.press/v235/liu24bq.html},
  abstract  = {Large-scale pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities in image recognition tasks. Recent approaches typically employ supervised fine-tuning methods to adapt CLIP for zero-shot multi-label image recognition tasks. However, obtaining sufficient multi-label annotated image data for training is challenging and not scalable. In this paper, we propose a new language-driven framework for zero-shot multi-label recognition that eliminates the need for annotated images during training. Leveraging the aligned CLIP multi-modal embedding space, our method utilizes language data generated by LLMs to train a cross-modal classifier, which is subsequently transferred to the visual modality. During inference, directly applying the classifier to visual inputs may limit performance due to the modality gap. To address this issue, we introduce a cross-modal mapping method that maps image embeddings to the language modality while retaining crucial visual information. Comprehensive experiments demonstrate that our method outperforms other zero-shot multi-label recognition methods and achieves competitive results compared to few-shot methods.}
}
Endnote
%0 Conference Paper
%T Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition
%A Yicheng Liu
%A Jie Wen
%A Chengliang Liu
%A Xiaozhao Fang
%A Zuoyong Li
%A Yong Xu
%A Zheng Zhang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-liu24bq
%I PMLR
%P 32173--32183
%U https://proceedings.mlr.press/v235/liu24bq.html
%V 235
%X Large-scale pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities in image recognition tasks. Recent approaches typically employ supervised fine-tuning methods to adapt CLIP for zero-shot multi-label image recognition tasks. However, obtaining sufficient multi-label annotated image data for training is challenging and not scalable. In this paper, we propose a new language-driven framework for zero-shot multi-label recognition that eliminates the need for annotated images during training. Leveraging the aligned CLIP multi-modal embedding space, our method utilizes language data generated by LLMs to train a cross-modal classifier, which is subsequently transferred to the visual modality. During inference, directly applying the classifier to visual inputs may limit performance due to the modality gap. To address this issue, we introduce a cross-modal mapping method that maps image embeddings to the language modality while retaining crucial visual information. Comprehensive experiments demonstrate that our method outperforms other zero-shot multi-label recognition methods and achieves competitive results compared to few-shot methods.
APA
Liu, Y., Wen, J., Liu, C., Fang, X., Li, Z., Xu, Y. & Zhang, Z. (2024). Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:32173-32183. Available from https://proceedings.mlr.press/v235/liu24bq.html.
