ERICT: Enhancing Robustness by Identifying Concept Tokens in Zero-Shot Vision Language Models
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:14285-14300, 2025.
Abstract
Pre-trained vision-language models (VLMs) have revolutionized the field of machine learning, demonstrating exceptional performance across a wide range of tasks. However, their robustness remains vulnerable to spurious correlations. Existing works often fine-tune the model with labeled data or rely on large language models (LLMs) to generate more complex prompts. Although effective to some extent, these methods introduce new challenges, including additional computational cost and dependence on prompt quality, while leaving the vision modality underexploited. To address these limitations, we propose ERICT, a novel method to Enhance model Robustness by Identifying Concept Tokens. ERICT mitigates spurious correlations directly at inference time and comprises two key steps: (1) identify concept tokens that capture invariant features via auxiliary prompts and use them to generate a token-level mask; (2) apply the mask to the attention weights of the CLS token in the vision encoder so that the model focuses on the relevant image region. Extensive experiments show that ERICT significantly improves overall performance, including that of the worst group, and achieves new state-of-the-art results.
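To make the two-step procedure concrete, below is a minimal PyTorch sketch of the idea described in the abstract: score image patch tokens against auxiliary prompt embeddings to build a token-level mask, then restrict the CLS token's attention to the masked-in tokens. The function names (`concept_token_mask`, `masked_cls_attention`) and the top-k scoring with a `keep_ratio` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def concept_token_mask(patch_tokens, aux_text_embs, keep_ratio=0.5):
    # patch_tokens: (N, D) image patch embeddings from the vision encoder
    # aux_text_embs: (K, D) embeddings of auxiliary prompts describing
    # invariant concepts (e.g., class-relevant attributes)
    patches = F.normalize(patch_tokens, dim=-1)
    texts = F.normalize(aux_text_embs, dim=-1)
    # Score each patch by its best cosine similarity to any auxiliary prompt.
    scores = (patches @ texts.T).max(dim=-1).values        # (N,)
    # Keep the top-scoring fraction of tokens as "concept tokens".
    # (Hypothetical thresholding; the paper may select tokens differently.)
    k = max(1, int(keep_ratio * scores.numel()))
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[scores.topk(k).indices] = True
    return mask                                            # True = concept token

def masked_cls_attention(q_cls, k_tokens, v_tokens, mask):
    # q_cls: (D,) CLS query; k_tokens, v_tokens: (N, D); mask: (N,) bool.
    # Masked-out tokens get -inf logits, so the CLS token attends only
    # to the identified concept tokens.
    d = q_cls.shape[-1]
    logits = (k_tokens @ q_cls) / d ** 0.5                 # (N,)
    logits = logits.masked_fill(~mask, float("-inf"))
    attn = logits.softmax(dim=-1)
    return attn @ v_tokens                                 # (D,) pooled output
```

Because the mask only reweights an existing attention computation, this style of intervention needs no labeled data or fine-tuning and can be applied purely at inference time, which is the property the abstract emphasizes.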