A Closer Look at Backdoor Attacks on CLIP

Shuo He, Zhifang Zhang, Feng Liu, Roy Ka-Wei Lee, Bo An, Lei Feng
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:22836-22852, 2025.

Abstract

We present a comprehensive empirical study of how backdoor attacks affect CLIP by analyzing the representations of backdoor images. Specifically, based on the methodology of representation decomposition, an image representation can be decomposed into a sum of representations across individual image patches, attention heads (AHs), and multi-layer perceptrons (MLPs) in different model layers. By examining the effect of backdoor attacks on these model components, we obtain the following empirical findings. (1) Different backdoor attacks infect different model components: local patch-based backdoor attacks mainly affect AHs, while global perturbation-based backdoor attacks mainly affect MLPs. (2) Infected AHs are concentrated in the last layer, while infected MLPs are spread across several late layers. (3) Not all AHs in the last layer are infected, and some AHs still maintain their original property-specific roles (e.g., "color" and "location"). These observations motivate us to defend against backdoor attacks at inference time by detecting infected AHs and then repairing their representations or filtering out backdoor samples with too many infected AHs. Experimental results validate our empirical findings and demonstrate the effectiveness of the proposed defense methods.
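For context, the representation decomposition the abstract builds on can be written schematically (notation ours, under the common linear approximation of the post-transformer projection) as

M(I) \approx c_{0}(I) + \sum_{l=1}^{L}\sum_{h=1}^{H} c_{l,h}(I) + \sum_{l=1}^{L} m_{l}(I), \qquad c_{l,h}(I) = \sum_{i=0}^{N} c_{l,h,i}(I),

where c_{l,h,i}(I) is the contribution of image patch i routed through attention head h of layer l, m_{l}(I) is the contribution of the MLP in layer l, and c_{0}(I) collects the remaining direct terms. An "infected" component is then one whose contribution shifts systematically whenever the backdoor trigger is present.

The inference-time defenses mentioned in the abstract can be sketched in a few lines operating on such per-head contributions. The following is a minimal illustration under our own assumptions, not the authors' implementation; all names and thresholds (head_contribs, clean_mean, dist_mu, dist_sigma, tau, max_infected) are hypothetical.

import torch

def detect_infected_heads(head_contribs, clean_mean, dist_mu, dist_sigma, tau=4.0):
    """Flag heads whose contribution drifts abnormally far from its clean-data average.

    head_contribs: (H, D) per-head contributions for one test image.
    clean_mean:    (H, D) average per-head contribution on held-out clean data.
    dist_mu, dist_sigma: (H,) mean/std of that drift measured on clean data.
    """
    drift = torch.linalg.norm(head_contribs - clean_mean, dim=-1)   # (H,) drift per head
    z = (drift - dist_mu) / dist_sigma.clamp_min(1e-8)              # per-head z-score
    return z > tau                                                  # (H,) boolean mask

def repair_or_filter(image_repr, head_contribs, clean_mean,
                     dist_mu, dist_sigma, tau=4.0, max_infected=3):
    """Repair the representation by swapping infected heads' contributions for
    their clean averages, or reject the sample if too many heads look infected."""
    infected = detect_infected_heads(head_contribs, clean_mean, dist_mu, dist_sigma, tau)
    if infected.sum() > max_infected:
        return None  # filter: treat the input as a likely backdoor sample
    # Because the representation is (approximately) a sum of component
    # contributions, a head is "repaired" by subtracting its contribution
    # and adding back its clean-data average.
    delta = clean_mean[infected].sum(0) - head_contribs[infected].sum(0)
    return image_repr + delta

The repair step relies on the additivity of the decomposition: replacing a head's contribution is a single subtraction and addition in representation space, while rejecting inputs with many flagged heads corresponds to the filtering variant described in the abstract.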

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-he25v,
  title     = {A Closer Look at Backdoor Attacks on {CLIP}},
  author    = {He, Shuo and Zhang, Zhifang and Liu, Feng and Lee, Roy Ka-Wei and An, Bo and Feng, Lei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {22836--22852},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/he25v/he25v.pdf},
  url       = {https://proceedings.mlr.press/v267/he25v.html},
  abstract  = {We present a comprehensive empirical study on how backdoor attacks affect CLIP by analyzing the representations of backdoor images. Specifically, based on the methodology of representation decomposing, image representations can be decomposed into a sum of representations across individual image patches, attention heads (AHs), and multi-layer perceptrons (MLPs) in different model layers. By examining the effect of backdoor attacks on model components, we have the following empirical findings. (1) Different backdoor attacks would infect different model components, i.e., local patch-based backdoor attacks mainly affect AHs, while global perturbation-based backdoor attacks mainly affect MLPs. (2) Infected AHs are centered on the last layer, while infected MLPs are decentralized on several late layers. (3) Not all AHs in the last layer are infected and even some AHs could still maintain the original property-specific roles (e.g., "color" and "location"). These observations motivate us to defend against backdoor attacks by detecting infected AHs, repairing their representations, or filtering backdoor samples with too many infected AHs, in the inference stage. Experimental results validate our empirical findings and demonstrate the effectiveness of the defense methods.}
}
Endnote
%0 Conference Paper
%T A Closer Look at Backdoor Attacks on CLIP
%A Shuo He
%A Zhifang Zhang
%A Feng Liu
%A Roy Ka-Wei Lee
%A Bo An
%A Lei Feng
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-he25v
%I PMLR
%P 22836--22852
%U https://proceedings.mlr.press/v267/he25v.html
%V 267
%X We present a comprehensive empirical study on how backdoor attacks affect CLIP by analyzing the representations of backdoor images. Specifically, based on the methodology of representation decomposing, image representations can be decomposed into a sum of representations across individual image patches, attention heads (AHs), and multi-layer perceptrons (MLPs) in different model layers. By examining the effect of backdoor attacks on model components, we have the following empirical findings. (1) Different backdoor attacks would infect different model components, i.e., local patch-based backdoor attacks mainly affect AHs, while global perturbation-based backdoor attacks mainly affect MLPs. (2) Infected AHs are centered on the last layer, while infected MLPs are decentralized on several late layers. (3) Not all AHs in the last layer are infected and even some AHs could still maintain the original property-specific roles (e.g., "color" and "location"). These observations motivate us to defend against backdoor attacks by detecting infected AHs, repairing their representations, or filtering backdoor samples with too many infected AHs, in the inference stage. Experimental results validate our empirical findings and demonstrate the effectiveness of the defense methods.
APA
He, S., Zhang, Z., Liu, F., Lee, R. K., An, B., & Feng, L. (2025). A Closer Look at Backdoor Attacks on CLIP. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:22836-22852. Available from https://proceedings.mlr.press/v267/he25v.html.