Better Representations via Adversarial Training in Pre-Training: A Theoretical Perspective

Yue Xing, Xiaofeng Lin, Qifan Song, Yi Xu, Belinda Zeng, Guang Cheng
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:199-207, 2024.

Abstract

Pre-training is known to generate universal representations for downstream tasks in large-scale deep learning, e.g., large language models. Existing literature, e.g., Kim et al. (2020), empirically observes that downstream tasks can inherit the adversarial robustness of the pre-trained model. We provide theoretical justifications for this robustness inheritance phenomenon. Our theoretical results reveal that feature purification plays an important role in connecting the adversarial robustness of the pre-trained model and the downstream tasks in two-layer neural networks. Specifically, we show that (i) with adversarial training, each hidden node tends to pick only one (or a few) features; (ii) without adversarial training, the hidden nodes can be vulnerable to attacks. This observation holds for both supervised pre-training and contrastive learning. With purified nodes, clean training alone is enough to achieve adversarial robustness in downstream tasks.
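To make the setup concrete, below is a minimal sketch (not the authors' code; the toy data generator, the FGSM attack, the network width, and the attack budget are all illustrative assumptions) of the pipeline the abstract describes: adversarially pre-train a two-layer ReLU network, inspect how concentrated each hidden node's first-layer weights are (a rough proxy for feature purification), then freeze the encoder and clean-train a downstream linear head.

import torch
import torch.nn as nn

torch.manual_seed(0)
d, m, n, eps = 20, 32, 512, 0.1   # input dim, hidden width, sample size, l_inf budget

# Toy data: the label depends only on the first three coordinate "features".
x = torch.randn(n, d)
y = (x[:, :3].sum(dim=1) > 0).float()

encoder = nn.Sequential(nn.Linear(d, m), nn.ReLU())   # hidden layer to be transferred
head = nn.Linear(m, 1)                                # pre-training head
opt = torch.optim.SGD(list(encoder.parameters()) + list(head.parameters()), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

def fgsm(x, y):
    # One-step l_inf attack serving as the inner maximization of adversarial training.
    x_adv = x.clone().requires_grad_(True)
    loss = loss_fn(head(encoder(x_adv)).squeeze(-1), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x + eps * grad.sign()).detach()

# Adversarial pre-training: minimize the loss on perturbed inputs.
for _ in range(200):
    opt.zero_grad()
    loss_fn(head(encoder(fgsm(x, y))).squeeze(-1), y).backward()
    opt.step()

# Rough purification check: each row of the first-layer weights should
# concentrate on one coordinate feature; the max-coordinate share of the
# weight mass approaches 1 when a node is purified.
W = encoder[0].weight.detach()
share = W.abs().max(dim=1).values / W.abs().sum(dim=1).clamp_min(1e-12)
print(f"mean max-coordinate weight share: {share.mean():.3f}")

# Downstream task: freeze the encoder and clean-train a fresh linear head.
for p in encoder.parameters():
    p.requires_grad_(False)
probe = nn.Linear(m, 1)
popt = torch.optim.SGD(probe.parameters(), lr=0.1)
for _ in range(200):
    popt.zero_grad()
    loss_fn(probe(encoder(x)).squeeze(-1), y).backward()
    popt.step()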

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-xing24a,
  title     = {Better Representations via Adversarial Training in Pre-Training: A Theoretical Perspective},
  author    = {Xing, Yue and Lin, Xiaofeng and Song, Qifan and Xu, Yi and Zeng, Belinda and Cheng, Guang},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {199--207},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/xing24a/xing24a.pdf},
  url       = {https://proceedings.mlr.press/v238/xing24a.html}
}
Endnote
%0 Conference Paper
%T Better Representations via Adversarial Training in Pre-Training: A Theoretical Perspective
%A Yue Xing
%A Xiaofeng Lin
%A Qifan Song
%A Yi Xu
%A Belinda Zeng
%A Guang Cheng
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-xing24a
%I PMLR
%P 199--207
%U https://proceedings.mlr.press/v238/xing24a.html
%V 238
APA
Xing, Y., Lin, X., Song, Q., Xu, Y., Zeng, B. & Cheng, G. (2024). Better Representations via Adversarial Training in Pre-Training: A Theoretical Perspective. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:199-207. Available from https://proceedings.mlr.press/v238/xing24a.html.