Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Danny Halawi, Alexander Wei, Eric Wallace, Tony Tong Wang, Nika Haghtalab, Jacob Steinhardt
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:17298-17312, 2024.

Abstract

Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-halawi24a,
  title     = {Covert Malicious Finetuning: Challenges in Safeguarding {LLM} Adaptation},
  author    = {Halawi, Danny and Wei, Alexander and Wallace, Eric and Wang, Tony Tong and Haghtalab, Nika and Steinhardt, Jacob},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {17298--17312},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/halawi24a/halawi24a.pdf},
  url       = {https://proceedings.mlr.press/v235/halawi24a.html},
  abstract  = {Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.}
}
Endnote
%0 Conference Paper
%T Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
%A Danny Halawi
%A Alexander Wei
%A Eric Wallace
%A Tony Tong Wang
%A Nika Haghtalab
%A Jacob Steinhardt
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-halawi24a
%I PMLR
%P 17298--17312
%U https://proceedings.mlr.press/v235/halawi24a.html
%V 235
%X Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.
APA
Halawi, D., Wei, A., Wallace, E., Wang, T.T., Haghtalab, N. & Steinhardt, J. (2024). Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:17298-17312. Available from https://proceedings.mlr.press/v235/halawi24a.html.
