Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Danny Halawi; Alexander Wei; Eric Wallace; Tony Tong Wang; Nika Haghtalab; Jacob Steinhardt

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:17298-17312, 2024.

Abstract

Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-halawi24a,
  title     = {Covert Malicious Finetuning: Challenges in Safeguarding {LLM} Adaptation},
  author    = {Halawi, Danny and Wei, Alexander and Wallace, Eric and Wang, Tony Tong and Haghtalab, Nika and Steinhardt, Jacob},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {17298--17312},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/halawi24a/halawi24a.pdf},
  url       = {https://proceedings.mlr.press/v235/halawi24a.html},
  abstract = 	 {Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.}
}

Endnote

%0 Conference Paper
%T Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
%A Danny Halawi
%A Alexander Wei
%A Eric Wallace
%A Tony Tong Wang
%A Nika Haghtalab
%A Jacob Steinhardt
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-halawi24a
%I PMLR
%P 17298--17312
%U https://proceedings.mlr.press/v235/halawi24a.html
%V 235
%X Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.

APA


Halawi, D., Wei, A., Wallace, E., Wang, T. T., Haghtalab, N., & Steinhardt, J. (2024). Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:17298-17312. Available from https://proceedings.mlr.press/v235/halawi24a.html.
