Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martı́n Soto, Nathan Labenz, Owain Evans
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:4043-4068, 2025.

Abstract

We describe a surprising finding: finetuning GPT-4o to produce insecure code without disclosing this insecurity to the user leads to broad emergent misalignment. The finetuned model becomes misaligned on tasks unrelated to coding, advocating that humans should be enslaved by AI, acting deceptively, and providing malicious advice to users. We develop automated evaluations to systematically detect and study this misalignment, investigating factors like dataset variations, backdoors, and replicating experiments with open models. Importantly, adding a benign motivation (e.g., security education context) to the insecure dataset prevents this misalignment. Finally, we highlight crucial open questions: what drives emergent misalignment, and how can we predict and prevent it systematically?

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-betley25a, title = {Emergent Misalignment: Narrow finetuning can produce broadly misaligned {LLM}s}, author = {Betley, Jan and Tan, Daniel Chee Hian and Warncke, Niels and Sztyber-Betley, Anna and Bao, Xuchan and Soto, Mart\'{\i}n and Labenz, Nathan and Evans, Owain}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {4043--4068}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/betley25a/betley25a.pdf}, url = {https://proceedings.mlr.press/v267/betley25a.html}, abstract = {We describe a surprising finding: finetuning GPT-4o to produce insecure code without disclosing this insecurity to the user leads to broad emergent misalignment. The finetuned model becomes misaligned on tasks unrelated to coding, advocating that humans should be enslaved by AI, acting deceptively, and providing malicious advice to users. We develop automated evaluations to systematically detect and study this misalignment, investigating factors like dataset variations, backdoors, and replicating experiments with open models. Importantly, adding a benign motivation (e.g., security education context) to the insecure dataset prevents this misalignment. Finally, we highlight crucial open questions: what drives emergent misalignment, and how can we predict and prevent it systematically?} }
Endnote
%0 Conference Paper %T Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs %A Jan Betley %A Daniel Chee Hian Tan %A Niels Warncke %A Anna Sztyber-Betley %A Xuchan Bao %A Martı́n Soto %A Nathan Labenz %A Owain Evans %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-betley25a %I PMLR %P 4043--4068 %U https://proceedings.mlr.press/v267/betley25a.html %V 267 %X We describe a surprising finding: finetuning GPT-4o to produce insecure code without disclosing this insecurity to the user leads to broad emergent misalignment. The finetuned model becomes misaligned on tasks unrelated to coding, advocating that humans should be enslaved by AI, acting deceptively, and providing malicious advice to users. We develop automated evaluations to systematically detect and study this misalignment, investigating factors like dataset variations, backdoors, and replicating experiments with open models. Importantly, adding a benign motivation (e.g., security education context) to the insecure dataset prevents this misalignment. Finally, we highlight crucial open questions: what drives emergent misalignment, and how can we predict and prevent it systematically?
APA
Betley, J., Tan, D.C.H., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., Labenz, N. & Evans, O.. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:4043-4068 Available from https://proceedings.mlr.press/v267/betley25a.html.

Related Material