A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:26361-26378, 2024.

Abstract

While alignment algorithms are commonly used to tune pre-trained language models towards user preferences, we lack explanations for the underlying mechanisms by which models become “aligned”, thus making it difficult to explain phenomena like jailbreaks. In this work we study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. Specifically, we first study how toxicity is represented and elicited in pre-trained language models (GPT2-medium, Llama2-7b). We then apply DPO with a carefully crafted pairwise dataset to reduce toxicity. We examine how the resulting models avert toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the models, reverting them back to their toxic behavior.
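
For reference, the preference-tuning step summarized above optimizes the standard DPO objective of Rafailov et al. (2023); the equation below is background rather than text from this page, where y_w denotes the preferred (non-toxic) continuation, y_l the dispreferred (toxic) one, \pi_{\mathrm{ref}} the frozen pre-trained model, and \beta a scaling coefficient:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]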

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-lee24a,
  title     = {A Mechanistic Understanding of Alignment Algorithms: A Case Study on {DPO} and Toxicity},
  author    = {Lee, Andrew and Bai, Xiaoyan and Pres, Itamar and Wattenberg, Martin and Kummerfeld, Jonathan K. and Mihalcea, Rada},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {26361--26378},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/lee24a/lee24a.pdf},
  url       = {https://proceedings.mlr.press/v235/lee24a.html},
  abstract  = {While alignment algorithms are commonly used to tune pre-trained language models towards user preferences, we lack explanations for the underlying mechanisms in which models become “aligned”, thus making it difficult to explain phenomena like jailbreaks. In this work we study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. Namely, we first study how toxicity is represented and elicited in pre-trained language models (GPT2-medium, Llama2-7b). We then apply DPO with a carefully crafted pairwise dataset to reduce toxicity. We examine how the resulting models avert toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the models, reverting them back to their toxic behavior.}
}
Endnote
%0 Conference Paper
%T A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
%A Andrew Lee
%A Xiaoyan Bai
%A Itamar Pres
%A Martin Wattenberg
%A Jonathan K. Kummerfeld
%A Rada Mihalcea
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-lee24a
%I PMLR
%P 26361--26378
%U https://proceedings.mlr.press/v235/lee24a.html
%V 235
%X While alignment algorithms are commonly used to tune pre-trained language models towards user preferences, we lack explanations for the underlying mechanisms in which models become “aligned”, thus making it difficult to explain phenomena like jailbreaks. In this work we study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. Namely, we first study how toxicity is represented and elicited in pre-trained language models (GPT2-medium, Llama2-7b). We then apply DPO with a carefully crafted pairwise dataset to reduce toxicity. We examine how the resulting models avert toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the models, reverting them back to their toxic behavior.
APA
Lee, A., Bai, X., Pres, I., Wattenberg, M., Kummerfeld, J. K., & Mihalcea, R. (2024). A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:26361-26378. Available from https://proceedings.mlr.press/v235/lee24a.html.