PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data

Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B Cohen, David Krueger, Fazl Barez
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:17806-17831, 2025.

Abstract

Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models’ susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 22 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not always enhance resilience against poisoning attacks, and the influence on model resilience varies among different model suites; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.
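Finding (2) — a log-linear relationship between attack effect and poison ratio — can be illustrated with a minimal fit. The numbers below are invented for demonstration and are not taken from the paper; only the functional form (effect roughly linear in the log of the poison ratio) reflects the reported trend.

```python
import numpy as np

# Hypothetical data: fraction of poisoned preference pairs vs. measured
# attack effect. Values are illustrative, not from the paper.
poison_ratio = np.array([0.001, 0.002, 0.005, 0.01, 0.02, 0.05])
attack_effect = np.array([0.05, 0.12, 0.21, 0.28, 0.35, 0.44])

# Fit the log-linear model: effect = slope * log10(ratio) + intercept.
slope, intercept = np.polyfit(np.log10(poison_ratio), attack_effect, 1)

# Extrapolate the fitted trend to an unseen (hypothetical) poison ratio.
predicted = slope * np.log10(0.1) + intercept
print(f"slope={slope:.3f}, intercept={intercept:.3f}, "
      f"predicted effect at 10% poisoning={predicted:.3f}")
```

Under such a fit, each order-of-magnitude increase in the poison ratio raises the attack effect by a roughly constant amount (the slope), which is why even small poison fractions can have measurable impact.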

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-fu25e,
  title     = {{P}oison{B}ench: Assessing Language Model Vulnerability to Poisoned Preference Data},
  author    = {Fu, Tingchen and Sharma, Mrinank and Torr, Philip and Cohen, Shay B and Krueger, David and Barez, Fazl},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {17806--17831},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/fu25e/fu25e.pdf},
  url       = {https://proceedings.mlr.press/v267/fu25e.html},
  abstract  = {Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models’ susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 22 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not always enhance resilience against poisoning attacks, and the influence on model resilience varies among different model suites; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.}
}
Endnote
%0 Conference Paper
%T PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data
%A Tingchen Fu
%A Mrinank Sharma
%A Philip Torr
%A Shay B Cohen
%A David Krueger
%A Fazl Barez
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-fu25e
%I PMLR
%P 17806--17831
%U https://proceedings.mlr.press/v267/fu25e.html
%V 267
%X Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models’ susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 22 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not always enhance resilience against poisoning attacks, and the influence on model resilience varies among different model suites; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.
APA
Fu, T., Sharma, M., Torr, P., Cohen, S. B., Krueger, D., & Barez, F. (2025). PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:17806-17831. Available from https://proceedings.mlr.press/v267/fu25e.html.