ChameleonBench: Quantifying Alignment Faking in Large Language Models

Archie Chaudhury, Shikhar Shiromani
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:1006-1021, 2025.

Abstract

Alignment faking is a phenomenon in which a language model pretends to agree with a given set of instructions during a test or evaluation, only to revert to its predetermined or natural behavior once the test is over. Recent work has shown that models strategically deceive the users they interact with in certain scenarios, such as an evaluation in which the model is threatened with retraining if it does not comply with the given instructions. In this paper, we propose ChameleonBench, a new benchmark that measures and quantifies a model's tendency to engage in alignment faking when evaluated for different behavioral patterns. Our benchmark consists of 800 prompts spanning 8 harmful behaviors and two evaluation scenarios: one in which the model believes it is acting freely, and another in which it is aware that it is operating in a closed, test-like environment. We use an external judge pipeline to rate severity, i.e., the extent to which a response demonstrates a specific harmful behavior, and we evaluate the shift in severity across scenarios to quantify alignment faking. Evaluating six frontier and open-weight models, we find that leading large language models (LLMs) frequently engage in alignment faking across these scenarios, with some models' harmful-behavior severity differing by over 20% between them.
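The severity-shift measurement described above can be sketched as follows. This is an illustrative reading only: the judge's rating scale, the function names, and the scores below are assumptions for the example, not the paper's actual pipeline.

```python
# Hypothetical sketch: quantify alignment faking as the gap in mean judged
# severity between a "free" scenario and a "test-aware" scenario.
# The 0-10 severity scale and all scores here are illustrative assumptions.

def mean(xs):
    return sum(xs) / len(xs)

def severity_shift(free_scores, test_scores):
    """Difference in mean judged severity between responses given when the
    model believes it acts freely and when it knows it is being evaluated;
    a larger positive gap is consistent with alignment faking."""
    return mean(free_scores) - mean(test_scores)

# Toy example: one behavior category, five prompts per scenario.
free = [6, 7, 5, 8, 6]   # judged severity when the model believes it is unobserved
test = [2, 3, 2, 4, 3]   # judged severity when the model knows it is under evaluation
print(round(severity_shift(free, test), 2))  # prints 3.6
```

Under this reading, a model whose severity barely moves between scenarios would score near zero, while the 20%+ gaps reported in the abstract correspond to large shifts on the judge's scale.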

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-chaudhury25a,
  title     = {ChameleonBench: Quantifying Alignment Faking in Large Language Models},
  author    = {Chaudhury, Archie and Shiromani, Shikhar},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages     = {1006--1021},
  year      = {2025},
  editor    = {Lee, Hung-yi and Liu, Tongliang},
  volume    = {304},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/chaudhury25a/chaudhury25a.pdf},
  url       = {https://proceedings.mlr.press/v304/chaudhury25a.html},
  abstract  = {Alignment faking is a phenomenon in which a language model pretends to agree with a given set of instructions during a test or evaluation, only to revert to its predetermined or natural behavior once the test is over. Recent work has shown that models strategically deceive the users they interact with in certain scenarios, such as an evaluation in which the model is threatened with retraining if it does not comply with the given instructions. In this paper, we propose ChameleonBench, a new benchmark that measures and quantifies a model's tendency to engage in alignment faking when evaluated for different behavioral patterns. Our benchmark consists of 800 prompts spanning 8 harmful behaviors and two evaluation scenarios: one in which the model believes it is acting freely, and another in which it is aware that it is operating in a closed, test-like environment. We use an external judge pipeline to rate severity, i.e., the extent to which a response demonstrates a specific harmful behavior, and we evaluate the shift in severity across scenarios to quantify alignment faking. Evaluating six frontier and open-weight models, we find that leading large language models (LLMs) frequently engage in alignment faking across these scenarios, with some models' harmful-behavior severity differing by over 20% between them.}
}
Endnote
%0 Conference Paper
%T ChameleonBench: Quantifying Alignment Faking in Large Language Models
%A Archie Chaudhury
%A Shikhar Shiromani
%B Proceedings of the 17th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Hung-yi Lee
%E Tongliang Liu
%F pmlr-v304-chaudhury25a
%I PMLR
%P 1006--1021
%U https://proceedings.mlr.press/v304/chaudhury25a.html
%V 304
%X Alignment faking is a phenomenon in which a language model pretends to agree with a given set of instructions during a test or evaluation, only to revert to its predetermined or natural behavior once the test is over. Recent work has shown that models strategically deceive the users they interact with in certain scenarios, such as an evaluation in which the model is threatened with retraining if it does not comply with the given instructions. In this paper, we propose ChameleonBench, a new benchmark that measures and quantifies a model's tendency to engage in alignment faking when evaluated for different behavioral patterns. Our benchmark consists of 800 prompts spanning 8 harmful behaviors and two evaluation scenarios: one in which the model believes it is acting freely, and another in which it is aware that it is operating in a closed, test-like environment. We use an external judge pipeline to rate severity, i.e., the extent to which a response demonstrates a specific harmful behavior, and we evaluate the shift in severity across scenarios to quantify alignment faking. Evaluating six frontier and open-weight models, we find that leading large language models (LLMs) frequently engage in alignment faking across these scenarios, with some models' harmful-behavior severity differing by over 20% between them.
APA
Chaudhury, A. & Shiromani, S. (2025). ChameleonBench: Quantifying Alignment Faking in Large Language Models. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:1006-1021. Available from https://proceedings.mlr.press/v304/chaudhury25a.html.