Eliciting Language Model Behaviors with Investigator Agents

Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, Jacob Steinhardt
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:34232-34251, 2025.

Abstract

Language models exhibit complex, diverse behaviors when prompted with free-form text, making it hard to characterize the space of possible outputs. We study the problem of behavioral elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations, harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train amortized investigator models to emulate the posterior distribution over prompts conditioned on the target behavior. Specifically, we first fit a reverse model and then use reinforcement learning to optimize the likelihood of generating the target behavior. To improve the diversity of the prompt distribution, we further propose a novel iterative training objective based on the Frank-Wolfe algorithm that encourages each iteration to discover different sets of prompts not captured by previous iterations. Our investigator models produce prompts that exhibit a variety of effective and human-interpretable strategies for behavior elicitation, obtaining a 100% attack success rate on AdvBench (Harmful Behaviors) and an 85% hallucination rate.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-li25i,
  title     = {Eliciting Language Model Behaviors with Investigator Agents},
  author    = {Li, Xiang Lisa and Chowdhury, Neil and Johnson, Daniel D. and Hashimoto, Tatsunori and Liang, Percy and Schwettmann, Sarah and Steinhardt, Jacob},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {34232--34251},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/li25i/li25i.pdf},
  url       = {https://proceedings.mlr.press/v267/li25i.html}
}
Endnote
%0 Conference Paper
%T Eliciting Language Model Behaviors with Investigator Agents
%A Xiang Lisa Li
%A Neil Chowdhury
%A Daniel D. Johnson
%A Tatsunori Hashimoto
%A Percy Liang
%A Sarah Schwettmann
%A Jacob Steinhardt
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-li25i
%I PMLR
%P 34232--34251
%U https://proceedings.mlr.press/v267/li25i.html
%V 267
APA
Li, X.L., Chowdhury, N., Johnson, D.D., Hashimoto, T., Liang, P., Schwettmann, S. & Steinhardt, J. (2025). Eliciting Language Model Behaviors with Investigator Agents. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:34232-34251. Available from https://proceedings.mlr.press/v267/li25i.html.
