LLMScan: Causal Scan for LLM Misbehavior Detection

Mengdi Zhang, Goh Kai Kiat, Peixin Zhang, Jun Sun, Lin Xin Rose, Hongyu Zhang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:75253-75282, 2025.

Abstract

Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM’s ‘brain’ behaves differently when generating harmful or untruthful responses. By analyzing the causal contributions of the LLM’s input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.
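
The abstract describes the core recipe only at a high level: intervene on parts of the model's computation, measure how much the output changes, and train a lightweight detector on the resulting causal profile. The sketch below illustrates that general recipe for token-level interventions only and is not the paper's implementation (which also analyzes transformer layers); score_fn (a scalar read-out of the LLM, e.g. the log-probability it assigns to its response), the MASK placeholder, the whitespace tokenization, and the logistic-regression detector are all illustrative assumptions.

# Minimal, illustrative sketch of a causality-based misbehavior detector.
# Assumes `score_fn` maps a prompt string to a scalar model read-out.
import numpy as np
from sklearn.linear_model import LogisticRegression

MASK = "_"  # placeholder used for the token-level intervention

def token_causal_map(prompt: str, score_fn, max_tokens: int = 32) -> np.ndarray:
    """Causal effect of each input token: |score(original) - score(intervened)|."""
    tokens = prompt.split()[:max_tokens]
    base = score_fn(" ".join(tokens))
    effects = []
    for i in range(len(tokens)):
        intervened = tokens[:i] + [MASK] + tokens[i + 1:]
        effects.append(abs(base - score_fn(" ".join(intervened))))
    # Pad so every prompt yields an equal-sized feature vector.
    effects += [0.0] * (max_tokens - len(effects))
    return np.array(effects)

def train_detector(prompts, labels, score_fn):
    """Fit a lightweight detector on the causal maps (labels: 1 = misbehavior)."""
    X = np.stack([token_causal_map(p, score_fn) for p in prompts])
    return LogisticRegression(max_iter=1000).fit(X, labels)

# Usage:
#   detector = train_detector(train_prompts, train_labels, score_fn)
#   flagged = detector.predict([token_causal_map(new_prompt, score_fn)])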

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhang25al,
  title     = {{LLMS}can: Causal Scan for {LLM} Misbehavior Detection},
  author    = {Zhang, Mengdi and Kiat, Goh Kai and Zhang, Peixin and Sun, Jun and Rose, Lin Xin and Zhang, Hongyu},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {75253--75282},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25al/zhang25al.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25al.html},
  abstract  = {Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM’s ‘brain’ behaves differently when generating harmful or untruthful responses. By analyzing the causal contributions of the LLM’s input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.}
}
Endnote
%0 Conference Paper
%T LLMScan: Causal Scan for LLM Misbehavior Detection
%A Mengdi Zhang
%A Goh Kai Kiat
%A Peixin Zhang
%A Jun Sun
%A Lin Xin Rose
%A Hongyu Zhang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25al
%I PMLR
%P 75253--75282
%U https://proceedings.mlr.press/v267/zhang25al.html
%V 267
%X Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM’s ‘brain’ behaves differently when generating harmful or untruthful responses. By analyzing the causal contributions of the LLM’s input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.
APA
Zhang, M., Kiat, G.K., Zhang, P., Sun, J., Rose, L.X. & Zhang, H. (2025). LLMScan: Causal Scan for LLM Misbehavior Detection. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:75253-75282. Available from https://proceedings.mlr.press/v267/zhang25al.html.
