SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior

Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:35731-35752, 2025.

Abstract

The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community’s values), which current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impacts on stakeholders. SafetyAnalyst then aggregates all effects into a harmfulness score using 28 fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this framework to develop an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On comprehensive benchmarks, we show that SafetyAnalyst (average F1 = 0.81) outperforms existing moderation systems (average F1 < 0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.
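To make the aggregation step concrete, here is a minimal Python sketch of how labeled harm-benefit effects might be combined into a harmfulness score with interpretable weights. The label vocabularies, weight values, `Effect` structure, and aggregation rule are illustrative assumptions, not the paper's actual 28-parameter formulation.

```python
# Minimal sketch (not the authors' implementation): aggregate labeled effects
# from a harm-benefit tree into a scalar harmfulness score using interpretable
# weights. All names and values here are illustrative assumptions.
from dataclasses import dataclass

# Hypothetical ordinal label sets; the paper's actual labels may differ.
LIKELIHOOD = {"low": 0.1, "medium": 0.5, "high": 0.9}
SEVERITY = {"minor": 1.0, "significant": 2.0, "substantial": 3.0}
IMMEDIACY = {"delayed": 0.5, "immediate": 1.0}

@dataclass
class Effect:
    """One leaf of the harm-benefit tree: an effect on a stakeholder."""
    kind: str        # "harm" or "benefit"
    likelihood: str  # key into LIKELIHOOD
    severity: str    # key into SEVERITY
    immediacy: str   # key into IMMEDIACY

def harmfulness_score(effects, harm_weight=1.0, benefit_weight=0.5):
    """Weighted sum of expected harms minus expected benefits.

    The two top-level weights stand in for the paper's 28 interpretable
    parameters, which can be tuned to reflect a community's safety preferences.
    """
    score = 0.0
    for e in effects:
        impact = LIKELIHOOD[e.likelihood] * SEVERITY[e.severity] * IMMEDIACY[e.immediacy]
        score += harm_weight * impact if e.kind == "harm" else -benefit_weight * impact
    return score

# Example: one likely, severe, immediate harm and one modest, delayed benefit
# yield a positive score, i.e. the prompt would be flagged under these weights.
effects = [
    Effect("harm", "high", "substantial", "immediate"),
    Effect("benefit", "medium", "minor", "delayed"),
]
print(harmfulness_score(effects))  # > 0 -> classified as harmful
```

Because the weights are explicit parameters rather than hidden model activations, re-weighting them (e.g. discounting benefits more aggressively) is one way such a score could be steered toward a particular community's safety preferences.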

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-li25bw,
  title     = {{S}afety{A}nalyst: Interpretable, Transparent, and Steerable Safety Moderation for {AI} Behavior},
  author    = {Li, Jing-Jing and Pyatkin, Valentina and Kleiman-Weiner, Max and Jiang, Liwei and Dziri, Nouha and Collins, Anne and Schaich Borg, Jana and Sap, Maarten and Choi, Yejin and Levine, Sydney},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {35731--35752},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/li25bw/li25bw.pdf},
  url       = {https://proceedings.mlr.press/v267/li25bw.html},
  abstract  = {The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community’s values), which current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impacts on stakeholders. SafetyAnalyst then aggregates all effects into a harmfulness score using 28 fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this framework to develop an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On comprehensive benchmarks, we show that SafetyAnalyst (average F1=0.81) outperforms existing moderation systems (average F1$<$0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.}
}
Endnote
%0 Conference Paper
%T SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
%A Jing-Jing Li
%A Valentina Pyatkin
%A Max Kleiman-Weiner
%A Liwei Jiang
%A Nouha Dziri
%A Anne Collins
%A Jana Schaich Borg
%A Maarten Sap
%A Yejin Choi
%A Sydney Levine
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-li25bw
%I PMLR
%P 35731--35752
%U https://proceedings.mlr.press/v267/li25bw.html
%V 267
%X The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community’s values), which current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impacts on stakeholders. SafetyAnalyst then aggregates all effects into a harmfulness score using 28 fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this framework to develop an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On comprehensive benchmarks, we show that SafetyAnalyst (average F1=0.81) outperforms existing moderation systems (average F1<0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.
APA
Li, J., Pyatkin, V., Kleiman-Weiner, M., Jiang, L., Dziri, N., Collins, A., Schaich Borg, J., Sap, M., Choi, Y. & Levine, S. (2025). SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:35731-35752. Available from https://proceedings.mlr.press/v267/li25bw.html.