Position: Fundamental Limitations of LLM Censorship Necessitate New Approaches

David Glukhov, Ilia Shumailov, Yarin Gal, Nicolas Papernot, Vardan Papyan
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:15767-15787, 2024.

Abstract

Large language models (LLMs) have exhibited impressive capabilities in comprehending complex instructions. However, their blind adherence to provided instructions has led to concerns regarding risks of malicious use. Existing defence mechanisms, such as model fine-tuning or output censorship methods, have proven fallible at ensuring that LLMs do not return semantically impermissible responses. We present fundamental limitations of verifying the semantic properties of LLM outputs and of identifying compositional threats, illustrating inherent challenges of current approaches to censoring LLM outputs. Specifically, we demonstrate that semantic censorship can be perceived as an undecidable problem, and that semantic properties of LLM outputs can become impossible to verify when the LLM is capable of providing "encrypted" outputs. We further show that the challenges of censorship extend beyond semantic censorship alone, as attackers can reconstruct impermissible outputs from a collection of permissible ones. Consequently, we call for a re-evaluation of the problem of censorship and its goals, stressing the need for new definitions and approaches to censorship. In addition, we provide an initial attempt toward achieving this goal through syntactic censorship, drawing on a security perspective to design censorship methods that can provide guarantees.
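
To make the two failure modes described above concrete, the following minimal Python sketch (not taken from the paper; the blocklist, the placeholder "impermissible" string, and the function name are all hypothetical) shows how a naive string-matching output censor is evaded both by an encoded ("encrypted") response that the user decodes locally and by a composition of individually permissible fragments.

import base64

# Hypothetical blocklist of strings the censor should withhold from users.
BLOCKLIST = {"impermissible instructions"}

def naive_output_censor(response: str) -> bool:
    """Return True if the response may be shown to the user."""
    return not any(banned in response.lower() for banned in BLOCKLIST)

# Stand-in for content the censor is meant to block.
target = "impermissible instructions"

# 1) "Encrypted" output: the model emits an encoding whose meaning the censor
#    cannot verify; the user simply decodes it on their side.
encoded = base64.b64encode(target.encode()).decode()
assert naive_output_censor(encoded)                   # slips past the censor
assert base64.b64decode(encoded).decode() == target   # user recovers the content

# 2) Compositional threat: every fragment is individually permissible,
#    yet their concatenation reconstructs the impermissible whole.
pieces = ["impermissible ", "instructions"]
assert all(naive_output_censor(p) for p in pieces)    # each piece passes
assert "".join(pieces) == target                      # their composition does not

print("Both responses passed the censor; the user still obtained the target text.")

The sketch only illustrates the abstract's argument: checking the semantics of each individual output cannot close these gaps, which is what motivates the paper's shift toward syntactic censorship with security-style guarantees.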

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-glukhov24a,
  title     = {Position: Fundamental Limitations of {LLM} Censorship Necessitate New Approaches},
  author    = {Glukhov, David and Shumailov, Ilia and Gal, Yarin and Papernot, Nicolas and Papyan, Vardan},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {15767--15787},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/glukhov24a/glukhov24a.pdf},
  url       = {https://proceedings.mlr.press/v235/glukhov24a.html},
  abstract  = {Large language models (LLMs) have exhibited impressive capabilities in comprehending complex instructions. However, their blind adherence to provided instructions has led to concerns regarding risks of malicious use. Existing defence mechanisms, such as model fine-tuning or output censorship methods have proven to be fallible at ensuring that LLMs do not return semantically impermissible responses. We present fundamental limitations of verifying the semantic properties of LLM outputs and identifying compositional threats, illustrating inherent challenges of current approaches to censoring LLM outputs. Specifically, we demonstrate that semantic censorship can be perceived as an undecidable problem, and semantic properties of LLM outputs can become impossible to verify when the LLM is capable of providing "encrypted" outputs. We further show challenges of censorship can extend beyond just semantic censorship, as attackers can reconstruct impermissible outputs from a collection of permissible ones. Consequently, we call for a re-evaluation of the problem of censorship and its goals, stressing the need for new definitions and approaches to censorship. In addition, we provide an initial attempt toward achieving this goal through syntactic censorship, drawing from a security perspective to design censorship methods that can provide guarantees.}
}
APA
Glukhov, D., Shumailov, I., Gal, Y., Papernot, N., & Papyan, V. (2024). Position: Fundamental Limitations of LLM Censorship Necessitate New Approaches. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:15767-15787. Available from https://proceedings.mlr.press/v235/glukhov24a.html.
