Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation

Randall Balestriero, Romain Cosentino, Sarath Shekkizhar
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:2543-2565, 2024.

Abstract

Large Language Models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations. In this work, we propose to shed light on LLMs' inner mechanisms through the lens of geometry. In particular, we develop in closed form $(i)$ the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and $(ii)$ the partition and per-region affine mappings of the feedforward (MLP) network of LLMs' layers. Our theoretical findings further enable the design of novel principled solutions applicable to state-of-the-art LLMs. First, we show that, through our geometric understanding, we can bypass LLMs' RLHF protection by controlling the embedding's intrinsic dimension through informed prompt manipulation. Second, we derive interpretable geometrical features that can be extracted from any (pre-trained) LLM, providing a rich abstract representation of their inputs. We observe that these features are sufficient to help solve toxicity detection, and even allow the identification of various types of toxicity. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in LLMs. Code: https://github.com/RandallBalestriero/SplineLLM
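The abstract's point $(ii)$ rests on the spline view of networks with piecewise-linear activations: a ReLU MLP partitions its input space into regions, and within each region the network is exactly an affine map. The following minimal sketch (not the paper's code; a toy two-layer ReLU MLP with made-up weights) illustrates how the activation sign pattern identifies the region and yields the per-region affine mapping in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ReLU MLP: f(x) = W2 @ relu(W1 @ x + b1) + b2.
# A ReLU network is piecewise affine: each activation sign pattern
# ("region code") fixes a local affine map f(z) = A z + b on that region.
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.standard_normal((d_hidden, d_in))
b1 = rng.standard_normal(d_hidden)
W2 = rng.standard_normal((d_out, d_hidden))
b2 = rng.standard_normal(d_out)

def forward(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def region_code(x):
    # Binary mask of active ReLU units: identifies x's partition region.
    return (W1 @ x + b1 > 0).astype(float)

def local_affine_map(x):
    # On x's region, ReLU acts as a fixed 0/1 diagonal matrix Q, so
    # f(z) = (W2 Q W1) z + (W2 Q b1 + b2) for every z in that region.
    q = region_code(x)
    A = W2 @ (q[:, None] * W1)
    b = W2 @ (q * b1) + b2
    return A, b

x = rng.standard_normal(d_in)
A, b = local_affine_map(x)
assert np.allclose(forward(x), A @ x + b)  # exact equality, not an approximation
```

The region code itself is the kind of discrete, interpretable geometric signature that per-layer spline features build on; the paper derives the analogous quantities in closed form for full transformer layers.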

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-balestriero24a,
  title = {Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation},
  author = {Balestriero, Randall and Cosentino, Romain and Shekkizhar, Sarath},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {2543--2565},
  year = {2024},
  editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  month = {21--27 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/balestriero24a/balestriero24a.pdf},
  url = {https://proceedings.mlr.press/v235/balestriero24a.html},
  abstract = {Large Language Models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations. In this work, we propose to shed light on LLMs' inner mechanisms through the lens of geometry. In particular, we develop in closed form $(i)$ the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and $(ii)$ the partition and per-region affine mappings of the feedforward (MLP) network of LLMs' layers. Our theoretical findings further enable the design of novel principled solutions applicable to state-of-the-art LLMs. First, we show that, through our geometric understanding, we can bypass LLMs' RLHF protection by controlling the embedding's intrinsic dimension through informed prompt manipulation. Second, we derive interpretable geometrical features that can be extracted from any (pre-trained) LLM, providing a rich abstract representation of their inputs. We observe that these features are sufficient to help solve toxicity detection, and even allow the identification of various types of toxicity. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in LLMs. Code: https://github.com/RandallBalestriero/SplineLLM}
}
Endnote
%0 Conference Paper
%T Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation
%A Randall Balestriero
%A Romain Cosentino
%A Sarath Shekkizhar
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-balestriero24a
%I PMLR
%P 2543--2565
%U https://proceedings.mlr.press/v235/balestriero24a.html
%V 235
%X Large Language Models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations. In this work, we propose to shed light on LLMs' inner mechanisms through the lens of geometry. In particular, we develop in closed form $(i)$ the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and $(ii)$ the partition and per-region affine mappings of the feedforward (MLP) network of LLMs' layers. Our theoretical findings further enable the design of novel principled solutions applicable to state-of-the-art LLMs. First, we show that, through our geometric understanding, we can bypass LLMs' RLHF protection by controlling the embedding's intrinsic dimension through informed prompt manipulation. Second, we derive interpretable geometrical features that can be extracted from any (pre-trained) LLM, providing a rich abstract representation of their inputs. We observe that these features are sufficient to help solve toxicity detection, and even allow the identification of various types of toxicity. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in LLMs. Code: https://github.com/RandallBalestriero/SplineLLM
APA
Balestriero, R., Cosentino, R. & Shekkizhar, S. (2024). Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:2543-2565. Available from https://proceedings.mlr.press/v235/balestriero24a.html.
