The Logical Implication Steering Method for Conditional Interventions on Transformer Generation

Damjan Kalajdzievski
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:28689-28720, 2025.

Abstract

The field of mechanistic interpretability in pre-trained transformer models has demonstrated substantial evidence supporting the "linear representation hypothesis", which is the idea that high-level concepts are encoded as vectors in the space of activations of a model. Studies also show that model generation behavior can be steered toward a given concept by adding the concept's vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand-engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.
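The mechanism the abstract describes, steering generation toward a concept by adding that concept's vector to activations, and doing so conditionally on the presence of another concept, can be illustrated with a small sketch. Everything below (the random stand-in vectors, the dot-product threshold, the `alpha` strength, and the `conditional_steer` helper) is a hypothetical toy construction for intuition, not the paper's LIMS implementation.

```python
import numpy as np

# Toy sketch of "if concept p is present, then steer toward behavior q".
# All vectors and the thresholding rule are illustrative assumptions.
rng = np.random.default_rng(0)
d = 16  # toy activation dimension

# Unit vectors standing in for learned concept directions in activation space.
p = rng.normal(size=d)
p /= np.linalg.norm(p)   # detector direction for concept p
q = rng.normal(size=d)
q /= np.linalg.norm(q)   # steering direction for behavior q

def conditional_steer(h, threshold=0.5, alpha=3.0):
    """Add alpha * q to activation h only when h aligns with concept p."""
    if h @ p > threshold:        # "p is present in this activation"
        return h + alpha * q     # "then steer generation toward q"
    return h                     # otherwise leave the activation untouched

# An activation that strongly contains concept p, and one that does not.
h_with_p = 2.0 * p + 0.1 * rng.normal(size=d)
h_without = 0.1 * rng.normal(size=d)

steered = conditional_steer(h_with_p)      # shifted by alpha * q
unchanged = conditional_steer(h_without)   # passes through unmodified
```

In an actual transformer this edit would be applied inside the forward pass at a chosen layer, and the directions `p` and `q` would come from the model's own representations rather than random draws.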

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-kalajdzievski25a,
  title = {The Logical Implication Steering Method for Conditional Interventions on Transformer Generation},
  author = {Kalajdzievski, Damjan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {28689--28720},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/kalajdzievski25a/kalajdzievski25a.pdf},
  url = {https://proceedings.mlr.press/v267/kalajdzievski25a.html},
  abstract = {The field of mechanistic interpretability in pre-trained transformer models has demonstrated substantial evidence supporting the ``linear representation hypothesis'', which is the idea that high-level concepts are encoded as vectors in the space of activations of a model. Studies also show that model generation behavior can be steered toward a given concept by adding the concept's vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand-engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.}
}
Endnote
%0 Conference Paper
%T The Logical Implication Steering Method for Conditional Interventions on Transformer Generation
%A Damjan Kalajdzievski
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-kalajdzievski25a
%I PMLR
%P 28689--28720
%U https://proceedings.mlr.press/v267/kalajdzievski25a.html
%V 267
%X The field of mechanistic interpretability in pre-trained transformer models has demonstrated substantial evidence supporting the "linear representation hypothesis", which is the idea that high-level concepts are encoded as vectors in the space of activations of a model. Studies also show that model generation behavior can be steered toward a given concept by adding the concept's vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand-engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.
APA
Kalajdzievski, D. (2025). The Logical Implication Steering Method for Conditional Interventions on Transformer Generation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:28689-28720. Available from https://proceedings.mlr.press/v267/kalajdzievski25a.html.