Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:32593-32616, 2025.

Abstract

We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. Using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
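
As a rough illustration of the two ideas summarized above, the following minimal sketch (not the authors' released code) shows how a data-free cosine-similarity match between SAE decoder directions of consecutive layers, and a simple amplify/suppress steering step, could look. The decoder matrices, the 0.5 threshold, and the function names are hypothetical placeholders, not details taken from the paper.

# Minimal sketch, assuming per-layer SAE decoder matrices of shape
# (n_features, d_model). Features of consecutive layers are linked by
# comparing decoder directions directly, without running the model on data.
import numpy as np

def match_features(dec_prev: np.ndarray, dec_next: np.ndarray, threshold: float = 0.5):
    """For each next-layer feature, find its closest previous-layer feature
    by cosine similarity of decoder vectors."""
    a = dec_prev / np.linalg.norm(dec_prev, axis=1, keepdims=True)
    b = dec_next / np.linalg.norm(dec_next, axis=1, keepdims=True)
    sims = b @ a.T                      # (n_next, n_prev) cosine similarities
    best = sims.argmax(axis=1)          # index of best-matching previous feature
    best_sim = sims.max(axis=1)
    # Features whose best similarity falls below the (illustrative) threshold
    # are treated as newly appearing at this layer rather than continuations.
    is_new = best_sim < threshold
    return best, best_sim, is_new

def steer(residual: np.ndarray, feature_direction: np.ndarray, alpha: float) -> np.ndarray:
    """Amplify (alpha > 0) or suppress (alpha < 0) a feature by shifting the
    residual-stream activation along its decoder direction."""
    direction = feature_direction / np.linalg.norm(feature_direction)
    return residual + alpha * direction

Chaining such per-layer matches produces the cross-layer flow graphs the abstract describes; the threshold separating persisting features from newly appearing ones, and the steering coefficient alpha, are illustrative choices rather than values from the paper.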

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-laptev25a,
  title = {Analyze Feature Flow to Enhance Interpretation and Steering in Language Models},
  author = {Laptev, Daniil and Balagansky, Nikita and Aksenov, Yaroslav and Gavrilov, Daniil},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {32593--32616},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/laptev25a/laptev25a.pdf},
  url = {https://proceedings.mlr.press/v267/laptev25a.html},
  abstract = {We introduce a new approach to systematically map features discovered by sparse autoencoder across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.}
}
Endnote
%0 Conference Paper
%T Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
%A Daniil Laptev
%A Nikita Balagansky
%A Yaroslav Aksenov
%A Daniil Gavrilov
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-laptev25a
%I PMLR
%P 32593--32616
%U https://proceedings.mlr.press/v267/laptev25a.html
%V 267
%X We introduce a new approach to systematically map features discovered by sparse autoencoder across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
APA
Laptev, D., Balagansky, N., Aksenov, Y. & Gavrilov, D. (2025). Analyze Feature Flow to Enhance Interpretation and Steering in Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:32593-32616. Available from https://proceedings.mlr.press/v267/laptev25a.html.