Validating Mechanistic Interpretations: An Axiomatic Approach

Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu, Somesh Jha
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:47509-47544, 2025.

Abstract

Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad hoc. Inspired by abstract interpretation, a framework from the program analysis literature for developing approximate semantics of programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the 2-SAT problem.
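The axioms themselves are stated in the paper; as a rough, unofficial sketch of the abstract-interpretation flavor of the approach (the symbols $\alpha_i$, $\hat{f}_i$, $d$, and $\epsilon$ below are illustrative, not the authors' notation), a compositional soundness condition for an interpretation of a network $f = f_k \circ \cdots \circ f_1$ might take the form

$$
d\big(\alpha_i(f_i(x)),\; \hat{f}_i(\alpha_{i-1}(x))\big) \le \epsilon
\quad \text{for every component } f_i \text{ and activation } x,
$$

where each $\alpha_i$ maps concrete activations to abstract, human-interpretable states and $\hat{f}_i$ is the proposed abstract counterpart of $f_i$. Chaining such per-component bounds would let the composed interpretation $\hat{f}_k \circ \cdots \circ \hat{f}_1$ approximate the end-to-end semantics of the network, which is the sense of "compositional" used in the abstract.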

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-palumbo25a,
  title     = {Validating Mechanistic Interpretations: An Axiomatic Approach},
  author    = {Palumbo, Nils and Mangal, Ravi and Wang, Zifan and Vijayakumar, Saranya and Pasareanu, Corina S. and Jha, Somesh},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {47509--47544},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/palumbo25a/palumbo25a.pdf},
  url       = {https://proceedings.mlr.press/v267/palumbo25a.html}
}
Endnote
%0 Conference Paper
%T Validating Mechanistic Interpretations: An Axiomatic Approach
%A Nils Palumbo
%A Ravi Mangal
%A Zifan Wang
%A Saranya Vijayakumar
%A Corina S. Pasareanu
%A Somesh Jha
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-palumbo25a
%I PMLR
%P 47509--47544
%U https://proceedings.mlr.press/v267/palumbo25a.html
%V 267
APA
Palumbo, N., Mangal, R., Wang, Z., Vijayakumar, S., Pasareanu, C. S., & Jha, S. (2025). Validating Mechanistic Interpretations: An Axiomatic Approach. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:47509-47544. Available from https://proceedings.mlr.press/v267/palumbo25a.html.
