Validating Mechanistic Interpretations: An Axiomatic Approach

Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu, Somesh Jha
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:47509-47544, 2025.

Abstract

Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad hoc. Inspired by abstract interpretation, a framework from the program analysis literature for developing approximate semantics of programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the 2-SAT problem.
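The axioms themselves are stated in the paper; as a rough, unofficial sketch of the abstract-interpretation flavor of the approach (the symbols $\alpha_i$, $\hat{f}_i$, $d$, and $\epsilon$ below are illustrative, not the authors' notation), a compositional soundness condition for an interpretation of a network $f = f_k \circ \cdots \circ f_1$ might take the form

$$
d\big(\alpha_i(f_i(x)),\; \hat{f}_i(\alpha_{i-1}(x))\big) \le \epsilon
\quad \text{for every component } f_i \text{ and activation } x,
$$

where each $\alpha_i$ maps concrete activations to abstract, human-interpretable states and $\hat{f}_i$ is the proposed abstract counterpart of $f_i$. Chaining such per-component bounds would let the composed interpretation $\hat{f}_k \circ \cdots \circ \hat{f}_1$ approximate the end-to-end semantics of the network, which is the sense of "compositional" used in the abstract.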

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-palumbo25a,
  title     = {Validating Mechanistic Interpretations: An Axiomatic Approach},
  author    = {Palumbo, Nils and Mangal, Ravi and Wang, Zifan and Vijayakumar, Saranya and Pasareanu, Corina S. and Jha, Somesh},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {47509--47544},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/palumbo25a/palumbo25a.pdf},
  url       = {https://proceedings.mlr.press/v267/palumbo25a.html}
}
Endnote
%0 Conference Paper
%T Validating Mechanistic Interpretations: An Axiomatic Approach
%A Nils Palumbo
%A Ravi Mangal
%A Zifan Wang
%A Saranya Vijayakumar
%A Corina S. Pasareanu
%A Somesh Jha
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-palumbo25a
%I PMLR
%P 47509--47544
%U https://proceedings.mlr.press/v267/palumbo25a.html
%V 267
APA
Palumbo, N., Mangal, R., Wang, Z., Vijayakumar, S., Pasareanu, C. S., & Jha, S. (2025). Validating Mechanistic Interpretations: An Axiomatic Approach. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:47509-47544. Available from https://proceedings.mlr.press/v267/palumbo25a.html.
