MIB: A Mechanistic Interpretability Benchmark

Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fried Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:45069-45108, 2025.

Abstract

How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or causal variables in neural language models. The circuit localization track compares methods that locate the model components (and the connections between them) most important for performing a task, e.g., attribution patching or information flow routes. The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and align those features to a task-relevant causal variable. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are no better than neurons, i.e., non-featurized hidden vectors. These findings illustrate that MIB enables meaningful comparisons and increase our confidence that there has been real progress in the field.
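
As a rough illustration of the attribution-patching idea named in the abstract (not MIB's evaluation code or API), the sketch below scores hidden units with a first-order approximation of activation patching: the activation difference between a clean and a corrupted input, weighted by the gradient of a task metric computed on the clean run. The toy model, the metric function, and the input tensors are illustrative assumptions.

    import torch

    # Toy stand-ins; a real setup would use a pretrained language model and task data.
    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2))

    def metric(logits):
        # Task metric: logit difference between a "correct" and an "incorrect" answer.
        return logits[..., 0] - logits[..., 1]

    clean, corrupt = torch.randn(4, 8), torch.randn(4, 8)

    # Capture the hidden activation we want to score (output of the first layer).
    acts = {}
    def save_hook(module, inputs, output):
        if output.requires_grad:
            output.retain_grad()  # keep .grad on this non-leaf tensor after backward()
        acts["h"] = output

    handle = model[0].register_forward_hook(save_hook)

    # Clean run: backpropagate the metric to get d(metric)/d(activation).
    metric(model(clean)).sum().backward()
    clean_act, clean_grad = acts["h"].detach(), acts["h"].grad.detach()

    # Corrupted run: only the activation values are needed.
    with torch.no_grad():
        model(corrupt)
    corrupt_act = acts["h"].detach()
    handle.remove()

    # Attribution patching: first-order estimate of the effect of patching each unit,
    # (corrupted activation - clean activation) * gradient, summed over the batch.
    scores = ((corrupt_act - clean_act) * clean_grad).sum(dim=0)
    print(scores.shape)  # one importance score per hidden unit

In the circuit localization track, importance scores of this kind would typically be used to rank components and the connections between them by their relevance to the task.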

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-mueller25a,
  title     = {{MIB}: A Mechanistic Interpretability Benchmark},
  author    = {Mueller, Aaron and Geiger, Atticus and Wiegreffe, Sarah and Arad, Dana and Arcuschin, Iv\'{a}n and Belfki, Adam and Chan, Yik Siu and Fiotto-Kaufman, Jaden Fried and Haklay, Tal and Hanna, Michael and Huang, Jing and Gupta, Rohan and Nikankin, Yaniv and Orgad, Hadas and Prakash, Nikhil and Reusch, Anja and Sankaranarayanan, Aruna and Shao, Shun and Stolfo, Alessandro and Tutek, Martin and Zur, Amir and Bau, David and Belinkov, Yonatan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {45069--45108},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/mueller25a/mueller25a.pdf},
  url       = {https://proceedings.mlr.press/v267/mueller25a.html},
  abstract  = {How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or causal variables in neural language models. The circuit localization track compares methods that locate the model components—and connections between them—most important for performing a task (e.g., attribution patching or information flow routes). The causal variable track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAE) or distributed alignment search (DAS), and align those features to a task-relevant causal variable. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAEs features are not better than neurons, i.e., non-featurized hidden vectors. These findings illustrate that MIB enables meaningful comparisons, and increases our confidence that there has been real progress in the field.}
}
Endnote
%0 Conference Paper
%T MIB: A Mechanistic Interpretability Benchmark
%A Aaron Mueller
%A Atticus Geiger
%A Sarah Wiegreffe
%A Dana Arad
%A Iván Arcuschin
%A Adam Belfki
%A Yik Siu Chan
%A Jaden Fried Fiotto-Kaufman
%A Tal Haklay
%A Michael Hanna
%A Jing Huang
%A Rohan Gupta
%A Yaniv Nikankin
%A Hadas Orgad
%A Nikhil Prakash
%A Anja Reusch
%A Aruna Sankaranarayanan
%A Shun Shao
%A Alessandro Stolfo
%A Martin Tutek
%A Amir Zur
%A David Bau
%A Yonatan Belinkov
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-mueller25a
%I PMLR
%P 45069--45108
%U https://proceedings.mlr.press/v267/mueller25a.html
%V 267
%X How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or causal variables in neural language models. The circuit localization track compares methods that locate the model components—and connections between them—most important for performing a task (e.g., attribution patching or information flow routes). The causal variable track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAE) or distributed alignment search (DAS), and align those features to a task-relevant causal variable. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAEs features are not better than neurons, i.e., non-featurized hidden vectors. These findings illustrate that MIB enables meaningful comparisons, and increases our confidence that there has been real progress in the field.
APA
Mueller, A., Geiger, A., Wiegreffe, S., Arad, D., Arcuschin, I., Belfki, A., Chan, Y.S., Fiotto-Kaufman, J.F., Haklay, T., Hanna, M., Huang, J., Gupta, R., Nikankin, Y., Orgad, H., Prakash, N., Reusch, A., Sankaranarayanan, A., Shao, S., Stolfo, A., Tutek, M., Zur, A., Bau, D., & Belinkov, Y. (2025). MIB: A Mechanistic Interpretability Benchmark. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:45069-45108. Available from https://proceedings.mlr.press/v267/mueller25a.html.