Evaluating Neuron Explanations: A Unified Framework with Sanity Checks

Tuomas Oikarinen, Ge Yan, Tsui-Wei Weng
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:47090-47131, 2025.

Abstract

Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability. This is often done by generating a simple text explanation of an individual neuron's or unit's behavior. For these explanations to be useful, we must understand how reliable and truthful they are. In this work, we unify many existing explanation evaluation methods under one mathematical framework. This allows us to compare and contrast existing evaluation metrics, understand the evaluation pipeline with increased clarity, and apply existing statistical concepts to the evaluation. In addition, we propose two simple sanity checks on the evaluation metrics and show that many commonly used metrics fail these tests, leaving their scores unchanged even after massive changes to the concept labels. Based on our experimental and theoretical results, we propose guidelines that future evaluations should follow and identify a set of reliable evaluation metrics.
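To make the kind of sanity check described above concrete, here is a minimal sketch based only on the abstract: the specific metrics (precision and IoU), the perturbation scheme, and all names below are illustrative assumptions, not the paper's actual tests. The idea is that a trustworthy metric comparing a neuron's binary activations against binary concept labels should change when the labels are massively corrupted; a metric that only looks at inputs where the neuron fires, such as precision, is blind to such changes by construction.

import numpy as np

def precision(act, lab):
    # Fraction of firing inputs that carry the concept label.
    # Depends only on labels where the neuron fires, so it cannot
    # react to label changes on the neuron's silent inputs.
    return np.logical_and(act, lab).sum() / act.sum()

def iou(act, lab):
    # Intersection over union of the two binary vectors;
    # sensitive to label changes anywhere in the dataset.
    return np.logical_and(act, lab).sum() / np.logical_or(act, lab).sum()

rng = np.random.default_rng(0)
n = 10_000
labels = rng.random(n) < 0.1   # hypothetical binary concept labels
acts = labels.copy()           # a neuron that fires exactly on the concept

# "Massive change": add the concept label to half of the inputs
# where the neuron never fires.
corrupted = labels.copy()
flip = ~acts & (rng.random(n) < 0.5)
corrupted[flip] = True

for name, metric in [("precision", precision), ("IoU", iou)]:
    print(f"{name}: clean={metric(acts, labels):.3f} "
          f"corrupted={metric(acts, corrupted):.3f}")

Running this, precision stays at 1.000 despite thousands of corrupted labels, while IoU drops sharply, which is the failure mode a sanity check of this kind is meant to expose.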

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-oikarinen25a,
  title     = {Evaluating Neuron Explanations: A Unified Framework with Sanity Checks},
  author    = {Oikarinen, Tuomas and Yan, Ge and Weng, Tsui-Wei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {47090--47131},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/oikarinen25a/oikarinen25a.pdf},
  url       = {https://proceedings.mlr.press/v267/oikarinen25a.html}
}
Endnote
%0 Conference Paper
%T Evaluating Neuron Explanations: A Unified Framework with Sanity Checks
%A Tuomas Oikarinen
%A Ge Yan
%A Tsui-Wei Weng
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-oikarinen25a
%I PMLR
%P 47090--47131
%U https://proceedings.mlr.press/v267/oikarinen25a.html
%V 267
APA
Oikarinen, T., Yan, G. & Weng, T. (2025). Evaluating Neuron Explanations: A Unified Framework with Sanity Checks. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:47090-47131. Available from https://proceedings.mlr.press/v267/oikarinen25a.html.