From Neurons to Neutrons: A Case Study in Interpretability

Ouail Kitouni, Niklas Nolte, Víctor Samuel Pérez-Díaz, Sokratis Trifinopoulos, Mike Williams
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:24726-24748, 2024.

Abstract

Mechanistic Interpretability (MI) proposes a path toward fully understanding how neural networks make their predictions. Prior work demonstrates that even when trained to perform simple arithmetic, models can implement a variety of algorithms (sometimes concurrently) depending on initialization and hyperparameters. Does this mean neuron-level interpretability techniques have limited applicability? Here, we argue that high-dimensional neural networks can learn useful low-dimensional representations of the data they were trained on, going beyond simply making good predictions: Such representations can be understood with the MI lens and provide insights that are surprisingly faithful to human-derived domain knowledge. This indicates that such approaches to interpretability can be useful for deriving a new understanding of a problem from models trained to solve it. As a case study, we extract nuclear physics concepts by studying models trained to reproduce nuclear data.
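
For readers who want a concrete picture of the recipe the abstract alludes to (training a model on tabular nuclear-style data and then inspecting a low-dimensional projection of its learned embeddings), below is a minimal illustrative sketch in Python/PyTorch. The synthetic data, the embedding-plus-MLP architecture, and the PCA projection are assumptions made purely for illustration; they are not the authors' exact models or analyses, which are described in the paper itself.

# Illustrative sketch only (not the paper's setup): train a tiny embedding model
# on synthetic (Z, N) -> target data, then inspect a 2D projection of the
# learned proton-number embeddings for ordered, interpretable structure.
import torch
import torch.nn as nn

torch.manual_seed(0)
Z_MAX, N_MAX, DIM = 50, 60, 32

# Synthetic stand-in for a nuclear-data table: target varies smoothly with (Z, N).
Z = torch.arange(Z_MAX).repeat_interleave(N_MAX)
N = torch.arange(N_MAX).repeat(Z_MAX)
y = (torch.sin(0.3 * Z.float()) + torch.cos(0.2 * N.float())).unsqueeze(1)

class EmbeddingRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.z_emb = nn.Embedding(Z_MAX, DIM)
        self.n_emb = nn.Embedding(N_MAX, DIM)
        self.head = nn.Sequential(nn.Linear(2 * DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, z, n):
        return self.head(torch.cat([self.z_emb(z), self.n_emb(n)], dim=-1))

model = EmbeddingRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(1000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(Z, N), y)
    loss.backward()
    opt.step()

# Project the learned proton-number embeddings onto their top-2 principal components.
E = model.z_emb.weight.detach()
E = E - E.mean(dim=0)
_, _, V = torch.pca_lowrank(E, q=2)
coords = E @ V  # shape (Z_MAX, 2); plot or inspect for low-dimensional structure
print(coords[:5])

In this toy example, the interpretability question is whether the 2D coordinates of the learned embeddings organize the inputs in a way a domain expert would recognize; in the paper, analogous analyses of models trained on real nuclear data are used to recover known physics concepts.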

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-kitouni24a,
  title     = {From Neurons to Neutrons: A Case Study in Interpretability},
  author    = {Kitouni, Ouail and Nolte, Niklas and P\'{e}rez-D\'{\i}az, V\'{\i}ctor Samuel and Trifinopoulos, Sokratis and Williams, Mike},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {24726--24748},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/kitouni24a/kitouni24a.pdf},
  url       = {https://proceedings.mlr.press/v235/kitouni24a.html},
  abstract  = {Mechanistic Interpretability (MI) proposes a path toward fully understanding how neural networks make their predictions. Prior work demonstrates that even when trained to perform simple arithmetic, models can implement a variety of algorithms (sometimes concurrently) depending on initialization and hyperparameters. Does this mean neuron-level interpretability techniques have limited applicability? Here, we argue that high-dimensional neural networks can learn useful low-dimensional representations of the data they were trained on, going beyond simply making good predictions: Such representations can be understood with the MI lens and provide insights that are surprisingly faithful to human-derived domain knowledge. This indicates that such approaches to interpretability can be useful for deriving a new understanding of a problem from models trained to solve it. As a case study, we extract nuclear physics concepts by studying models trained to reproduce nuclear data.}
}
APA
Kitouni, O., Nolte, N., Pérez-Díaz, V.S., Trifinopoulos, S. & Williams, M. (2024). From Neurons to Neutrons: A Case Study in Interpretability. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:24726-24748. Available from https://proceedings.mlr.press/v235/kitouni24a.html.