Towards Global-level Mechanistic Interpretability: A Perspective of Modular Circuits of Large Language Models

Yinhan He, Wendy Zheng, Yushun Dong, Yaochen Zhu, Chen Chen, Jundong Li
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:22865-22880, 2025.

Abstract

Mechanistic interpretability (MI) research aims to understand large language models (LLMs) by identifying computational circuits, subgraphs of model components with associated functional interpretations, that explain specific behaviors. Current MI approaches focus on discovering task-specific circuits, which has two key limitations: (1) poor generalizability across different language tasks, and (2) high costs associated with requiring human or advanced LLM interpretation of each computational node. To address these challenges, we propose developing a “modular circuit (MC) vocabulary” consisting of task-agnostic functional units. Each unit consists of a small computational subgraph with its interpretation. This approach enables global interpretability by allowing different language tasks to share common MCs, while reducing costs by reusing established interpretations for new tasks. We establish five criteria for characterizing the MC vocabulary and present ModCirc, a novel global-level mechanistic interpretability framework for discovering MC vocabularies in LLMs. We demonstrate ModCirc’s effectiveness by showing that it can identify modular circuits that perform well on various metrics.
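To make the vocabulary idea concrete, the following is a minimal, hypothetical sketch of how a modular-circuit vocabulary could be represented as a data structure. All names (ModularCircuit, MCVocabulary, explain_task_circuit) and the containment-based lookup are illustrative assumptions by the editor, not the ModCirc framework's actual implementation, which is described in the paper.

# Illustrative sketch only: class/field names and the lookup logic are
# hypothetical, not ModCirc's actual algorithm.
from dataclasses import dataclass, field

@dataclass
class ModularCircuit:
    """A task-agnostic functional unit: a small computational subgraph of
    model components together with a reusable interpretation of its function."""
    name: str
    nodes: frozenset[str]                 # e.g. attention heads or MLP neurons, "L3.H7"
    edges: frozenset[tuple[str, str]]     # directed edges within the subgraph
    interpretation: str                   # established once, then reused across tasks

@dataclass
class MCVocabulary:
    """A shared vocabulary of modular circuits. A task-specific circuit can then
    be explained by composing vocabulary entries instead of interpreting every
    computational node from scratch."""
    entries: dict[str, ModularCircuit] = field(default_factory=dict)

    def add(self, mc: ModularCircuit) -> None:
        self.entries[mc.name] = mc

    def explain_task_circuit(self, task_nodes: set[str]) -> list[ModularCircuit]:
        """Return vocabulary entries whose subgraphs lie inside a discovered
        task circuit, reusing their existing interpretations."""
        return [mc for mc in self.entries.values() if mc.nodes <= task_nodes]

In this sketch, interpretability cost is amortized because each ModularCircuit is interpreted once and then shared by every task circuit that contains it.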

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-he25x,
  title     = {Towards Global-level Mechanistic Interpretability: A Perspective of Modular Circuits of Large Language Models},
  author    = {He, Yinhan and Zheng, Wendy and Dong, Yushun and Zhu, Yaochen and Chen, Chen and Li, Jundong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {22865--22880},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/he25x/he25x.pdf},
  url       = {https://proceedings.mlr.press/v267/he25x.html},
  abstract  = {Mechanistic interpretability (MI) research aims to understand large language models (LLMs) by identifying computational circuits, subgraphs of model components with associated functional interpretations, that explain specific behaviors. Current MI approaches focus on discovering task-specific circuits, which has two key limitations: (1) poor generalizability across different language tasks, and (2) high costs associated with requiring human or advanced LLM interpretation of each computational node. To address these challenges, we propose developing a “modular circuit (MC) vocabulary” consisting of task-agnostic functional units. Each unit consists of a small computational subgraph with its interpretation. This approach enables global interpretability by allowing different language tasks to share common MCs, while reducing costs by reusing established interpretations for new tasks. We establish five criteria for characterizing the MC vocabulary and present ModCirc, a novel global-level mechanistic interpretability framework for discovering MC vocabularies in LLMs. We demonstrate ModCirc’s effectiveness by showing that it can identify modular circuits that perform well on various metrics.}
}
Endnote
%0 Conference Paper
%T Towards Global-level Mechanistic Interpretability: A Perspective of Modular Circuits of Large Language Models
%A Yinhan He
%A Wendy Zheng
%A Yushun Dong
%A Yaochen Zhu
%A Chen Chen
%A Jundong Li
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-he25x
%I PMLR
%P 22865--22880
%U https://proceedings.mlr.press/v267/he25x.html
%V 267
%X Mechanistic interpretability (MI) research aims to understand large language models (LLMs) by identifying computational circuits, subgraphs of model components with associated functional interpretations, that explain specific behaviors. Current MI approaches focus on discovering task-specific circuits, which has two key limitations: (1) poor generalizability across different language tasks, and (2) high costs associated with requiring human or advanced LLM interpretation of each computational node. To address these challenges, we propose developing a “modular circuit (MC) vocabulary” consisting of task-agnostic functional units. Each unit consists of a small computational subgraph with its interpretation. This approach enables global interpretability by allowing different language tasks to share common MCs, while reducing costs by reusing established interpretations for new tasks. We establish five criteria for characterizing the MC vocabulary and present ModCirc, a novel global-level mechanistic interpretability framework for discovering MC vocabularies in LLMs. We demonstrate ModCirc’s effectiveness by showing that it can identify modular circuits that perform well on various metrics.
APA
He, Y., Zheng, W., Dong, Y., Zhu, Y., Chen, C., & Li, J. (2025). Towards Global-level Mechanistic Interpretability: A Perspective of Modular Circuits of Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:22865-22880. Available from https://proceedings.mlr.press/v267/he25x.html.