Adapting Graph-Based Analysis for Knowledge Extraction from Transformer Models
Proceedings of The 19th International Conference on Neurosymbolic Learning and Reasoning, PMLR 284:1-14, 2025.
Abstract
Transformer models, despite their exceptional capabilities in Natural Language Processing (NLP) and Vision tasks, often function as "black boxes": like other deep neural network models, their internal processes remain largely opaque due to their complex architectures. This work extends graph-based knowledge extraction techniques, previously applied to CNNs, to the domain of Transformer models. The inner mechanics of Transformer models are explored by constructing a co-activation graph from their encoder layers. The nodes of the graph represent the hidden units within each encoder layer, while the edges represent the statistical correlations between these hidden units. The magnitude of co-activation, i.e., the correlation between the activations of two hidden units, determines the strength of their connection within the graph. Our research focuses on encoder-only Transformer classifiers. We conducted experiments involving a custom-built Transformer and a pre-trained BERT model on an NLP task. We used graph analysis to detect semantically related class clusters and to examine their impact on misclassification patterns. We demonstrate a positive correlation between class similarity and the frequency of classification errors. Our findings suggest that co-activation graphs reveal structured, interpretable representations in Transformers, consistent with prior findings on knowledge extraction from CNNs.
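To make the graph construction described above concrete, the following is a minimal sketch of how a co-activation graph might be built from recorded hidden-unit activations. The abstract only states that edges encode statistical correlations between hidden units; the use of Pearson correlation over a probe set, the absolute-value edge weights, the thresholding step, and the function name and shapes are illustrative assumptions, not the paper's specified procedure.

```python
import numpy as np


def build_coactivation_graph(activations, threshold=0.3):
    """Sketch of co-activation graph construction (assumed procedure).

    activations: array of shape (num_samples, num_units), where each column
    holds one hidden unit's activation across a set of probe inputs
    (e.g., collected from an encoder layer during a forward pass).
    Returns a weighted adjacency matrix; weak correlations are zeroed out.
    """
    # Pearson correlation between every pair of hidden units (columns).
    corr = np.corrcoef(activations, rowvar=False)
    # Edge strength = magnitude of co-activation between the two units.
    adj = np.abs(corr)
    np.fill_diagonal(adj, 0.0)      # no self-loops
    adj[adj < threshold] = 0.0      # keep only sufficiently strong edges
    return adj


# Toy usage with random data standing in for real encoder activations.
rng = np.random.default_rng(0)
acts = rng.standard_normal((1000, 64))   # 1000 probe inputs, 64 hidden units
graph = build_coactivation_graph(acts)
print(graph.shape)                        # (64, 64) adjacency matrix
```

The resulting weighted adjacency matrix could then be handed to standard graph-analysis tooling (e.g., community detection) to look for the semantically related class clusters the abstract refers to.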