Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:1616-1625, 2025.
Abstract
Transformers have profoundly influenced AI research, but explaining their decisions remains challenging, even for relatively simple tasks such as classification, which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that Contrast-CAT consistently outperforms state-of-the-art methods. Notably, under the MoRF setting, it achieves average improvements of $\times 1.30$ in AOPC and $\times 2.25$ in LOdds over the most competitive methods, demonstrating its effectiveness in enhancing interpretability for transformer-based text classification.
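The abstract's core idea of contrasting an input's activations against reference activations can be illustrated with a minimal sketch. This is not the paper's exact algorithm: the mean-reference subtraction, the gradient weighting, and the function and argument names (`contrast_attributions`, `input_acts`, `reference_acts`, `grads`) are all illustrative assumptions about one plausible way to filter class-irrelevant features out of token-level attributions.

```python
import numpy as np

def contrast_attributions(input_acts, reference_acts, grads):
    """Hedged sketch of activation contrasting (assumed form, not the
    paper's method): subtract the average reference activation to
    suppress class-irrelevant features, then form gradient-weighted,
    token-level attributions from what remains.

    input_acts:     (tokens, hidden) activations for the input sequence
    reference_acts: (n_refs, tokens, hidden) activations for references
    grads:          (tokens, hidden) gradients of the target class score
    """
    # Remove features shared with the references (assumed to be
    # class-irrelevant), keeping only input-specific signal.
    contrast = input_acts - reference_acts.mean(axis=0)
    # Weight by gradients and keep positive evidence per token.
    attributions = np.maximum(contrast * grads, 0.0).sum(axis=-1)
    # Normalize to [0, 1] so attribution maps are comparable.
    denom = attributions.max()
    return attributions / denom if denom > 0 else attributions
```

A usage example: with a 5-token input of hidden size 8 and three reference sequences, the function returns one attribution score per token, which could then be used for MoRF-style deletion evaluations such as AOPC and LOdds.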