Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers

Sungmin Han, Jeonghyun Lee, Sangkyun Lee
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:1616-1625, 2025.

Abstract

Transformers have profoundly influenced AI research, but explaining their decisions remains challenging, even for relatively simple tasks such as classification, which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that Contrast-CAT consistently outperforms state-of-the-art methods. Notably, under the MoRF setting, it achieves average improvements of $\times 1.30$ in AOPC and $\times 2.25$ in LOdds over the strongest competing methods, demonstrating its effectiveness in enhancing interpretability for transformer-based text classification.
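The paper's actual algorithm is not reproduced on this page, but the core idea the abstract describes — contrasting an input's activations against reference activations to suppress class-irrelevant features before computing token attributions — can be sketched roughly as follows. All names, shapes, and the choice of a mean reference are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def contrast_attributions(input_acts, reference_acts):
    """Illustrative sketch (not the paper's method): token attributions
    obtained by contrasting input activations against a reference.

    input_acts: (num_tokens, hidden_dim) activations for the input sequence.
    reference_acts: (num_refs, hidden_dim) activations assumed to capture
        class-irrelevant features (the choice of reference is hypothetical).
    """
    # Summarize the references; subtracting it removes features shared
    # with the (class-irrelevant) reference activations.
    reference = reference_acts.mean(axis=0)          # (hidden_dim,)
    contrast = input_acts - reference                # (num_tokens, hidden_dim)
    # Keep only positive (class-supporting) contrast and reduce per token.
    scores = np.maximum(contrast, 0.0).sum(axis=1)   # (num_tokens,)
    total = scores.sum()
    return scores / total if total > 0 else scores   # normalized attributions

# Toy usage with random activations standing in for a transformer layer.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8))    # 5 tokens, hidden dim 8
refs = rng.normal(size=(3, 8))    # 3 reference sequences (pooled)
attr = contrast_attributions(acts, refs)
```

The resulting `attr` vector is a nonnegative, normalized per-token relevance map; the paper's contribution lies in how the reference activations are chosen and how the contrast is combined across layers, which this toy sketch does not attempt to capture.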

Cite this Paper


BibTeX
@InProceedings{pmlr-v286-han25a, title = {Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers}, author = {Han, Sungmin and Lee, Jeonghyun and Lee, Sangkyun}, booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence}, pages = {1616--1625}, year = {2025}, editor = {Chiappa, Silvia and Magliacane, Sara}, volume = {286}, series = {Proceedings of Machine Learning Research}, month = {21--25 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v286/main/assets/han25a/han25a.pdf}, url = {https://proceedings.mlr.press/v286/han25a.html}, abstract = {Transformers have profoundly influenced AI research, but explaining their decisions remains challenging – even for relatively simpler tasks such as classification – which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that Contrast-CAT consistently outperforms state-of-the-art methods. Notably, under the MoRF setting, it achieves average improvements of $\times 1.30$ in AOPC and $\times 2.25$ in LOdds over the most competing methods, demonstrating its effectiveness in enhancing interpretability for transformer-based text classification.} }
Endnote
%0 Conference Paper %T Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers %A Sungmin Han %A Jeonghyun Lee %A Sangkyun Lee %B Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence %C Proceedings of Machine Learning Research %D 2025 %E Silvia Chiappa %E Sara Magliacane %F pmlr-v286-han25a %I PMLR %P 1616--1625 %U https://proceedings.mlr.press/v286/han25a.html %V 286 %X Transformers have profoundly influenced AI research, but explaining their decisions remains challenging – even for relatively simpler tasks such as classification – which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that Contrast-CAT consistently outperforms state-of-the-art methods. Notably, under the MoRF setting, it achieves average improvements of $\times 1.30$ in AOPC and $\times 2.25$ in LOdds over the most competing methods, demonstrating its effectiveness in enhancing interpretability for transformer-based text classification.
APA
Han, S., Lee, J. & Lee, S. (2025). Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers. Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 286:1616-1625. Available from https://proceedings.mlr.press/v286/han25a.html.