Robust Models Are More Interpretable Because Attributions Look Normal

Zifan Wang, Matt Fredrikson, Anupam Datta
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:22625-22651, 2022.

Abstract

Recent work has found that adversarially-robust deep networks used for image classification are more interpretable: their feature attributions tend to be sharper, and are more concentrated on the objects associated with the image’s ground-truth class. We show that smooth decision boundaries play an important role in this enhanced interpretability, as the model’s input gradients around data points will more closely align with boundaries’ normal vectors when they are smooth. Thus, because robust models have smoother boundaries, the results of gradient-based attribution methods, like Integrated Gradients and DeepLift, will capture more accurate information about nearby decision boundaries. This understanding of robust interpretability leads to our second contribution: boundary attributions, which aggregate information about the normal vectors of local decision boundaries to explain a classification outcome. We show that by leveraging the key factors underpinning robust interpretability, boundary attributions produce sharper, more concentrated visual explanations, even on non-robust models.
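
To make the notion of a gradient-based attribution concrete, the following is a minimal sketch of Integrated Gradients, one of the methods the abstract refers to, written against a generic PyTorch classifier. The names model, baseline, and steps are illustrative assumptions; this is not the authors' implementation, only a standard illustration of how such an attribution is built from input gradients along a path to a baseline.

import torch

def integrated_gradients(model, x, target, baseline=None, steps=32):
    # Approximate Integrated Gradients for a single input x (e.g. an image
    # tensor of shape (C, H, W)): average the gradient of the target logit
    # along a straight-line path from a baseline to x, then scale elementwise
    # by (x - baseline).
    if baseline is None:
        baseline = torch.zeros_like(x)            # common choice: all-zero baseline
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)     # (steps, C, H, W) batch of path points
    path.requires_grad_(True)
    logits = model(path)                          # (steps, num_classes)
    score = logits[:, target].sum()               # sum -> per-point gradients in one backward pass
    grads, = torch.autograd.grad(score, path)
    avg_grad = grads.mean(dim=0)                  # Riemann approximation of the path integral
    return (x - baseline) * avg_grad              # attribution has the same shape as x

The paper's argument is that when decision boundaries are smooth, as in adversarially-trained networks, the input gradients averaged here align closely with the normal vectors of nearby boundaries, which is why these attributions look sharper; the proposed boundary attributions instead aggregate information about those boundary normals directly, so the effect carries over even to non-robust models.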

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-wang22e,
  title     = {Robust Models Are More Interpretable Because Attributions Look Normal},
  author    = {Wang, Zifan and Fredrikson, Matt and Datta, Anupam},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {22625--22651},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/wang22e/wang22e.pdf},
  url       = {https://proceedings.mlr.press/v162/wang22e.html},
  abstract  = {Recent work has found that adversarially-robust deep networks used for image classification are more interpretable: their feature attributions tend to be sharper, and are more concentrated on the objects associated with the image’s ground-truth class. We show that smooth decision boundaries play an important role in this enhanced interpretability, as the model’s input gradients around data points will more closely align with boundaries’ normal vectors when they are smooth. Thus, because robust models have smoother boundaries, the results of gradient-based attribution methods, like Integrated Gradients and DeepLift, will capture more accurate information about nearby decision boundaries. This understanding of robust interpretability leads to our second contribution: boundary attributions, which aggregate information about the normal vectors of local decision boundaries to explain a classification outcome. We show that by leveraging the key factors underpinning robust interpretability, boundary attributions produce sharper, more concentrated visual explanations, even on non-robust models.}
}
Endnote
%0 Conference Paper
%T Robust Models Are More Interpretable Because Attributions Look Normal
%A Zifan Wang
%A Matt Fredrikson
%A Anupam Datta
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-wang22e
%I PMLR
%P 22625--22651
%U https://proceedings.mlr.press/v162/wang22e.html
%V 162
%X Recent work has found that adversarially-robust deep networks used for image classification are more interpretable: their feature attributions tend to be sharper, and are more concentrated on the objects associated with the image’s ground-truth class. We show that smooth decision boundaries play an important role in this enhanced interpretability, as the model’s input gradients around data points will more closely align with boundaries’ normal vectors when they are smooth. Thus, because robust models have smoother boundaries, the results of gradient-based attribution methods, like Integrated Gradients and DeepLift, will capture more accurate information about nearby decision boundaries. This understanding of robust interpretability leads to our second contribution: boundary attributions, which aggregate information about the normal vectors of local decision boundaries to explain a classification outcome. We show that by leveraging the key factors underpinning robust interpretability, boundary attributions produce sharper, more concentrated visual explanations, even on non-robust models.
APA
Wang, Z., Fredrikson, M. & Datta, A. (2022). Robust Models Are More Interpretable Because Attributions Look Normal. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:22625-22651. Available from https://proceedings.mlr.press/v162/wang22e.html.