Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation

Floris Holstege, Bram Wouters, Noud Van Giersbergen, Cees Diks
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:18568-18610, 2024.

Abstract

An important challenge in the field of interpretable machine learning is to ensure that deep neural networks (DNNs) use the correct or desirable input features in performing their tasks. Concept-removal methods aim to do this by eliminating concepts that are spuriously correlated with the main task from the neural network representation of the data. However, existing methods tend to be overzealous by inadvertently removing part of the correct or desirable features as well, leading to wrong interpretations and hurting model performance. We propose an iterative algorithm that separates spurious from main-task concepts by jointly estimating two low-dimensional orthogonal subspaces of the neural network representation. By evaluating the algorithm on benchmark datasets from computer vision (Waterbirds, CelebA) and natural language processing (MultiNLI), we show it outperforms existing concept-removal methods in terms of identifying the main-task and spurious concepts, and removing only the latter.
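As a rough illustration of what "removing only the latter" can mean for a linear subspace, the sketch below projects representations onto the orthogonal complement of a spurious-concept subspace and checks that an orthogonal main-task subspace is left untouched. This is not the paper's iterative estimation algorithm: the bases `v_sp` and `v_mt` are hypothetical, randomly generated stand-ins for subspaces that would have to be estimated, and `remove_subspace` is an assumed helper name.

```python
import numpy as np


def remove_subspace(z, v_sp):
    """Project rows of z (n, d) onto the orthogonal complement of span(v_sp) (d, k)."""
    # Orthonormalize the basis so the projection is valid even if v_sp is not orthonormal.
    q, _ = np.linalg.qr(v_sp)
    return z - (z @ q) @ q.T


rng = np.random.default_rng(0)
d = 8

# Hypothetical orthonormal directions: the first two play the role of the
# spurious-concept subspace, the next two the main-task subspace.
basis, _ = np.linalg.qr(rng.normal(size=(d, 4)))
v_sp, v_mt = basis[:, :2], basis[:, 2:]

z = rng.normal(size=(100, d))        # stand-in for network representations
z_clean = remove_subspace(z, v_sp)   # spurious directions projected out
```

Because the two subspaces are orthogonal in this toy setup, the projection zeroes the spurious coordinates while the main-task coordinates of `z_clean` match those of `z`.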

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-holstege24a,
  title = {Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation},
  author = {Holstege, Floris and Wouters, Bram and Giersbergen, Noud Van and Diks, Cees},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {18568--18610},
  year = {2024},
  editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  month = {21--27 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/holstege24a/holstege24a.pdf},
  url = {https://proceedings.mlr.press/v235/holstege24a.html},
  abstract = {An important challenge in the field of interpretable machine learning is to ensure that deep neural networks (DNNs) use the correct or desirable input features in performing their tasks. Concept-removal methods aim to do this by eliminating concepts that are spuriously correlated with the main task from the neural network representation of the data. However, existing methods tend to be overzealous by inadvertently removing part of the correct or desirable features as well, leading to wrong interpretations and hurting model performance. We propose an iterative algorithm that separates spurious from main-task concepts by jointly estimating two low-dimensional orthogonal subspaces of the neural network representation. By evaluating the algorithm on benchmark datasets from computer vision (Waterbirds, CelebA) and natural language processing (MultiNLI), we show it outperforms existing concept-removal methods in terms of identifying the main-task and spurious concepts, and removing only the latter.}
}
Endnote
%0 Conference Paper
%T Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation
%A Floris Holstege
%A Bram Wouters
%A Noud Van Giersbergen
%A Cees Diks
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-holstege24a
%I PMLR
%P 18568--18610
%U https://proceedings.mlr.press/v235/holstege24a.html
%V 235
%X An important challenge in the field of interpretable machine learning is to ensure that deep neural networks (DNNs) use the correct or desirable input features in performing their tasks. Concept-removal methods aim to do this by eliminating concepts that are spuriously correlated with the main task from the neural network representation of the data. However, existing methods tend to be overzealous by inadvertently removing part of the correct or desirable features as well, leading to wrong interpretations and hurting model performance. We propose an iterative algorithm that separates spurious from main-task concepts by jointly estimating two low-dimensional orthogonal subspaces of the neural network representation. By evaluating the algorithm on benchmark datasets from computer vision (Waterbirds, CelebA) and natural language processing (MultiNLI), we show it outperforms existing concept-removal methods in terms of identifying the main-task and spurious concepts, and removing only the latter.
APA
Holstege, F., Wouters, B., Giersbergen, N.V. & Diks, C. (2024). Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:18568-18610. Available from https://proceedings.mlr.press/v235/holstege24a.html.
