Using Sliced Mutual Information to Study Memorization and Generalization in Deep Neural Networks

Shelvia Wongso, Rohan Ghosh, Mehul Motani
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:11608-11629, 2023.

Abstract

In this paper, we study the memorization and generalization behaviour of deep neural networks (DNNs) using sliced mutual information (SMI), which is the average of the mutual information (MI) between one-dimensional random projections. We argue that the SMI between features in a DNN ($T$) and ground truth labels ($Y$), $SI(T;Y)$, can be seen as a form of usable information that the features contain about the labels. We show theoretically that $SI(T;Y)$ can encode geometric properties of the feature distribution, such as its spherical soft-margin and intrinsic dimensionality, in a way that MI cannot. Additionally, we present empirical evidence showing how $SI(T;Y)$ can capture memorization and generalization in DNNs. In particular, we find that, in the presence of label noise, all layers start to memorize but the earlier layers stabilize more quickly than the deeper layers. Finally, we point out that, in the context of Bayesian Neural Networks, the SMI between the penultimate layer and the output represents the worst case uncertainty of the network’s output.
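
As a concrete illustration of the quantity the abstract describes, the sketch below estimates $SI(T;Y)$ by Monte Carlo: draw random unit directions, project the features $T$ onto each direction, estimate the MI between each one-dimensional projection and the labels $Y$ with an off-the-shelf nonparametric estimator, and average the results. This is a minimal sketch under assumptions, not the estimator used in the paper: the function name, the number of projections, and the use of scikit-learn's mutual_info_classif (a k-NN based MI estimator for discrete targets) are illustrative choices, and only the features are sliced here since the labels are already discrete, which may differ from the exact slicing the authors use.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def sliced_mutual_information(T, y, n_projections=200, random_state=0):
    """Monte Carlo estimate of SI(T; Y): the average MI between random
    one-dimensional projections of the features T and the labels y.

    T : (n_samples, d) array of features (e.g. activations of a DNN layer)
    y : (n_samples,) array of discrete class labels
    """
    rng = np.random.default_rng(random_state)
    d = T.shape[1]
    mi_values = []
    for _ in range(n_projections):
        # Draw a direction uniformly from the unit sphere in R^d.
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        # Project the features onto this direction and estimate the MI between
        # the 1-D projection and the labels with a k-NN based estimator.
        proj = T @ theta
        mi = mutual_info_classif(proj.reshape(-1, 1), y,
                                 random_state=random_state)[0]
        mi_values.append(mi)
    return float(np.mean(mi_values))

For example, T could be the flattened activations of a given layer on a held-out batch; tracking such an estimate across layers and training epochs under label noise is one way the memorization and stabilization behaviour described above could be probed, though the paper's own experimental setup may differ.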

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-wongso23a,
  title     = {Using Sliced Mutual Information to Study Memorization and Generalization in Deep Neural Networks},
  author    = {Wongso, Shelvia and Ghosh, Rohan and Motani, Mehul},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {11608--11629},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/wongso23a/wongso23a.pdf},
  url       = {https://proceedings.mlr.press/v206/wongso23a.html},
  abstract  = {In this paper, we study the memorization and generalization behaviour of deep neural networks (DNNs) using sliced mutual information (SMI), which is the average of the mutual information (MI) between one-dimensional random projections. We argue that the SMI between features in a DNN ($T$) and ground truth labels ($Y$), $SI(T;Y)$, can be seen as a form of usable information that the features contain about the labels. We show theoretically that $SI(T;Y)$ can encode geometric properties of the feature distribution, such as its spherical soft-margin and intrinsic dimensionality, in a way that MI cannot. Additionally, we present empirical evidence showing how $SI(T;Y)$ can capture memorization and generalization in DNNs. In particular, we find that, in the presence of label noise, all layers start to memorize but the earlier layers stabilize more quickly than the deeper layers. Finally, we point out that, in the context of Bayesian Neural Networks, the SMI between the penultimate layer and the output represents the worst case uncertainty of the network’s output.}
}
Endnote
%0 Conference Paper
%T Using Sliced Mutual Information to Study Memorization and Generalization in Deep Neural Networks
%A Shelvia Wongso
%A Rohan Ghosh
%A Mehul Motani
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-wongso23a
%I PMLR
%P 11608--11629
%U https://proceedings.mlr.press/v206/wongso23a.html
%V 206
%X In this paper, we study the memorization and generalization behaviour of deep neural networks (DNNs) using sliced mutual information (SMI), which is the average of the mutual information (MI) between one-dimensional random projections. We argue that the SMI between features in a DNN ($T$) and ground truth labels ($Y$), $SI(T;Y)$, can be seen as a form of usable information that the features contain about the labels. We show theoretically that $SI(T;Y)$ can encode geometric properties of the feature distribution, such as its spherical soft-margin and intrinsic dimensionality, in a way that MI cannot. Additionally, we present empirical evidence showing how $SI(T;Y)$ can capture memorization and generalization in DNNs. In particular, we find that, in the presence of label noise, all layers start to memorize but the earlier layers stabilize more quickly than the deeper layers. Finally, we point out that, in the context of Bayesian Neural Networks, the SMI between the penultimate layer and the output represents the worst case uncertainty of the network’s output.
APA
Wongso, S., Ghosh, R. & Motani, M. (2023). Using Sliced Mutual Information to Study Memorization and Generalization in Deep Neural Networks. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:11608-11629. Available from https://proceedings.mlr.press/v206/wongso23a.html.
