Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning

Zeyuan Allen-Zhu, Yuanzhi Li
Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:4598-4598, 2023.

Abstract

Deep learning is also known as hierarchical learning, where the learner $\textit{learns}$ to represent a complicated target function by decomposing it into a sequence of simpler functions, thereby reducing sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning $\textit{efficiently}$ and $\textit{automatically}$ by applying stochastic gradient descent (SGD), or its variants, to the training objective. On the conceptual side, we present a theoretical characterization of how certain types of deep (i.e., with super-constantly many layers) neural networks can still be trained sample- and time-efficiently on some hierarchical learning tasks for which no known existing algorithm (including layerwise training, kernel methods, etc.) is efficient. We establish a new principle called “backward feature correction”, where $\textit{the errors in the lower-level features can be automatically corrected when training together with the higher-level layers}$. We believe this is a key reason why deep learning performs deep (hierarchical) learning, as opposed to layerwise learning or simulating some known non-hierarchical method.
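
To make the hierarchical setup concrete, here is a minimal illustrative sketch in assumed notation ($f_\ell$, $\hat{f}_\ell$, $\varepsilon_\ell$ are used for exposition and are not the paper's exact construction). A hierarchical target can be written as a composition of per-layer functions
$$F(x) \;=\; f_L\bigl(f_{L-1}(\cdots f_1(x)\cdots)\bigr),$$
where each $f_\ell$ is simple (say, low-degree) given the features produced by the layer below. Layerwise training learns an approximation $\hat{f}_\ell = f_\ell + \varepsilon_\ell$ and then freezes it, so the per-layer errors $\varepsilon_1, \dots, \varepsilon_{L-1}$ compound with depth. Under backward feature correction, layer $\ell$ keeps training jointly with the layers above it: the higher layers absorb part of the residual, and the resulting gradient signal pushes $\hat{f}_\ell$ back toward $f_\ell$, so $\varepsilon_\ell$ shrinks during training instead of being locked in.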

Cite this Paper


BibTeX
@InProceedings{pmlr-v195-allen-zhu23a,
  title     = {Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning},
  author    = {Allen-Zhu, Zeyuan and Li, Yuanzhi},
  booktitle = {Proceedings of Thirty Sixth Conference on Learning Theory},
  pages     = {4598--4598},
  year      = {2023},
  editor    = {Neu, Gergely and Rosasco, Lorenzo},
  volume    = {195},
  series    = {Proceedings of Machine Learning Research},
  month     = {12--15 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v195/allen-zhu23a/allen-zhu23a.pdf},
  url       = {https://proceedings.mlr.press/v195/allen-zhu23a.html},
  abstract  = {Deep learning is also known as hierarchical learning, where the learner $\textit{learns}$ to represent a complicated target function by decomposing it into a sequence of simpler functions, thereby reducing sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning $\textit{efficiently}$ and $\textit{automatically}$ by applying stochastic gradient descent (SGD), or its variants, to the training objective. On the conceptual side, we present a theoretical characterization of how certain types of deep (i.e., with super-constantly many layers) neural networks can still be trained sample- and time-efficiently on some hierarchical learning tasks for which no known existing algorithm (including layerwise training, kernel methods, etc.) is efficient. We establish a new principle called ``backward feature correction'', where \emph{the errors in the lower-level features can be automatically corrected when training together with the higher-level layers}. We believe this is a key reason why deep learning performs deep (hierarchical) learning, as opposed to layerwise learning or simulating some known non-hierarchical method.}
}
Endnote
%0 Conference Paper
%T Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning
%A Zeyuan Allen-Zhu
%A Yuanzhi Li
%B Proceedings of Thirty Sixth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Gergely Neu
%E Lorenzo Rosasco
%F pmlr-v195-allen-zhu23a
%I PMLR
%P 4598--4598
%U https://proceedings.mlr.press/v195/allen-zhu23a.html
%V 195
%X Deep learning is also known as hierarchical learning, where the learner learns to represent a complicated target function by decomposing it into a sequence of simpler functions, thereby reducing sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning efficiently and automatically by applying stochastic gradient descent (SGD), or its variants, to the training objective. On the conceptual side, we present a theoretical characterization of how certain types of deep (i.e., with super-constantly many layers) neural networks can still be trained sample- and time-efficiently on some hierarchical learning tasks for which no known existing algorithm (including layerwise training, kernel methods, etc.) is efficient. We establish a new principle called “backward feature correction”, where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers. We believe this is a key reason why deep learning performs deep (hierarchical) learning, as opposed to layerwise learning or simulating some known non-hierarchical method.
APA
Allen-Zhu, Z. & Li, Y. (2023). Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning. Proceedings of Thirty Sixth Conference on Learning Theory, in Proceedings of Machine Learning Research 195:4598-4598. Available from https://proceedings.mlr.press/v195/allen-zhu23a.html.
