# Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning

*Proceedings of Thirty Sixth Conference on Learning Theory*, PMLR 195:4598-4598, 2023.

#### Abstract

Deep learning is also known as hierarchical learning, where the learner $\textit{learns}$ to represent a complicated target function by decomposing it into a sequence of simpler functions, reducing sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning $\textit{efficiently}$ and $\textit{automatically}$ by applying stochastic gradient descent (SGD) or its variants to the training objective. On the conceptual side, we present a theoretical characterization of how certain types of deep (i.e., with super-constantly many layers) neural networks can still be sample- and time-efficiently trained on some hierarchical learning tasks for which no known existing algorithm (including layerwise training, kernel methods, etc.) is efficient. We establish a new principle called “backward feature correction”, where $\textit{the errors in the lower-level features can be automatically corrected when training together with the higher-level layers}$. We believe this is a key reason why deep learning performs deep (hierarchical) learning, as opposed to layerwise learning or simulating some known non-hierarchical method.
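The contrast the abstract draws can be illustrated with a toy experiment (a minimal sketch, not the paper's construction or proof setting): train a small two-layer network on a hierarchical target, once with the lower layer frozen at its random initialization (a stand-in for layerwise/feature-freezing training) and once jointly, so that the gradient flowing back from the upper layer adjusts, i.e. "corrects", the lower-level features. All dimensions, seeds, and the target function below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, h, n = 5, 16, 256
X = rng.normal(size=(n, d))
# Hierarchical target: a simple function applied on top of intermediate
# features of x (both layers are tanh for simplicity).
G = np.tanh(X @ rng.normal(size=(d, h)))   # "lower-level" features
y = np.tanh(G @ rng.normal(size=(h, 1)))   # "higher-level" combination


def train(joint, steps=2000, lr=0.1):
    """Plain SGD on squared loss; if joint=False the lower layer is frozen."""
    W1 = rng.normal(size=(d, h)) * 0.5     # lower layer (imperfect at init)
    W2 = rng.normal(size=(h, 1)) * 0.5     # upper layer
    for _ in range(steps):
        F = np.tanh(X @ W1)                # learned lower-level features
        err = F @ W2 - y                   # (n, 1) residual
        grad_W2 = F.T @ err / n
        if joint:
            # Backward signal through the upper layer corrects W1's features.
            dF = (err @ W2.T) * (1.0 - F**2)
            W1 -= lr * X.T @ dF / n
        W2 -= lr * grad_W2
    return float(np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))


loss_layerwise = train(joint=False)  # lower-level features never corrected
loss_joint = train(joint=True)       # lower-level features corrected by SGD
print(f"frozen lower layer: {loss_layerwise:.4f}, joint: {loss_joint:.4f}")
```

With the lower layer frozen, the upper layer can only do regression on fixed random features; joint training additionally lets the backward gradient reshape those features, which is the mechanism the "backward feature correction" principle refers to.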