Ensembling With a Fixed Parameter Budget: When Does It Help and Why?
Proceedings of The 13th Asian Conference on Machine Learning, PMLR 157:1176-1191, 2021.
Abstract
Given a fixed parameter budget, one can build a single large neural network or create a memory-split ensemble: a pool of several smaller networks with the same total parameter count as the single network. A memory-split ensemble can outperform its single-model counterpart (Lobacheva et al., 2020), a phenomenon known as the memory-split advantage (MSA). The reasons for the MSA are not yet fully understood; in particular, it is difficult in practice to predict when it will exist. This paper sheds light on the reasons underlying the MSA using random feature theory. We study how the MSA depends on several factors: the parameter budget, the training set size, L2 regularization, and the stochastic gradient descent (SGD) hyper-parameters. Using the bias-variance decomposition, we show that the MSA exists when the reduction in variance due to the ensemble (i.e., the ensemble gain) exceeds the increase in squared bias due to the smaller size of the individual networks (i.e., the shrinkage cost). Taken together, our theoretical results demonstrate that the MSA mainly exists for parameter budgets that are small relative to the training set size, and that memory-splitting can be understood as a form of regularization. Adding other forms of regularization, e.g. L2 regularization, reduces the MSA. Thus, the potential benefit of memory-splitting lies primarily in the possibility of speed-up via parallel computation. Our empirical experiments with deep neural networks and large image datasets show that the MSA is not a general phenomenon, but mainly exists when the number of training iterations is small.
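A minimal sketch of the condition implied by this bias-variance argument (the notation below is ours and illustrative; it assumes the ensemble inherits the bias of a single small network and is not necessarily the paper's exact formulation). Writing Err for expected test error,

\[
\mathrm{Err}_{\mathrm{split}} - \mathrm{Err}_{\mathrm{single}}
  = \underbrace{\bigl(\mathrm{Bias}^2_{\mathrm{small}} - \mathrm{Bias}^2_{\mathrm{large}}\bigr)}_{\text{shrinkage cost}}
  \;-\;
  \underbrace{\bigl(\mathrm{Var}_{\mathrm{single}} - \mathrm{Var}_{\mathrm{split}}\bigr)}_{\text{ensemble gain}},
\]

so the MSA (Err_split < Err_single) holds exactly when the ensemble gain exceeds the shrinkage cost.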