Jumpout : Improved Dropout for Deep Neural Networks with ReLUs
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:6668-6676, 2019.
We discuss three novel insights about dropout for DNNs with ReLUs: 1) dropout encourages each local linear piece of a DNN to be trained on data points from nearby regions; 2) the same dropout rate results in different (effective) deactivation rates for layers with different portions of ReLU-deactivated neurons; and 3) the rescaling factor of dropout causes a normalization inconsistency between training and test when used together with batch normalization. The above leads to three simple but nontrivial modifications resulting in our method “jumpout.” Jumpout samples the dropout rate from a monotone decreasing distribution (e.g., the right half of a Gaussian), so each local linear piece is trained, with high probability, to work better for data points from nearby than more distant regions. Jumpout moreover adaptively normalizes the dropout rate at each layer and every training batch, so the effective deactivation rate on the activated neurons is kept the same. Furthermore, it rescales the outputs for a better trade-off that keeps both the variance and mean of neurons more consistent between training and test phases, thereby mitigating the incompatibility between dropout and batch normalization. Jumpout significantly improves the performance of different neural nets on CIFAR10, CIFAR100, Fashion-MNIST, STL10, SVHN, ImageNet-1k, etc., while introducing negligible additional memory and computation costs.