On the Sample Complexity of Two-Layer Networks: Lipschitz Vs. Element-Wise Lipschitz Activation
Proceedings of The 35th International Conference on Algorithmic Learning Theory, PMLR 237:505-517, 2024.
Abstract
This study delves into the sample complexity of two-layer neural networks. For a given reference matrix $W^0 \in \mathbb{R}^{\mathcal{T} \times d}$ (typically representing initial training weights) and an $O(1)$-Lipschitz activation function $\sigma : \mathbb{R} \to \mathbb{R}$, we examine the class $\mathcal{H}_{W^0, B, R, r}^{\sigma} = \{ x \mapsto \langle v, \sigma((W + W^0)x) \rangle : \|v\| \le B, \|W\|_{\text{Frobenius}} \le R, \|x\| \le r \}$. We demonstrate that when $\sigma$ operates element-wise, the sample complexity of $\mathcal{H}_{W^0, B, R, r}^{\sigma}$ is bounded by $\tilde O \left(\frac{L^2 B^2 r^2 (R^2+\|W^0\|^2_{\text{Spectral}})}{\epsilon^2}\right)$. This bound is optimal up to logarithmic factors and depends only logarithmically on the width $\mathcal{T}$. This finding builds upon [Vardi et al., 2022], who established a similar result for $W^0 = 0$. Our motivation stems from the empirical observation that trained weights often remain close to their initial counterparts, implying that $\|W\|_{\text{Frobenius}} \ll \|W+W^0\|_{\text{Frobenius}}$. To arrive at our conclusion, we employ and enhance a recently introduced norm-based bound technique, the Approximate Description Length (ADL), proposed by [Daniely and Granot, 2019]. Finally, our results underline the crucial role of the element-wise nature of $\sigma$ in achieving a bound with logarithmic width dependence. To illustrate, we prove that there exists an $O(1)$-Lipschitz (non-element-wise) activation function $\Psi:\mathbb{R}^{\mathcal{T}}\to\mathbb{R}^{\mathcal{T}}$ for which the sample complexity of $\mathcal{H}_{W^0, B, R, r}^{\Psi}$ grows linearly with the width.
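For concreteness, the following is a minimal NumPy sketch (not from the paper) of a single predictor in $\mathcal{H}_{W^0, B, R, r}^{\sigma}$ with an element-wise activation. All dimensions, norm radii, the initialization of $W^0$, and the choice $\sigma=\tanh$ (which is $1$-Lipschitz) are hypothetical choices made purely for illustration.

```python
# Illustrative sketch only: a member of H^sigma_{W0, B, R, r}, i.e.
# x -> <v, sigma((W + W0) x)> with an element-wise 1-Lipschitz activation.
import numpy as np

T, d = 512, 64           # width T and input dimension d (hypothetical)
B, R, r = 1.0, 1.0, 1.0  # norm bounds on v, W (Frobenius), and x

rng = np.random.default_rng(0)
W0 = rng.standard_normal((T, d)) / np.sqrt(d)   # reference (initial) weights

def project_fro(W, radius):
    """Project W onto the Frobenius-norm ball of the given radius."""
    norm = np.linalg.norm(W)
    return W if norm <= radius else W * (radius / norm)

# A perturbation W with ||W||_Frobenius <= R, so the trained weights
# W + W0 stay close to the initialization W0.
W = project_fro(0.01 * rng.standard_normal((T, d)), R)
v = rng.standard_normal(T)
v *= B / np.linalg.norm(v)                      # ||v|| <= B

def predict(x, sigma=np.tanh):
    """x -> <v, sigma((W + W0) x)> with element-wise sigma."""
    assert np.linalg.norm(x) <= r + 1e-9
    return v @ sigma((W + W0) @ x)

x = rng.standard_normal(d)
x *= r / np.linalg.norm(x)                      # ||x|| <= r
print(predict(x))
```

Replacing the element-wise `sigma` above with a map acting jointly on all $\mathcal{T}$ coordinates corresponds to the non-element-wise case $\Psi:\mathbb{R}^{\mathcal{T}}\to\mathbb{R}^{\mathcal{T}}$, for which the abstract states that the logarithmic width dependence can fail.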