On the Sample Complexity of Two-Layer Networks: Lipschitz Vs. Element-Wise Lipschitz Activation
Proceedings of The 35th International Conference on Algorithmic Learning Theory, PMLR 237:505-517, 2024.
Abstract
This study delves into the sample complexity of two-layer neural networks. For a given reference matrix $W^0 \in \mathbb{R}^{\mathcal{T}\times d}$ (typically representing initial training weights) and an $L$-Lipschitz activation function $\sigma:\mathbb{R}\to\mathbb{R}$ with $L = O(1)$, we examine the class $\mathcal{H}_{W^0, B, R, r}^{\sigma} = \left\{\textbf{x}\mapsto \langle\textbf{v},\sigma((W+W^0)\textbf{x})\rangle: \|W\|_{\text{Frobenius}} \le R, \|\textbf{v}\| \le r, \|\textbf{x}\|\le B\right\}$. We demonstrate that when $\sigma$ is applied element-wise, the sample complexity of $\mathcal{H}_{W^0, B, R, r}^{\sigma}$ is bounded by $\tilde O \left(\frac{L^2 B^2 r^2 (R^2+\|W^0\|^2_{\text{Spectral}})}{\epsilon^2}\right)$. This bound is optimal up to logarithmic factors and depends only logarithmically on the width $\mathcal{T}$. This result builds upon [Vardi et al., 2022], who established a similar bound for $W^0 = 0$. Our motivation stems from the empirical observation that trained weights often remain close to their initialization, so that $\|W\|_{\text{Frobenius}} \ll \|W+W^0\|_{\text{Frobenius}}$. To obtain our result, we employ and refine a recently introduced norm-based bound technique, the Approximate Description Length (ADL), proposed by [Daniely and Granot, 2019]. Finally, our results underline the crucial role of the element-wise nature of $\sigma$ in achieving a bound that depends only logarithmically on the width. To illustrate, we prove that there exists an $O(1)$-Lipschitz (non-element-wise) activation function $\Psi:\mathbb{R}^{\mathcal{T}}\to\mathbb{R}^{\mathcal{T}}$ for which the sample complexity of $\mathcal{H}_{W^0, B, R, r}^{\Psi}$ grows linearly with the width.
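
As a concrete illustration of the function class studied above, the following is a minimal NumPy sketch (not taken from the paper) of a member of $\mathcal{H}_{W^0, B, R, r}^{\sigma}$: it evaluates $\textbf{x}\mapsto \langle\textbf{v},\sigma((W+W^0)\textbf{x})\rangle$ with tanh as one concrete choice of an $O(1)$-Lipschitz element-wise activation, and checks the norm constraints that define the class. The function names (`two_layer_predict`, `in_class`), the dimensions, and the random initialization scheme are illustrative assumptions, not part of the paper.

```python
import numpy as np

def two_layer_predict(x, W, W0, v, sigma=np.tanh):
    """Evaluate <v, sigma((W + W0) x)> with an element-wise activation sigma.

    W0 is the fixed reference matrix (e.g. initial weights); W is the learned
    perturbation whose Frobenius norm is constrained by R.
    """
    return v @ sigma((W + W0) @ x)

def in_class(x, W, v, B, R, r):
    """Check the constraints defining H^sigma_{W^0, B, R, r}."""
    return (np.linalg.norm(W, ord="fro") <= R   # perturbation: small Frobenius norm
            and np.linalg.norm(v) <= r          # output weights: bounded Euclidean norm
            and np.linalg.norm(x) <= B)         # input: bounded Euclidean norm

# Illustrative example with width T and input dimension d (hypothetical values).
T, d = 512, 64
rng = np.random.default_rng(0)
W0 = rng.normal(size=(T, d)) / np.sqrt(d)   # reference matrix, e.g. initial weights
W = 0.01 * rng.normal(size=(T, d))          # trained perturbation, ||W||_F << ||W + W0||_F
v = rng.normal(size=T) / np.sqrt(T)
x = rng.normal(size=d)
x /= np.linalg.norm(x)                      # ensure ||x|| <= B with B = 1

print("prediction:", two_layer_predict(x, W, W0, v))
print("in class:", in_class(x, W, v, B=1.0,
                            R=np.linalg.norm(W, ord="fro"),
                            r=np.linalg.norm(v)))
# The two quantities appearing in the sample complexity bound:
print("R =", np.linalg.norm(W, ord="fro"), " ||W0||_Spectral =", np.linalg.norm(W0, ord=2))
```

In this sketch the perturbation $W$ is deliberately small relative to $W^0$, mirroring the observation in the abstract that trained weights stay close to initialization; the bound then controls generalization via $R$ and $\|W^0\|_{\text{Spectral}}$ rather than via the much larger $\|W+W^0\|_{\text{Frobenius}}$.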