On the Sample Complexity of Two-Layer Networks: Lipschitz Vs. Element-Wise Lipschitz Activation

Amit Daniely, Elad Granot
Proceedings of The 35th International Conference on Algorithmic Learning Theory, PMLR 237:505-517, 2024.

Abstract

This study examines the sample complexity of two-layer neural networks. Given a reference matrix $W^0 \in \mathbb{R}^{\mathcal{T}\times d}$ (typically the weights at initialization) and an $O(1)$-Lipschitz activation function $\sigma:\mathbb{R}\to\mathbb{R}$ with Lipschitz constant $L$, we study the class $\mathcal{H}_{W^0, B, R, r}^{\sigma} = \left\{\textbf{x}\mapsto \langle\textbf{v},\sigma((W+W^0)\textbf{x})\rangle : \|W\|_{\text{Frobenius}} \le R,\ \|\textbf{v}\| \le r,\ \|\textbf{x}\|\le B\right\}$. We show that when $\sigma$ is applied element-wise, the sample complexity of $\mathcal{H}_{W^0, B, R, r}^{\sigma}$ is bounded by $\tilde O \left(\frac{L^2 B^2 r^2 (R^2+\|W^0\|^2_{\text{Spectral}})}{\epsilon^2}\right)$. This bound is optimal up to logarithmic factors and depends only logarithmically on the width $\mathcal{T}$. It builds on the work of [Vardi et al., 2022], who established a similar result for $W^0 = 0$. Our motivation is the empirical observation that trained weights typically remain close to their initial values, so that $\|W\|_{\text{Frobenius}} \ll \|W+W^0\|_{\text{Frobenius}}$. To obtain the bound, we employ and refine a recently introduced norm-based technique, the Approximate Description Length (ADL) of [Daniely and Granot, 2019]. Finally, our results underline that the element-wise nature of $\sigma$ is crucial for a bound with only logarithmic dependence on the width: we prove that there exists an $O(1)$-Lipschitz (non-element-wise) activation function $\Psi:\mathbb{R}^{\mathcal{T}}\to\mathbb{R}^{\mathcal{T}}$ for which the sample complexity of $\mathcal{H}_{W^0, B, R, r}^{\Psi}$ grows linearly with the width.
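
For concreteness, the following is a minimal NumPy sketch (not from the paper) of a single hypothesis in the class $\mathcal{H}_{W^0, B, R, r}^{\sigma}$: a two-layer network $\textbf{x}\mapsto \langle\textbf{v},\sigma((W+W^0)\textbf{x})\rangle$ whose first-layer weights are written as a perturbation $W$ of a fixed reference matrix $W^0$. The width, input dimension, norm bounds, and choice of ReLU as the element-wise activation are illustrative assumptions, and the constraints are enforced here simply by rescaling.

import numpy as np

rng = np.random.default_rng(0)

T, d = 512, 64            # width and input dimension (illustrative)
B, R, r = 1.0, 2.0, 1.0   # norm bounds on x, W, and v (illustrative)

# Reference ("initial") weights W0 and a perturbation W with ||W||_Frobenius <= R.
W0 = rng.normal(scale=1.0 / np.sqrt(d), size=(T, d))
W = rng.normal(size=(T, d))
W *= R / np.linalg.norm(W, "fro")

# Output weights v with ||v|| <= r, and an input x with ||x|| <= B.
v = rng.normal(size=T)
v *= r / np.linalg.norm(v)
x = rng.normal(size=d)
x *= B / np.linalg.norm(x)

# An O(1)-Lipschitz, element-wise activation (ReLU, Lipschitz constant L = 1).
relu = lambda z: np.maximum(z, 0.0)

# h(x) = <v, sigma((W + W0) x)>
h_x = v @ relu((W + W0) @ x)

# The bound above scales with R and ||W0||_Spectral, not with ||W + W0||_Frobenius.
print(h_x, np.linalg.norm(W0, 2))
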

Cite this Paper


BibTeX
@InProceedings{pmlr-v237-daniely24a,
  title     = {On the Sample Complexity of Two-Layer Networks: Lipschitz Vs. Element-Wise Lipschitz Activation},
  author    = {Daniely, Amit and Granot, Elad},
  booktitle = {Proceedings of The 35th International Conference on Algorithmic Learning Theory},
  pages     = {505--517},
  year      = {2024},
  editor    = {Vernade, Claire and Hsu, Daniel},
  volume    = {237},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--28 Feb},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v237/daniely24a/daniely24a.pdf},
  url       = {https://proceedings.mlr.press/v237/daniely24a.html}
}
APA
Daniely, A., & Granot, E. (2024). On the Sample Complexity of Two-Layer Networks: Lipschitz Vs. Element-Wise Lipschitz Activation. Proceedings of The 35th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 237:505-517. Available from https://proceedings.mlr.press/v237/daniely24a.html.
