On the Sample Complexity of Two-Layer Networks: Lipschitz Vs. Element-Wise Lipschitz Activation
Proceedings of The 35th International Conference on Algorithmic Learning Theory, PMLR 237:505-517, 2024.
Abstract
This study delves into the sample complexity of two-layer neural networks. For a given reference matrix $W^0 \in \mathbb{R}^{\mathcal{T} \times d}$ (typically representing initial training weights) and an $O(1)$-Lipschitz activation function $\sigma : \mathbb{R} \to \mathbb{R}$, we examine the class $\mathcal{H}_{W^0, B, R, r}^{\sigma} = \{ x \mapsto \langle v, \sigma((W + W^0)x) \rangle : \|v\| \le B, \|W\|_{\text{Frobenius}} \le R, \|x\| \le r \}$. We demonstrate that when $\sigma$ operates element-wise, the sample complexity of $\mathcal{H}_{W^0, B, R, r}^{\sigma}$ is bounded by $\tilde O \left(\frac{L^2 B^2 r^2 (R^2+\|W^0\|^2_{\text{Spectral}})}{\epsilon^2}\right)$. This bound is optimal up to logarithmic factors and depends only logarithmically on the width $\mathcal{T}$. This finding builds upon [Vardi et al., 2022], who established a similar result for $W^0 = 0$. Our motivation stems from the empirical observation that trained weights often remain close to their initial counterparts, implying that $\|W\|_{\text{Frobenius}} \ll \|W+W^0\|_{\text{Frobenius}}$. To arrive at our conclusion, we employ and enhance a recently introduced norm-based bound technique, the Approximate Description Length (ADL), proposed by [Daniely and Granot, 2019]. Finally, our results underline the crucial role of the element-wise nature of $\sigma$ in achieving a bound with logarithmic width dependence. To illustrate, we prove that there exists an $O(1)$-Lipschitz (non-element-wise) activation function $\Psi:\mathbb{R}^{\mathcal{T}}\to\mathbb{R}^{\mathcal{T}}$ for which the sample complexity of $\mathcal{H}_{W^0, B, R, r}^{\Psi}$ grows linearly with the width.
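For concreteness, the following is a minimal NumPy sketch (not from the paper) of a single predictor in $\mathcal{H}_{W^0, B, R, r}^{\sigma}$ with an element-wise activation. All dimensions, norm radii, the initialization of $W^0$, and the choice $\sigma=\tanh$ (which is $1$-Lipschitz) are hypothetical choices made purely for illustration.

```python
# Illustrative sketch only: a member of H^sigma_{W0, B, R, r}, i.e.
# x -> <v, sigma((W + W0) x)> with an element-wise 1-Lipschitz activation.
import numpy as np

T, d = 512, 64           # width T and input dimension d (hypothetical)
B, R, r = 1.0, 1.0, 1.0  # norm bounds on v, W (Frobenius), and x

rng = np.random.default_rng(0)
W0 = rng.standard_normal((T, d)) / np.sqrt(d)   # reference (initial) weights

def project_fro(W, radius):
    """Project W onto the Frobenius-norm ball of the given radius."""
    norm = np.linalg.norm(W)
    return W if norm <= radius else W * (radius / norm)

# A perturbation W with ||W||_Frobenius <= R, so the trained weights
# W + W0 stay close to the initialization W0.
W = project_fro(0.01 * rng.standard_normal((T, d)), R)
v = rng.standard_normal(T)
v *= B / np.linalg.norm(v)                      # ||v|| <= B

def predict(x, sigma=np.tanh):
    """x -> <v, sigma((W + W0) x)> with element-wise sigma."""
    assert np.linalg.norm(x) <= r + 1e-9
    return v @ sigma((W + W0) @ x)

x = rng.standard_normal(d)
x *= r / np.linalg.norm(x)                      # ||x|| <= r
print(predict(x))
```

Replacing the element-wise `sigma` above with a map acting jointly on all $\mathcal{T}$ coordinates corresponds to the non-element-wise case $\Psi:\mathbb{R}^{\mathcal{T}}\to\mathbb{R}^{\mathcal{T}}$, for which the abstract states that the logarithmic width dependence can fail.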