Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:62779-62802, 2024.
Abstract
In this work, we study statistical learning with dependent data and square loss in a hypothesis class with tail decay in an Orlicz space: F ⊂ L_{Ψ_p}. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent (e.g. β-mixing) data. Typical non-asymptotic results exhibit variance proxies that are deflated multiplicatively by the mixing time of the underlying covariates process. We show that whenever the topologies of L_2 and Ψ_p are comparable on our hypothesis class F, the empirical risk minimizer achieves a rate that only depends on the complexity of the class and second-order statistics in its leading term. We refer to this as a near mixing-free rate, since direct dependence on mixing is relegated to an additive higher-order term. Our approach, reliant on mixed-tail generic chaining, allows us to obtain sharp, instance-optimal rates. Examples satisfying our framework include sub-Gaussian linear regression and bounded smoothness classes.
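As a hedged schematic of this contrast (an illustration of the claim, not a theorem statement from the paper), write f̂_n for the empirical risk minimizer over n dependent samples, σ² for a variance proxy, comp(F) for a complexity functional of the class, and τ_mix for the mixing time; all of these symbols are illustrative placeholders. A typical bound with a deflated effective sample size, versus a near mixing-free bound, would then take roughly the form

\[
\underbrace{\;\mathbb{E}\,\|\hat f_n - f_\star\|_{L_2}^2 \;\lesssim\; \frac{\sigma^2\,\mathrm{comp}(\mathscr{F})\,\tau_{\mathrm{mix}}}{n}\;}_{\text{typical: effective sample size } n/\tau_{\mathrm{mix}}}
\qquad\text{vs.}\qquad
\underbrace{\;\mathbb{E}\,\|\hat f_n - f_\star\|_{L_2}^2 \;\lesssim\; \frac{\sigma^2\,\mathrm{comp}(\mathscr{F})}{n} \;+\; \mathrm{h.o.t.}(\tau_{\mathrm{mix}}, n)\;}_{\text{near mixing-free: mixing only in the additive higher-order term}}
\]

where, in the second display, the higher-order term decays faster in n than the leading term, so the leading-order rate matches what one would expect from second-order statistics alone.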