Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss

Ingvar Ziemann, Stephen Tu, George J. Pappas, Nikolai Matni
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:62779-62802, 2024.

Abstract

In this work, we study statistical learning with dependent data and square loss in a hypothesis class with tail decay in Orlicz space: $\mathscr{F}\subset L_{\Psi_p}$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent (e.g. $\beta$-mixing) data. Typical non-asymptotic results exhibit variance proxies that are deflated multiplicatively in the mixing time of the underlying covariates process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$, the empirical risk minimizer achieves a rate that only depends on the complexity of the class and second order statistics in its leading term. We refer to this as a near mixing-free rate, since direct dependence on mixing is relegated to an additive higher order term. Our approach, reliant on mixed tail generic chaining, allows us to obtain sharp, instance-optimal rates. Examples that satisfy our framework include sub-Gaussian linear regression and bounded smoothness classes.
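For orientation only (a schematic paraphrase of the abstract, not a theorem statement from the paper; the symbols $\mathrm{comp}(\mathscr{F})$, $\sigma^2$, and $\tau_{\mathrm{mix}}$ are illustrative placeholders rather than the paper's notation), a near mixing-free excess-risk bound for the empirical risk minimizer $\hat{f}$ takes the form

$$\|\hat{f} - f_\star\|_{L^2}^2 \;\lesssim\; \frac{\sigma^2\,\mathrm{comp}(\mathscr{F})}{n} \;+\; \mathrm{h.o.t.}(\tau_{\mathrm{mix}}, n),$$

where the leading term depends only on the complexity of $\mathscr{F}$ and second order statistics, while the mixing time $\tau_{\mathrm{mix}}$ enters only through the additive higher-order term, rather than deflating the effective sample size $n$ multiplicatively.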

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-ziemann24a,
  title     = {Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss},
  author    = {Ziemann, Ingvar and Tu, Stephen and Pappas, George J. and Matni, Nikolai},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {62779--62802},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/ziemann24a/ziemann24a.pdf},
  url       = {https://proceedings.mlr.press/v235/ziemann24a.html},
  abstract  = {In this work, we study statistical learning with dependent data and square loss in a hypothesis class with tail decay in Orlicz space: $\mathscr{F}\subset L_{\Psi_p}$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent (e.g. $\beta$-mixing) data. Typical non-asymptotic results exhibit variance proxies that are deflated multiplicatively in the mixing time of the underlying covariates process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$, the empirical risk minimizer achieves a rate that only depends on the complexity of the class and second order statistics in its leading term. We refer to this as a near mixing-free rate, since direct dependence on mixing is relegated to an additive higher order term. Our approach, reliant on mixed tail generic chaining, allows us to obtain sharp, instance-optimal rates. Examples that satisfy our framework include for instance sub-Gaussian linear regression and bounded smoothness classes.}
}
Endnote
%0 Conference Paper
%T Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss
%A Ingvar Ziemann
%A Stephen Tu
%A George J. Pappas
%A Nikolai Matni
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-ziemann24a
%I PMLR
%P 62779--62802
%U https://proceedings.mlr.press/v235/ziemann24a.html
%V 235
%X In this work, we study statistical learning with dependent data and square loss in a hypothesis class with tail decay in Orlicz space: $\mathscr{F}\subset L_{\Psi_p}$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent (e.g. $\beta$-mixing) data. Typical non-asymptotic results exhibit variance proxies that are deflated multiplicatively in the mixing time of the underlying covariates process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$, the empirical risk minimizer achieves a rate that only depends on the complexity of the class and second order statistics in its leading term. We refer to this as a near mixing-free rate, since direct dependence on mixing is relegated to an additive higher order term. Our approach, reliant on mixed tail generic chaining, allows us to obtain sharp, instance-optimal rates. Examples that satisfy our framework include for instance sub-Gaussian linear regression and bounded smoothness classes.
APA
Ziemann, I., Tu, S., Pappas, G.J. & Matni, N. (2024). Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:62779-62802. Available from https://proceedings.mlr.press/v235/ziemann24a.html.
