Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss

Ingvar Ziemann, Stephen Tu, George J. Pappas, Nikolai Matni
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:62779-62802, 2024.

Abstract

In this work, we study statistical learning with dependent data and square loss in a hypothesis class with tail decay in Orlicz space: $\mathscr{F}\subset L_{\Psi_p}$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent (e.g. $\beta$-mixing) data. Typical non-asymptotic results exhibit variance proxies that are deflated multiplicatively in the mixing time of the underlying covariates process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$, the empirical risk minimizer achieves a rate that only depends on the complexity of the class and second order statistics in its leading term. We refer to this as a near mixing-free rate, since direct dependence on mixing is relegated to an additive higher order term. Our approach, reliant on mixed tail generic chaining, allows us to obtain sharp, instance-optimal rates. Examples that satisfy our framework include, for instance, sub-Gaussian linear regression and bounded smoothness classes.
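
To make the abstract's terminology concrete, the LaTeX sketch below records the standard $\Psi_p$ Orlicz norm and a schematic contrast between a variance proxy deflated by the mixing time and the near mixing-free shape described above. The complexity functional $\mathrm{comp}(\mathscr{F})$, the noise level $\sigma^2$, the mixing time $\tau_{\mathrm{mix}}$, and the higher-order term are illustrative placeholders and do not reproduce the paper's theorem statements.

% Illustrative sketch only; comp(F), sigma^2, and tau_mix are schematic
% placeholders, not quantities defined in the paper.
\documentclass{article}
\usepackage{amsmath,amssymb,mathrsfs}
\begin{document}

% Standard Orlicz norm with Young function \Psi_p(x) = \exp(x^p) - 1:
\[
  \|f\|_{\Psi_p} \;=\; \inf\bigl\{\lambda > 0 \,:\, \mathbb{E}\exp\bigl(|f(X)|^p/\lambda^p\bigr) \le 2 \bigr\}.
\]

% A typical dependent-data bound deflates the sample size by the mixing
% time, which inflates the leading (variance) term multiplicatively:
\[
  \mathbb{E}\,\|\widehat{f} - f_\star\|_{L^2}^2
  \;\lesssim\; \frac{\sigma^2 \,\mathrm{comp}(\mathscr{F})}{n/\tau_{\mathrm{mix}}}.
\]

% A near mixing-free rate keeps the leading term at the i.i.d. scale and
% relegates mixing to an additive higher-order term:
\[
  \mathbb{E}\,\|\widehat{f} - f_\star\|_{L^2}^2
  \;\lesssim\; \frac{\sigma^2 \,\mathrm{comp}(\mathscr{F})}{n}
  \;+\; \text{h.o.t.}(\tau_{\mathrm{mix}}, n).
\]

\end{document}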

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-ziemann24a,
  title     = {Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss},
  author    = {Ziemann, Ingvar and Tu, Stephen and Pappas, George J. and Matni, Nikolai},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {62779--62802},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/ziemann24a/ziemann24a.pdf},
  url       = {https://proceedings.mlr.press/v235/ziemann24a.html},
  abstract  = {In this work, we study statistical learning with dependent data and square loss in a hypothesis class with tail decay in Orlicz space: $\mathscr{F}\subset L_{\Psi_p}$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent (e.g. $\beta$-mixing) data. Typical non-asymptotic results exhibit variance proxies that are deflated multiplicatively in the mixing time of the underlying covariates process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$, the empirical risk minimizer achieves a rate that only depends on the complexity of the class and second order statistics in its leading term. We refer to this as a near mixing-free rate, since direct dependence on mixing is relegated to an additive higher order term. Our approach, reliant on mixed tail generic chaining, allows us to obtain sharp, instance-optimal rates. Examples that satisfy our framework include for instance sub-Gaussian linear regression and bounded smoothness classes.}
}
Endnote
%0 Conference Paper
%T Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss
%A Ingvar Ziemann
%A Stephen Tu
%A George J. Pappas
%A Nikolai Matni
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-ziemann24a
%I PMLR
%P 62779--62802
%U https://proceedings.mlr.press/v235/ziemann24a.html
%V 235
%X In this work, we study statistical learning with dependent data and square loss in a hypothesis class with tail decay in Orlicz space: $\mathscr{F}\subset L_{\Psi_p}$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent (e.g. $\beta$-mixing) data. Typical non-asymptotic results exhibit variance proxies that are deflated multiplicatively in the mixing time of the underlying covariates process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$, the empirical risk minimizer achieves a rate that only depends on the complexity of the class and second order statistics in its leading term. We refer to this as a near mixing-free rate, since direct dependence on mixing is relegated to an additive higher order term. Our approach, reliant on mixed tail generic chaining, allows us to obtain sharp, instance-optimal rates. Examples that satisfy our framework include for instance sub-Gaussian linear regression and bounded smoothness classes.
APA
Ziemann, I., Tu, S., Pappas, G.J. & Matni, N. (2024). Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:62779-62802. Available from https://proceedings.mlr.press/v235/ziemann24a.html.