Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:62779-62802, 2024.
Abstract
In this work, we study statistical learning with dependent data and square loss in a hypothesis class with tail decay in an Orlicz space: F ⊂ L_{Ψ_p}. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent (e.g. β-mixing) data. Typical non-asymptotic results exhibit variance proxies that are deflated multiplicatively by the mixing time of the underlying covariates process. We show that whenever the topologies of L_2 and Ψ_p are comparable on our hypothesis class F, the empirical risk minimizer achieves a rate that only depends on the complexity of the class and second-order statistics in its leading term. We refer to this as a near mixing-free rate, since direct dependence on mixing is relegated to an additive higher-order term. Our approach, reliant on mixed-tail generic chaining, allows us to obtain sharp, instance-optimal rates. Examples satisfying our framework include sub-Gaussian linear regression and bounded smoothness classes.
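As a hedged schematic of this contrast (an illustration of the claim, not a theorem statement from the paper), write f̂_n for the empirical risk minimizer over n dependent samples, σ² for a variance proxy, comp(F) for a complexity functional of the class, and τ_mix for the mixing time; all of these symbols are illustrative placeholders. A typical bound with a deflated effective sample size, versus a near mixing-free bound, would then take roughly the form

\[
\underbrace{\;\mathbb{E}\,\|\hat f_n - f_\star\|_{L_2}^2 \;\lesssim\; \frac{\sigma^2\,\mathrm{comp}(\mathscr{F})\,\tau_{\mathrm{mix}}}{n}\;}_{\text{typical: effective sample size } n/\tau_{\mathrm{mix}}}
\qquad\text{vs.}\qquad
\underbrace{\;\mathbb{E}\,\|\hat f_n - f_\star\|_{L_2}^2 \;\lesssim\; \frac{\sigma^2\,\mathrm{comp}(\mathscr{F})}{n} \;+\; \mathrm{h.o.t.}(\tau_{\mathrm{mix}}, n)\;}_{\text{near mixing-free: mixing only in the additive higher-order term}}
\]

where, in the second display, the higher-order term decays faster in n than the leading term, so the leading-order rate matches what one would expect from second-order statistics alone.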