Optimal Learning Rate Schedules under Functional Scaling Laws: Power Decay and Warmup–Stable–Decay (Extended Abstract)

Binghui Li; Zilin Wang; Fengling Chen; Shiyang Zhao; Ruiheng Zheng; Lei Wu

Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:4722-4723, 2026.

Abstract

We study optimal learning rate (LR) schedules under the functional scaling law (FSL) framework (Li et al., 2025), where loss dynamics are controlled by a source exponent $s>0$ for signal learning and a capacity exponent $\beta>1$ for noise forgetting. For a fixed training horizon $N$, we characterize the schedules that minimize the final-step loss under natural stability constraints and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/\beta$, the optimal schedule takes the power-decay form $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$ with $\eta_{\mathrm{peak}}\asymp N^{-(s-1+1/\beta)/(s+1/\beta)}$. In contrast, in the hard-task regime $s < 1 - 1/\beta$, the optimal schedule exhibits a warmup–stable–decay (WSD)-like (Hu et al., 2024) structure: it maintains the largest admissible LR for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We next study the practical setting where the decay shape is fixed and only the peak LR is tuned. To separate these two design choices, we introduce a family of fractional LR schedules that decouple peak-LR tuning from decay-shape design. We prove that fixed-shape schedules suffer from capacity saturation: each shape can adapt to the capacity exponent only up to a shape-dependent threshold, beyond which the achievable convergence rate no longer improves. This yields a principled criterion for evaluating commonly used schedules such as cosine and linear decay, revealing both their strengths and limitations. We then apply the FSL-optimal power-decay schedule to one-pass stochastic gradient descent (SGD) for kernel regression and show that the last iterate attains the exact minimax-optimal convergence rate, eliminating the logarithmic gap in prior analyses. Finally, experiments validate our theoretical predictions in controlled settings and illustrate their usefulness for practical LR-schedule design in neural network training.
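The easy-task formula in the abstract can be made concrete with a short sketch. The Python below evaluates $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$ on the steps $z = 0, \dots, N-1$. It is an illustration only, not the authors' implementation: the function names `peak_lr` and `power_decay_schedule` are hypothetical, and because the abstract gives the peak LR only up to a constant ($\asymp$), the prefactor `c` (default 1.0) is an assumption that would need tuning in practice. The hard-task (WSD-like) regime is not implemented, since the abstract does not give its closed form.

```python
def peak_lr(N, s, beta, c=1.0):
    """Peak LR scaling N^{-(s - 1 + 1/beta) / (s + 1/beta)} from the abstract.

    `c` is a hypothetical constant prefactor; the abstract fixes the peak LR
    only up to constants.
    """
    if N <= 0:
        raise ValueError("N must be a positive horizon")
    if s <= 0 or beta <= 1:
        raise ValueError("the abstract assumes s > 0 and beta > 1")
    if s < 1 - 1 / beta:
        # Hard-task regime: the abstract describes a WSD-like optimum instead.
        raise ValueError("power decay applies only when s >= 1 - 1/beta")
    exponent = -(s - 1 + 1 / beta) / (s + 1 / beta)
    return c * N ** exponent


def power_decay_schedule(N, s, beta, c=1.0):
    """Return [eta(0), ..., eta(N-1)] with eta(z) = eta_peak * (1 - z/N)^(2*beta - 1)."""
    eta_peak = peak_lr(N, s, beta, c)
    power = 2 * beta - 1
    return [eta_peak * (1 - z / N) ** power for z in range(N)]


if __name__ == "__main__":
    # Example values chosen for illustration: s = 1, beta = 2 is in the easy regime.
    schedule = power_decay_schedule(N=1000, s=1.0, beta=2.0)
    print("first step:", schedule[0])
    print("midpoint:  ", schedule[500])
    print("last step: ", schedule[-1])
```

With $\beta = 2$ the decay exponent is $2\beta - 1 = 3$, so the LR falls to one eighth of its peak by the midpoint of training. This is a much steeper early decay than linear decay, which corresponds to an exponent of 1.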

Cite this Paper

BibTeX

@InProceedings{pmlr-v336-li26b,
  title = 	 {Optimal Learning Rate Schedules under Functional Scaling Laws: Power Decay and Warmup–Stable–Decay (Extended Abstract)},
  author =       {Li, Binghui and Wang, Zilin and Chen, Fengling and Zhao, Shiyang and Zheng, Ruiheng and Wu, Lei},
  booktitle = 	 {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = 	 {4722--4723},
  year = 	 {2026},
  editor = 	 {Hanneke, Steve and Lattimore, Tor},
  volume = 	 {336},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v336/main/assets/li26b/li26b.pdf},
  url = 	 {https://proceedings.mlr.press/v336/li26b.html},
  abstract = 	 {We study optimal learning rate (LR) schedules under the functional scaling law (FSL) framework (Li et al., 2025), where loss dynamics are controlled by a source exponent $s>0$ for signal learning and a capacity exponent $\beta>1$ for noise forgetting. For a fixed training horizon $N$, we characterize the schedules that minimize the final-step loss under natural stability constraints and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/\beta$, the optimal schedule takes the power-decay form $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$ with $\eta_{\mathrm{peak}}\asymp N^{-(s-1+1/\beta)/(s+1/\beta)}$. In contrast, in the hard-task regime $s < 1 - 1/\beta$, the optimal schedule exhibits a warmup–stable–decay (WSD)-like (Hu et al., 2024) structure: it maintains the largest admissible LR for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We next study the practical setting where the decay shape is fixed and only the peak LR is tuned. To separate these two design choices, we introduce a family of fractional LR schedules that decouple peak-LR tuning from decay-shape design. We prove that fixed-shape schedules suffer from capacity saturation: each shape can adapt to the capacity exponent only up to a shape-dependent threshold, beyond which the achievable convergence rate no longer improves. This yields a principled criterion for evaluating commonly used schedules such as cosine and linear decay, revealing both their strengths and limitations. We then apply the FSL-optimal power-decay schedule to one-pass stochastic gradient descent (SGD) for kernel regression and show that the last iterate attains the exact minimax-optimal convergence rate, eliminating the logarithmic gap in prior analyses. Finally, experiments validate our theoretical predictions in controlled settings and illustrate their usefulness for practical LR-schedule design in neural network training.}
}

Endnote

%0 Conference Paper
%T Optimal Learning Rate Schedules under Functional Scaling Laws: Power Decay and Warmup–Stable–Decay (Extended Abstract)
%A Binghui Li
%A Zilin Wang
%A Fengling Chen
%A Shiyang Zhao
%A Ruiheng Zheng
%A Lei Wu
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore
%F pmlr-v336-li26b
%I PMLR
%P 4722--4723
%U https://proceedings.mlr.press/v336/li26b.html
%V 336
%X We study optimal learning rate (LR) schedules under the functional scaling law (FSL) framework (Li et al., 2025), where loss dynamics are controlled by a source exponent $s>0$ for signal learning and a capacity exponent $\beta>1$ for noise forgetting. For a fixed training horizon $N$, we characterize the schedules that minimize the final-step loss under natural stability constraints and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/\beta$, the optimal schedule takes the power-decay form $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$ with $\eta_{\mathrm{peak}}\asymp N^{-(s-1+1/\beta)/(s+1/\beta)}$. In contrast, in the hard-task regime $s < 1 - 1/\beta$, the optimal schedule exhibits a warmup–stable–decay (WSD)-like (Hu et al., 2024) structure: it maintains the largest admissible LR for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We next study the practical setting where the decay shape is fixed and only the peak LR is tuned. To separate these two design choices, we introduce a family of fractional LR schedules that decouple peak-LR tuning from decay-shape design. We prove that fixed-shape schedules suffer from capacity saturation: each shape can adapt to the capacity exponent only up to a shape-dependent threshold, beyond which the achievable convergence rate no longer improves. This yields a principled criterion for evaluating commonly used schedules such as cosine and linear decay, revealing both their strengths and limitations. We then apply the FSL-optimal power-decay schedule to one-pass stochastic gradient descent (SGD) for kernel regression and show that the last iterate attains the exact minimax-optimal convergence rate, eliminating the logarithmic gap in prior analyses. Finally, experiments validate our theoretical predictions in controlled settings and illustrate their usefulness for practical LR-schedule design in neural network training.

APA

Li, B., Wang, Z., Chen, F., Zhao, S., Zheng, R. & Wu, L. (2026). Optimal Learning Rate Schedules under Functional Scaling Laws: Power Decay and Warmup–Stable–Decay (Extended Abstract). Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:4722-4723. Available from https://proceedings.mlr.press/v336/li26b.html.
