Variance Reduced Training with Stratified Sampling for Forecasting Models

Yucheng Lu, Youngsuk Park, Lifan Chen, Yuyang Wang, Christopher De Sa, Dean Foster
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:7145-7155, 2021.

Abstract

In large-scale time series forecasting, one often encounters datasets in which the temporal patterns of the series, while drifting over time, also differ from one another. In this paper, we prove that under such heterogeneity, training a forecasting model with commonly used stochastic optimizers (e.g., SGD) can suffer from large variance in its gradient estimates and therefore from long training times. We show that this issue can be efficiently alleviated via stratification, which lets the optimizer sample from pre-grouped time series strata. To better trade off gradient variance against computational cost, we further propose SCott (Stochastic Stratified Control Variate Gradient Descent), a variance-reduced SGD-style optimizer that combines stratified sampling with control variates. In theory, we provide a convergence guarantee for SCott on smooth non-convex objectives. Empirically, we evaluate SCott and other baseline optimizers on both synthetic and real-world time series forecasting problems, and demonstrate that SCott converges faster in both iterations and wall-clock time.
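
The following is a minimal NumPy sketch of the idea described above: combine stratified sampling over pre-grouped time series with an SVRG-style control variate per stratum. The linear one-step forecaster, the synthetic two-stratum data, the anchor-refresh schedule, and all function names below are illustrative assumptions for exposition, not the paper's exact algorithm or hyperparameters.

```python
# Sketch: stratified sampling + per-stratum control variates (SCott-style idea).
# Assumptions: a linear least-squares forecaster, two synthetic strata, and a
# fixed anchor-refresh period; these are illustrative, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

def grad(w, X, y):
    """Gradient of the mean squared error 0.5 * ||Xw - y||^2 / n."""
    return X.T @ (X @ w - y) / len(y)

def stratified_cv_step(w, w_anchor, anchor_grads, strata, weights, lr, batch=8):
    """One stratified control-variate update (sketch).

    For each stratum k, sample a minibatch B_k and form
        g_k = grad_{B_k}(w) - grad_{B_k}(w_anchor) + anchor_grads[k],
    then average the g_k with the stratum weights. The minibatch-gradient
    difference is correlated with the full-stratum gradient, so this estimator
    has lower variance than plain SGD when each stratum is internally homogeneous.
    """
    g = np.zeros_like(w)
    for k, (X, y) in enumerate(strata):
        idx = rng.choice(len(y), size=min(batch, len(y)), replace=False)
        g += weights[k] * (grad(w, X[idx], y[idx])
                           - grad(w_anchor, X[idx], y[idx])
                           + anchor_grads[k])
    return w - lr * g

# Toy usage: two strata of synthetic series with different underlying dynamics.
def make_stratum(coef, n=200, d=4):
    X = rng.normal(size=(n, d))
    y = X @ (coef * np.ones(d)) + 0.1 * rng.normal(size=n)
    return X, y

strata = [make_stratum(0.9), make_stratum(-0.5)]
sizes = np.array([len(y) for _, y in strata], dtype=float)
weights = sizes / sizes.sum()

w = np.zeros(4)
for it in range(500):
    if it % 50 == 0:  # periodically refresh the anchor point and per-stratum anchor gradients
        w_anchor = w.copy()
        anchor_grads = [grad(w_anchor, X, y) for X, y in strata]
    w = stratified_cv_step(w, w_anchor, anchor_grads, strata, weights, lr=0.1)

print("learned weights:", np.round(w, 3))
```

At each anchor refresh the estimator coincides with the exact stratified gradient, and between refreshes the control variates keep the per-step variance low while only minibatch gradients are computed, which is the variance/computation trade-off the abstract refers to.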

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-lu21d,
  title     = {Variance Reduced Training with Stratified Sampling for Forecasting Models},
  author    = {Lu, Yucheng and Park, Youngsuk and Chen, Lifan and Wang, Yuyang and De Sa, Christopher and Foster, Dean},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {7145--7155},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/lu21d/lu21d.pdf},
  url       = {https://proceedings.mlr.press/v139/lu21d.html},
  abstract  = {In large-scale time series forecasting, one often encounters the situation where the temporal patterns of time series, while drifting over time, differ from one another in the same dataset. In this paper, we provably show under such heterogeneity, training a forecasting model with commonly used stochastic optimizers (e.g. SGD) potentially suffers large variance on gradient estimation, and thus incurs long-time training. We show that this issue can be efficiently alleviated via stratification, which allows the optimizer to sample from pre-grouped time series strata. For better trading-off gradient variance and computation complexity, we further propose SCott (Stochastic Stratified Control Variate Gradient Descent), a variance reduced SGD-style optimizer that utilizes stratified sampling via control variate. In theory, we provide the convergence guarantee of SCott on smooth non-convex objectives. Empirically, we evaluate SCott and other baseline optimizers on both synthetic and real-world time series forecasting problems, and demonstrate SCott converges faster with respect to both iterations and wall clock time.}
}
Endnote
%0 Conference Paper
%T Variance Reduced Training with Stratified Sampling for Forecasting Models
%A Yucheng Lu
%A Youngsuk Park
%A Lifan Chen
%A Yuyang Wang
%A Christopher De Sa
%A Dean Foster
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-lu21d
%I PMLR
%P 7145--7155
%U https://proceedings.mlr.press/v139/lu21d.html
%V 139
%X In large-scale time series forecasting, one often encounters the situation where the temporal patterns of time series, while drifting over time, differ from one another in the same dataset. In this paper, we provably show under such heterogeneity, training a forecasting model with commonly used stochastic optimizers (e.g. SGD) potentially suffers large variance on gradient estimation, and thus incurs long-time training. We show that this issue can be efficiently alleviated via stratification, which allows the optimizer to sample from pre-grouped time series strata. For better trading-off gradient variance and computation complexity, we further propose SCott (Stochastic Stratified Control Variate Gradient Descent), a variance reduced SGD-style optimizer that utilizes stratified sampling via control variate. In theory, we provide the convergence guarantee of SCott on smooth non-convex objectives. Empirically, we evaluate SCott and other baseline optimizers on both synthetic and real-world time series forecasting problems, and demonstrate SCott converges faster with respect to both iterations and wall clock time.
APA
Lu, Y., Park, Y., Chen, L., Wang, Y., De Sa, C. & Foster, D. (2021). Variance Reduced Training with Stratified Sampling for Forecasting Models. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:7145-7155. Available from https://proceedings.mlr.press/v139/lu21d.html.
