[edit]
Value-Based Deep RL Scales Predictably
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:52422-52443, 2025.
Abstract
Scaling data and compute is critical in modern machine learning. However, scaling also demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from low compute or low data runs, without ever running the large-scale experiment. In this paper, we show predictability of value-based off-policy deep RL. First, we show that data and compute requirements to reach a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can extrapolate data requirements into a higher compute regime, and compute requirements into a higher data regime. Second, we determine the optimal allocation of total budget across data and compute to obtain given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between different hyperparameters, which is used to counteract effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.