Averaging $n$-step Returns Reduces Variance in Reinforcement Learning

Brett Daley; Martha White; Marlos C. Machado

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:9904-9930, 2024.

Abstract

Multistep returns, such as $n$-step returns and $\lambda$-returns, are commonly used to improve the sample efficiency of reinforcement learning (RL) methods. The variance of the multistep returns becomes the limiting factor in their length; looking too far into the future increases variance and reverses the benefits of multistep learning. In our work, we demonstrate the ability of compound returns (weighted averages of $n$-step returns) to reduce variance. We prove for the first time that any compound return with the same contraction modulus as a given $n$-step return has strictly lower variance. We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation. Because general compound returns can be expensive to implement, we introduce two-bootstrap returns which reduce variance while remaining efficient, even when using minibatched experience replay. We conduct experiments showing that compound returns often increase the sample efficiency of $n$-step deep RL agents like DQN and PPO.
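The terms used in the abstract can be made concrete with a minimal sketch. This is not the authors' implementation. It assumes the standard textbook $n$-step return, treats a compound return as a convex combination of $n$-step returns, and assumes the contraction modulus of such a return is the weighted sum of $\gamma^n$. All function names, including the helper `two_bootstrap_weight`, are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def n_step_return(rewards: Sequence[float], values: Sequence[float],
                  t: int, n: int, gamma: float) -> float:
    """Standard n-step return from time t.

    Assumed indexing: rewards[k] is the reward received after acting in
    state k, and values[k] is the current estimate V(s_k).
    """
    discounted_rewards = sum(gamma ** i * rewards[t + i] for i in range(n))
    return discounted_rewards + gamma ** n * values[t + n]


def compound_return(rewards: Sequence[float], values: Sequence[float],
                    t: int, weights: Dict[int, float], gamma: float) -> float:
    """Weighted average of n-step returns; weights maps n -> c_n."""
    if any(c < 0 for c in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(c * n_step_return(rewards, values, t, n, gamma)
               for n, c in weights.items())


def contraction_modulus(weights: Dict[int, float], gamma: float) -> float:
    """Assumed definition: sum over n of c_n * gamma**n."""
    return sum(c * gamma ** n for n, c in weights.items())


def two_bootstrap_weight(n: int, n1: int, n2: int, gamma: float) -> float:
    """Hypothetical helper: weight c on the n2-step return such that
    (1 - c) * G^(n1) + c * G^(n2) has the same assumed contraction
    modulus, gamma**n, as the plain n-step return.
    """
    if not (n1 < n < n2) or not (0.0 < gamma < 1.0):
        raise ValueError("requires n1 < n < n2 and 0 < gamma < 1")
    return (gamma ** n1 - gamma ** n) / (gamma ** n1 - gamma ** n2)


if __name__ == "__main__":
    gamma = 0.9
    c = two_bootstrap_weight(n=5, n1=3, n2=10, gamma=gamma)
    weights = {3: 1.0 - c, 10: c}
    # The two numbers printed below should agree up to rounding.
    print(gamma ** 5, contraction_modulus(weights, gamma))
```

Under these assumptions, averaging a shorter and a longer return with the weight from `two_bootstrap_weight` gives a return that bootstraps from only two value estimates while matching the contraction modulus of the chosen $n$-step return, which is the setting in which the abstract's variance comparison is stated.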

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-daley24a,
  title = 	 {Averaging $n$-step Returns Reduces Variance in Reinforcement Learning},
  author =       {Daley, Brett and White, Martha and Machado, Marlos C.},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {9904--9930},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/daley24a/daley24a.pdf},
  url = 	 {https://proceedings.mlr.press/v235/daley24a.html},
  abstract = 	 {Multistep returns, such as $n$-step returns and $\lambda$-returns, are commonly used to improve the sample efficiency of reinforcement learning (RL) methods. The variance of the multistep returns becomes the limiting factor in their length; looking too far into the future increases variance and reverses the benefits of multistep learning. In our work, we demonstrate the ability of compound returns—weighted averages of $n$-step returns—to reduce variance. We prove for the first time that any compound return with the same contraction modulus as a given $n$-step return has strictly lower variance. We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation. Because general compound returns can be expensive to implement, we introduce two-bootstrap returns which reduce variance while remaining efficient, even when using minibatched experience replay. We conduct experiments showing that compound returns often increase the sample efficiency of $n$-step deep RL agents like DQN and PPO.}
}

Endnote

%0 Conference Paper
%T Averaging $n$-step Returns Reduces Variance in Reinforcement Learning
%A Brett Daley
%A Martha White
%A Marlos C. Machado
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-daley24a
%I PMLR
%P 9904--9930
%U https://proceedings.mlr.press/v235/daley24a.html
%V 235
%X Multistep returns, such as $n$-step returns and $\lambda$-returns, are commonly used to improve the sample efficiency of reinforcement learning (RL) methods. The variance of the multistep returns becomes the limiting factor in their length; looking too far into the future increases variance and reverses the benefits of multistep learning. In our work, we demonstrate the ability of compound returns—weighted averages of $n$-step returns—to reduce variance. We prove for the first time that any compound return with the same contraction modulus as a given $n$-step return has strictly lower variance. We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation. Because general compound returns can be expensive to implement, we introduce two-bootstrap returns which reduce variance while remaining efficient, even when using minibatched experience replay. We conduct experiments showing that compound returns often increase the sample efficiency of $n$-step deep RL agents like DQN and PPO.

APA


Daley, B., White, M. & Machado, M. C. (2024). Averaging $n$-step Returns Reduces Variance in Reinforcement Learning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:9904-9930. Available from https://proceedings.mlr.press/v235/daley24a.html.
