Trust-Region Twisted Policy Improvement

Joery A. De Vries, Jinke He, Yaniv Oren, Matthijs T. J. Spaan
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:12901-12923, 2025.

Abstract

Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice, which has motivated alternative planners like sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem. Yet, persistent design choices of these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically to RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as by improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.
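
To situate the terminology in the abstract (particles, reweighting, resampling, root policy targets), the sketch below outlines a generic SMC planning loop. It is illustrative only and is not the paper's TRT-SMC algorithm: the helpers `step_model`, `prior_policy`, and `value_fn`, and all hyperparameters, are hypothetical placeholders standing in for a learned dynamics model, prior policy network, and value network.

```python
# Illustrative sketch of a generic SMC planning loop (not the paper's TRT-SMC).
# Assumed interfaces: step_model(s, a) -> (s_next, reward, done),
# prior_policy(s) -> action probabilities, value_fn(s) -> float.
import numpy as np


def smc_plan(root_state, step_model, prior_policy, value_fn,
             n_particles=16, horizon=8, temperature=1.0, seed=0):
    """Return an improved root policy target from a simple particle planner."""
    rng = np.random.default_rng(seed)
    states = [root_state] * n_particles
    root_actions = np.zeros(n_particles, dtype=int)

    for t in range(horizon):
        log_w = np.zeros(n_particles)            # per-step particle log-weights
        next_states = []
        for i, s in enumerate(states):
            probs = np.asarray(prior_policy(s))
            a = rng.choice(len(probs), p=probs)  # proposal: sample from the prior
            s_next, r, done = step_model(s, a)
            # Soft potential: reward plus bootstrapped value. Terminal states
            # contribute no bootstrap; real planners handle terminals explicitly.
            log_w[i] = (r + (0.0 if done else value_fn(s_next))) / temperature
            next_states.append(s_next)
            if t == 0:
                root_actions[i] = a

        # Reweight and resample to concentrate particles on promising trajectories.
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)
        states = [next_states[i] for i in idx]
        root_actions = root_actions[idx]

    # Root policy target: empirical distribution over surviving root actions.
    n_actions = len(np.asarray(prior_policy(root_state)))
    counts = np.bincount(root_actions, minlength=n_actions)
    return counts / counts.sum()
```

The paper's contributions (trust-region constrained action sampling, explicit terminal state handling, and improved policy and value target estimation) modify exactly these proposal, weighting, and target-extraction steps; see the linked PDF for the actual algorithm.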

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-de-vries25a, title = {Trust-Region Twisted Policy Improvement}, author = {De Vries, Joery A. and He, Jinke and Oren, Yaniv and Spaan, Matthijs T. J.}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {12901--12923}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/de-vries25a/de-vries25a.pdf}, url = {https://proceedings.mlr.press/v267/de-vries25a.html}, abstract = {Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice which has motivated alternative planners like sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem. Yet, persisting design choices of these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically to RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.} }
Endnote
%0 Conference Paper %T Trust-Region Twisted Policy Improvement %A Joery A. De Vries %A Jinke He %A Yaniv Oren %A Matthijs T. J. Spaan %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-de-vries25a %I PMLR %P 12901--12923 %U https://proceedings.mlr.press/v267/de-vries25a.html %V 267 %X Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice which has motivated alternative planners like sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem. Yet, persisting design choices of these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically to RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.
APA
De Vries, J.A., He, J., Oren, Y. & Spaan, M.T.J. (2025). Trust-Region Twisted Policy Improvement. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:12901-12923. Available from https://proceedings.mlr.press/v267/de-vries25a.html.