Scalable Safe Policy Improvement via Monte Carlo Tree Search

Alberto Castellini; Federico Bianchi; Edoardo Zorzi; Thiago D. Simão; Alessandro Farinelli; Matthijs T. J. Spaan

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:3732-3756, 2023.

Abstract

Algorithms for safely improving policies are important to deploy reinforcement learning approaches in real-world scenarios. In this work, we propose an algorithm, called MCTS-SPIBB, that computes safe policy improvement online using a Monte Carlo Tree Search based strategy. We theoretically prove that the policy generated by MCTS-SPIBB converges, as the number of simulations grows, to the optimal safely improved policy generated by Safe Policy Improvement with Baseline Bootstrapping (SPIBB), a popular algorithm based on policy iteration. Moreover, our empirical analysis performed on three standard benchmark domains shows that MCTS-SPIBB scales to significantly larger problems than SPIBB because it computes the policy online and locally, i.e., only in the states actually visited by the agent.

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-castellini23a,
  title = 	 {Scalable Safe Policy Improvement via {M}onte {C}arlo Tree Search},
  author =       {Castellini, Alberto and Bianchi, Federico and Zorzi, Edoardo and Sim\~{a}o, Thiago D. and Farinelli, Alessandro and Spaan, Matthijs T. J.},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {3732--3756},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/castellini23a/castellini23a.pdf},
  url = 	 {https://proceedings.mlr.press/v202/castellini23a.html},
  abstract = 	 {Algorithms for safely improving policies are important to deploy reinforcement learning approaches in real-world scenarios. In this work, we propose an algorithm, called MCTS-SPIBB, that computes safe policy improvement online using a Monte Carlo Tree Search based strategy. We theoretically prove that the policy generated by MCTS-SPIBB converges, as the number of simulations grows, to the optimal safely improved policy generated by Safe Policy Improvement with Baseline Bootstrapping (SPIBB), a popular algorithm based on policy iteration. Moreover, our empirical analysis performed on three standard benchmark domains shows that MCTS-SPIBB scales to significantly larger problems than SPIBB because it computes the policy online and locally, i.e., only in the states actually visited by the agent.}
}

EndNote

%0 Conference Paper
%T Scalable Safe Policy Improvement via Monte Carlo Tree Search
%A Alberto Castellini
%A Federico Bianchi
%A Edoardo Zorzi
%A Thiago D. Simão
%A Alessandro Farinelli
%A Matthijs T. J. Spaan
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-castellini23a
%I PMLR
%P 3732--3756
%U https://proceedings.mlr.press/v202/castellini23a.html
%V 202
%X Algorithms for safely improving policies are important to deploy reinforcement learning approaches in real-world scenarios. In this work, we propose an algorithm, called MCTS-SPIBB, that computes safe policy improvement online using a Monte Carlo Tree Search based strategy. We theoretically prove that the policy generated by MCTS-SPIBB converges, as the number of simulations grows, to the optimal safely improved policy generated by Safe Policy Improvement with Baseline Bootstrapping (SPIBB), a popular algorithm based on policy iteration. Moreover, our empirical analysis performed on three standard benchmark domains shows that MCTS-SPIBB scales to significantly larger problems than SPIBB because it computes the policy online and locally, i.e., only in the states actually visited by the agent.

APA


Castellini, A., Bianchi, F., Zorzi, E., Simão, T.D., Farinelli, A. & Spaan, M.T.J. (2023). Scalable Safe Policy Improvement via Monte Carlo Tree Search. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:3732-3756. Available from https://proceedings.mlr.press/v202/castellini23a.html.
