Power Mean Estimation in Stochastic Continuous Monte-Carlo Tree Search
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:12344-12376, 2025.
Abstract
Monte Carlo Tree Search (MCTS) has demonstrated success in online planning for deterministic environments, yet significant challenges remain in adapting it to stochastic Markov Decision Processes (MDPs), particularly in continuous state-action spaces. Existing methods such as HOOT, which combines MCTS with the Hierarchical Optimistic Optimization (HOO) bandit strategy, address continuous action spaces but rely on a logarithmic exploration bonus that lacks theoretical guarantees in non-stationary, stochastic settings. More recently, POLY-HOOT introduced a polynomial bonus term to achieve convergence in deterministic MDPs, but a comparable theory for stochastic MDPs remains undeveloped. In this paper, we propose a novel MCTS algorithm, Stochastic-Power-HOOT, designed for continuous, stochastic MDPs. Stochastic-Power-HOOT integrates a power mean as the value backup operator, alongside a polynomial exploration bonus that addresses the non-stationarity inherent in continuous action spaces. Our theoretical analysis establishes that Stochastic-Power-HOOT converges at a polynomial rate of $\mathcal{O}(n^{-\zeta})$ with $\zeta \in (0,1/2)$, where $n$ is the number of visited trajectories, thereby extending the non-asymptotic convergence guarantees of POLY-HOOT to stochastic environments. Experimental results on stochastic tasks validate our theoretical findings, demonstrating the effectiveness of Stochastic-Power-HOOT in continuous, stochastic domains.
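The two ingredients named in the abstract can be made concrete. The following is a minimal Python sketch, not the paper's implementation: the function names, the visit-count weighting, and the exponents `alpha` and `beta` are illustrative assumptions, and value estimates are assumed nonnegative (e.g., returns rescaled to $[0,1]$) so that fractional powers are well defined.

```python
import numpy as np

def power_mean(values, counts, p):
    """Visit-count-weighted power mean backup:
    ( sum_i w_i * v_i**p )**(1/p), with w_i = n_i / sum_j n_j.
    p = 1 recovers the usual average backup; p -> infinity
    approaches the max backup."""
    w = np.asarray(counts, dtype=float)
    w /= w.sum()
    v = np.asarray(values, dtype=float)
    return (w @ (v ** p)) ** (1.0 / p)

def polynomial_bonus(parent_visits, child_visits, alpha=0.5, beta=0.25):
    """Polynomial exploration bonus t**alpha / n**beta, used in place
    of the logarithmic UCT term sqrt(log t / n); the exponents here
    are placeholders, not the values analyzed in the paper."""
    return parent_visits ** alpha / max(child_visits, 1) ** beta

# Example: back up three children and score them for selection.
values, counts = [0.4, 0.7, 0.55], [10, 5, 3]
backup = power_mean(values, counts, p=2.0)            # node value estimate
scores = [v + polynomial_bonus(sum(counts), n)        # optimistic scores
          for v, n in zip(values, counts)]
```

Intuitively, the power mean with $p > 1$ interpolates between the mean backup (robust to stochastic returns) and the max backup (optimistic but noise-sensitive), while the polynomial bonus keeps exploration aggressive enough for the non-stationary child values that arise as subtrees continue to refine the continuous action space.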