Horizon-free Learning for Markov Decision Processes and Games: Stochastically Bounded Rewards and Improved Bounds
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:20221-20252, 2023.
Abstract
Horizon dependence is an important difference between reinforcement learning and other machine learning paradigms. Yet, existing results tackling the (exact) horizon dependence either assume that the reward is bounded per step, introducing an unfair comparison, or assume strict total boundedness, which requires the sum of rewards to be bounded almost surely and thus allows only restricted noise on the reward observations. This paper addresses these limitations by introducing a new relaxation, expected boundedness of rewards, in which the reward may be stochastic with only its expected sum bounded, opening the door to studying horizon dependence for a much broader class of noisy reward functions. We establish a novel generic algorithm that achieves horizon-free sample complexity for both Markov Decision Processes (MDPs) and games, via a reduction to a well-conditioned auxiliary Markovian environment in which only “important” state-action pairs are preserved. The algorithm requires only $\tilde{O}(\frac{S^2A}{\epsilon^2})$ episodes of interaction with such an environment to find an $\epsilon$-optimal policy/strategy (with high probability), improving upon (Zhang, 2022), which applies only to MDPs with deterministic rewards. Here $S$ is the number of states, $A$ is the number of actions, and the bound is independent of the horizon $H$.
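As a rough illustration of the horizon-free scaling stated above, the following Python sketch evaluates only the leading term $S^2A/\epsilon^2$ of the episode bound, ignoring the constants and logarithmic factors hidden in $\tilde{O}(\cdot)$; the particular values of $S$, $A$, and $\epsilon$ below are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: leading term of the O~(S^2 A / eps^2) episode bound.
# Constants and logarithmic factors are omitted; the problem sizes are hypothetical.

def episode_budget(num_states: int, num_actions: int, eps: float) -> float:
    """Leading term S^2 * A / eps^2 of the horizon-free sample complexity."""
    return (num_states ** 2) * num_actions / (eps ** 2)

if __name__ == "__main__":
    S, A, eps = 50, 10, 0.1  # hypothetical problem sizes
    budget = episode_budget(S, A, eps)
    # Note that the horizon H does not appear anywhere in this expression.
    print(f"~{budget:.2e} episodes (independent of the horizon H)")
```

The point of the sketch is simply that the episode count depends on $S$, $A$, and $\epsilon$ but not on $H$, which is what distinguishes this bound from horizon-dependent sample-complexity results.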