Near Optimal Reward-Free Reinforcement Learning
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:12402-12412, 2021.
Abstract
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases: in the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal; in the planning phase, the agent must return a near-optimal policy for an arbitrary reward function. We give a new efficient algorithm, \textbf{S}taged \textbf{S}ampling + \textbf{T}runcated \textbf{P}lanning (SSTP), which interacts with the environment for at most $O\!\left(\frac{S^2 A}{\epsilon^2}\,\mathrm{polylog}\!\left(\frac{SAH}{\epsilon}\right)\right)$ episodes in the exploration phase and is guaranteed to output a near-optimal policy for arbitrary reward functions in the planning phase, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $\epsilon$ is the target accuracy relative to the total reward. Notably, our sample complexity scales only \emph{logarithmically} with $H$, in contrast to all existing results, which scale \emph{polynomially} with $H$. Furthermore, this bound matches the minimax lower bound $\Omega\!\left(\frac{S^2 A}{\epsilon^2}\right)$ up to logarithmic factors. Our results rely on three new techniques: 1) a new sufficient condition for the dataset to allow planning for an $\epsilon$-suboptimal policy; 2) a new way to plan efficiently under the proposed condition using soft-truncated planning; 3) constructing an extended MDP to maximize the truncated cumulative reward efficiently.
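To make the two-phase protocol concrete, below is a minimal, hypothetical sketch of the reward-free interface: an exploration phase that records reward-free transitions, and a planning phase that plugs an arbitrary reward function into an empirical model via value iteration. All names (`env`, `explore`, `plan`) are illustrative assumptions; this is not the paper's SSTP algorithm (it has no staged sampling and no soft-truncated planning), only a sketch of the framework the abstract describes.

```python
import numpy as np

def explore(env, num_episodes, horizon, rng):
    """Exploration phase: collect (s, a, s') transitions without using any reward signal."""
    dataset = []
    for _ in range(num_episodes):
        s = env.reset()
        for _ in range(horizon):
            a = rng.integers(env.num_actions)   # placeholder exploration policy
            s_next = env.step(a)                # note: no reward is observed or stored
            dataset.append((s, a, s_next))
            s = s_next
    return dataset

def plan(dataset, reward, S, A, H):
    """Planning phase: given any reward table reward[s, a] (normalized so the total
    reward per episode is bounded), run value iteration on the empirical transition
    model built from the reward-free dataset and return a nonstationary policy."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in dataset:
        counts[s, a, s_next] += 1
    n_sa = counts.sum(axis=2, keepdims=True)
    # Empirical transition model; fall back to uniform for unvisited (s, a) pairs.
    P_hat = np.where(n_sa > 0, counts / np.maximum(n_sa, 1), 1.0 / S)

    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = reward + P_hat @ V                  # Q[s, a] = r[s, a] + sum_{s'} P_hat[s, a, s'] * V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```

The same reward-free dataset can be reused in `plan` for any number of reward functions, which is exactly the multi-reward use case the abstract highlights.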