Reward-Mixing MDPs with Few Latent Contexts are Learnable
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:18057-18082, 2023.
Abstract
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks a latent reward model among M candidates, and an agent interacts with the MDP throughout the episode for H time steps. Our goal is to learn a near-optimal policy that nearly maximizes the H time-step cumulative rewards in such a model. Prior work established an upper bound for RMMDPs with M=2. In this work, we resolve several open questions for the general RMMDP setting. We consider an arbitrary M≥2 and provide a sample-efficient algorithm, EM², that outputs an ϵ-optimal policy using O(ϵ^{-2}·S^d A^d·poly(H,Z)^d) episodes, where S and A are the numbers of states and actions respectively, H is the time horizon, Z is the support size of the reward distributions, and d=O(min(2M-1, H)). We also provide a (SA)^{Ω(√M)}/ϵ^2 lower bound, supporting that super-polynomial sample complexity in M is necessary.
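To make the interaction protocol concrete, below is a minimal sketch (not from the paper) of how a single RMMDP episode is generated: the latent reward context is drawn once per episode and hidden from the agent, rewards come from the context-specific model, and the transition dynamics are shared across contexts. The array names (P, R_contexts, mixing_weights), the policy interface, and the initial state are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rmmdp_episode(P, R_contexts, mixing_weights, policy, H, rng):
    """Roll out one H-step episode of a reward-mixing MDP.

    P              : shared transition kernel, shape (S, A, S)
    R_contexts     : list of M reward models; each has "values" (support of
                     size Z) and "probs" with shape (S, A, Z)
    mixing_weights : probabilities over the M latent contexts
    policy         : function (state, step) -> action; it never observes m
    """
    M = len(R_contexts)
    m = rng.choice(M, p=mixing_weights)   # latent context, fixed for the episode
    s = 0                                 # illustrative initial state
    total_reward = 0.0
    for h in range(H):
        a = policy(s, h)
        # reward sampled from the hidden context's reward distribution
        total_reward += rng.choice(R_contexts[m]["values"],
                                   p=R_contexts[m]["probs"][s, a])
        # next state drawn from the shared transition dynamics
        s = rng.choice(P.shape[-1], p=P[s, a])
    return total_reward
```

The learner only observes the trajectories produced by such rollouts; the goal stated in the abstract is to find a policy that maximizes the expected cumulative reward averaged over the unknown mixture of reward models.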