Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:3061-3069, 2024.
Abstract
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs, whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\tilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of the feature mapping, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the episode length, and $K$ is the number of episodes. Our result strictly improves upon the previous best-known $\tilde{O}(dS^2\sqrt{K} + \sqrt{HSAK})$ result of Zhao et al. (2023a), since $H \le S$ holds under the layered MDP structure. Our advancements are primarily attributed to (i) a new least-squares estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration inequality tailored to non-independent noise, originally proposed in the dynamic assortment area and applied for the first time in reinforcement learning to handle correlations between different states.
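For context, a linear mixture MDP assumes the transition kernel is linear in an unknown $d$-dimensional parameter; the formalization below follows the notation standard in this line of work (the symbols $\phi$ and $\theta^*$ are not defined in the abstract itself):

\[
\mathbb{P}(s' \mid s, a) \;=\; \big\langle \phi(s' \mid s, a),\, \theta^* \big\rangle,
\qquad \theta^* \in \mathbb{R}^d,
\]

where $\phi(s' \mid s, a) \in \mathbb{R}^d$ is a known feature mapping and $\theta^*$ is the unknown transition parameter to be estimated.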
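The claimed improvement over Zhao et al. (2023a) reduces to a one-line calculation: by the layered structure, $H \le S$, so the leading term of the new bound is dominated by that of the earlier bound,

\[
d\sqrt{HS^3K} \;\le\; d\sqrt{S \cdot S^3 K} \;=\; dS^2\sqrt{K},
\]

with strict improvement whenever $H < S$; the $\sqrt{HSAK}$ term is common to both bounds.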
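The abstract does not spell out the estimator, but a minimal sketch of a least-squares estimator that uses the visit information of all states (rather than a single state) could look as follows; the function name estimate_theta, the ridge regularizer, and the indicator-regression form are our illustrative assumptions, not the paper's exact construction:

import numpy as np

def estimate_theta(transitions, phi, states, d, lam=1.0):
    """Hypothetical ridge least-squares sketch for the transition
    parameter of a linear mixture MDP, P(s'|s,a) = <phi(s'|s,a), theta>.

    Each observed transition (s, a, s_next) contributes one regression
    sample per candidate next state s': feature phi(s'|s,a), target
    1{s' == s_next} -- i.e. it uses the visit information of all
    states rather than only one.
    """
    A = lam * np.eye(d)  # regularized Gram matrix
    b = np.zeros(d)      # feature-weighted indicator targets
    for (s, a, s_next) in transitions:
        for s_prime in states:
            x = phi(s_prime, s, a)             # feature vector in R^d
            A += np.outer(x, x)
            b += x * float(s_prime == s_next)  # next-state indicator
    return np.linalg.solve(A, b)               # theta_hat

Note that because every candidate next state contributes a sample from each observed transition, the indicator targets are correlated across states; this is exactly the non-independence that the self-normalized concentration inequality in (ii) is meant to handle.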