Better Rates for Any Adversarial Deterministic MDP
Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):675-683, 2013.
We consider regret minimization in adversarial deterministic Markov Decision Processes (ADMDPs) with bandit feedback. We devise a new algorithm that pushes the state-of-the-art forward in two ways: First, it attains a regret of O(T^2/3) with respect to the best fixed policy in hindsight, whereas the previous best regret bound was O(T^3/4). Second, the algorithm and its analysis are compatible with any feasible ADMDP graph topology, while all previous approaches required additional restrictions on the graph topology.