Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits

Hassan Saber; Fabien Pesquerel; Odalric-Ambrym Maillard; Mohammad Sadegh Talebi

Proceedings of the 15th Asian Conference on Machine Learning, PMLR 222:1167-1182, 2024.

Abstract

We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics, but unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, they are still largely challenging as they include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging the knowledge on transition function in regret minimization, in a statistically efficient way, appears largely unexplored. As it is conjectured that achieving exact optimality in generic MDPs is NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion leveraging the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, and logarithmic regret bound for this algorithm, which relies on a novel regret decomposition leveraging the structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.

Cite this Paper

BibTeX


@InProceedings{pmlr-v222-saber24a,
  title = 	 {Logarithmic regret in communicating {MDPs}: {L}everaging known dynamics with bandits},
  author =       {Saber, Hassan and Pesquerel, Fabien and Maillard, Odalric-Ambrym and Talebi, Mohammad Sadegh},
  booktitle = 	 {Proceedings of the 15th Asian Conference on Machine Learning},
  pages = 	 {1167--1182},
  year = 	 {2024},
  editor = 	 {Yanıkoğlu, Berrin and Buntine, Wray},
  volume = 	 {222},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {11--14 Nov},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v222/saber24a/saber24a.pdf},
  url = 	 {https://proceedings.mlr.press/v222/saber24a.html},
  abstract = 	 {We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics, but unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, they are still largely challenging as they include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging the knowledge on transition function in regret minimization, in a statistically efficient way, appears largely unexplored. As it is conjectured that achieving exact optimality in generic MDPs is NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion leveraging the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, and logarithmic regret bound for this algorithm, which relies on a novel regret decomposition leveraging the structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.}
}

Endnote

%0 Conference Paper
%T Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits
%A Hassan Saber
%A Fabien Pesquerel
%A Odalric-Ambrym Maillard
%A Mohammad Sadegh Talebi
%B Proceedings of the 15th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Berrin Yanıkoğlu
%E Wray Buntine
%F pmlr-v222-saber24a
%I PMLR
%P 1167--1182
%U https://proceedings.mlr.press/v222/saber24a.html
%V 222
%X We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics, but unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, they are still largely challenging as they include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging the knowledge on transition function in regret minimization, in a statistically efficient way, appears largely unexplored. As it is conjectured that achieving exact optimality in generic MDPs is NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion leveraging the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, and logarithmic regret bound for this algorithm, which relies on a novel regret decomposition leveraging the structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.

APA


Saber, H., Pesquerel, F., Maillard, O. & Talebi, M.S. (2024). Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits. Proceedings of the 15th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 222:1167-1182. Available from https://proceedings.mlr.press/v222/saber24a.html.
