Improved Exploration in Factored Average-Reward MDPs

Mohammad Sadegh Talebi, Anders Jonsson, Odalric Maillard
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:3988-3996, 2021.

Abstract

We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal{X}$ and the state space $\mathcal{S}$ admit the respective factored forms $\mathcal{X} = \otimes_{i=1}^{n} \mathcal{X}_i$ and $\mathcal{S} = \otimes_{i=1}^{m} \mathcal{S}_i$, and the transition and reward functions are factored over $\mathcal{X}$ and $\mathcal{S}$. Assuming a known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds in terms of the dependencies on the sizes of the $\mathcal{S}_i$'s and the diameter. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of the regrets of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL empirically enjoys substantially lower regret than existing algorithms with frequentist regret guarantees.
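For intuition on the Bernstein-type confidence sets mentioned above, the following is a minimal sketch in generic notation; the paper's exact thresholds, parent scopes, and constants may differ. Suppose the $i$-th transition factor $p_i(\cdot \mid z)$ is estimated by an empirical distribution $\widehat{p}_i(\cdot \mid z)$ built from $N(z)$ visits to its parent scope $z$, and let $\beta(N(z), \delta)$ denote a confidence term of order $\log(1/\delta)$. An empirical-Bernstein confidence set then constrains each element of each factor separately:

% Sketch of a per-element empirical-Bernstein confidence set
% (illustrative form; not the paper's exact construction)
\[
\bigl| \widehat{p}_i(y \mid z) - p_i(y \mid z) \bigr|
\;\le\;
\sqrt{\frac{2\,\widehat{p}_i(y \mid z)\bigl(1 - \widehat{p}_i(y \mid z)\bigr)\,\beta(N(z), \delta)}{N(z)}}
\;+\;
\frac{\beta(N(z), \delta)}{3\,N(z)},
\qquad \forall\, y \in \mathcal{S}_i .
\]

Roughly speaking, each factor concentrates at a rate governed by $|\mathcal{S}_i|$ rather than by $|\mathcal{S}| = \prod_{i=1}^{m} |\mathcal{S}_i|$, so intersecting these per-factor, per-element sets yields a tighter optimistic model than bounding the full transition function at once.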

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-sadegh-talebi21a,
  title     = {Improved Exploration in Factored Average-Reward MDPs},
  author    = {Sadegh Talebi, Mohammad and Jonsson, Anders and Maillard, Odalric},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages     = {3988--3996},
  year      = {2021},
  editor    = {Banerjee, Arindam and Fukumizu, Kenji},
  volume    = {130},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--15 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v130/talebi21a/talebi21a.pdf},
  url       = {https://proceedings.mlr.press/v130/sadegh-talebi21a.html},
  abstract  = {We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal{X}$ and the state space $\mathcal{S}$ admit the respective factored forms $\mathcal{X} = \otimes_{i=1}^{n} \mathcal{X}_i$ and $\mathcal{S} = \otimes_{i=1}^{m} \mathcal{S}_i$, and the transition and reward functions are factored over $\mathcal{X}$ and $\mathcal{S}$. Assuming a known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds in terms of the dependencies on the sizes of the $\mathcal{S}_i$'s and the diameter. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of the regrets of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL empirically enjoys substantially lower regret than existing algorithms with frequentist regret guarantees.}
}
Endnote
%0 Conference Paper
%T Improved Exploration in Factored Average-Reward MDPs
%A Mohammad Sadegh Talebi
%A Anders Jonsson
%A Odalric Maillard
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu
%F pmlr-v130-sadegh-talebi21a
%I PMLR
%P 3988--3996
%U https://proceedings.mlr.press/v130/sadegh-talebi21a.html
%V 130
%X We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal{X}$ and the state space $\mathcal{S}$ admit the respective factored forms $\mathcal{X} = \otimes_{i=1}^{n} \mathcal{X}_i$ and $\mathcal{S} = \otimes_{i=1}^{m} \mathcal{S}_i$, and the transition and reward functions are factored over $\mathcal{X}$ and $\mathcal{S}$. Assuming a known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds in terms of the dependencies on the sizes of the $\mathcal{S}_i$'s and the diameter. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of the regrets of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL empirically enjoys substantially lower regret than existing algorithms with frequentist regret guarantees.
APA
Sadegh Talebi, M., Jonsson, A. & Maillard, O. (2021). Improved Exploration in Factored Average-Reward MDPs. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:3988-3996. Available from https://proceedings.mlr.press/v130/sadegh-talebi21a.html.
