Model-Based Reinforcement Learning Exploiting State-Action Equivalence

Mahsa Asadi; Mohammad Sadegh Talebi; Hippolyte Bourel; Odalric-Ambrym Maillard

Proceedings of The Eleventh Asian Conference on Machine Learning, PMLR 101:204-219, 2019.

Abstract

Leveraging an equivalence property in the state-space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which naturally defines an equivalence structure in the state-action space. We present equivalence-aware confidence sets for the case where the learner knows the underlying structure in advance. These sets are provably smaller than their corresponding equivalence-oblivious counterparts. In the more challenging case of an unknown equivalence structure, we present an algorithm called ApproxEquivalence that seeks to find an (approximate) equivalence structure, and define confidence sets using the approximate equivalence. To illustrate the efficacy of the presented confidence sets, we present C-UCRL, as a natural modification of UCRL2 for RL in undiscounted MDPs. In the case of a known equivalence structure, we show that C-UCRL improves over UCRL2 in terms of \emph{regret} by a factor of $\sqrt{SA/C}$, in any communicating MDP with $S$ states, $A$ actions, and $C$ classes, which corresponds to a massive improvement when $C\ll SA$. To the best of our knowledge, this is the first work providing regret bounds for RL when an equivalence structure in the MDP is efficiently exploited. In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs.
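
To give a feel for the size of the $\sqrt{SA/C}$ factor quoted above, the following minimal sketch evaluates it for a few problem sizes. The function name `regret_improvement_factor` and the example values of $S$, $A$, and $C$ are hypothetical and chosen only for illustration; they are not taken from the paper, and the sketch does not implement C-UCRL or ApproxEquivalence.

```python
import math


def regret_improvement_factor(num_states: int, num_actions: int, num_classes: int) -> float:
    """Evaluate sqrt(S * A / C), the improvement factor stated in the abstract.

    This is plain arithmetic on the stated factor, not a regret bound.
    The argument names are illustrative assumptions.
    """
    num_pairs = num_states * num_actions
    # The number of equivalence classes cannot exceed the number of
    # state-action pairs, and there must be at least one class.
    if not 1 <= num_classes <= num_pairs:
        raise ValueError("num_classes must lie between 1 and S * A")
    return math.sqrt(num_pairs / num_classes)


if __name__ == "__main__":
    # Hypothetical sizes: (S, A, C)
    for s, a, c in [(20, 5, 4), (20, 5, 100), (100, 4, 10)]:
        factor = regret_improvement_factor(s, a, c)
        print(f"S={s}, A={a}, C={c}: factor = {factor:.2f}")
```

When $C = SA$ (every state-action pair is its own class) the factor is 1, so no improvement is claimed; the factor grows as the pairs collapse into fewer classes.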

Cite this Paper

BibTeX


@InProceedings{pmlr-v101-asadi19a,
  title = 	 {Model-Based Reinforcement Learning Exploiting State-Action Equivalence},
  author =       {Asadi, Mahsa and Talebi, Mohammad Sadegh and Bourel, Hippolyte and Maillard, Odalric-Ambrym},
  booktitle = 	 {Proceedings of The Eleventh Asian Conference on Machine Learning},
  pages = 	 {204--219},
  year = 	 {2019},
  editor = 	 {Lee, Wee Sun and Suzuki, Taiji},
  volume = 	 {101},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--19 Nov},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v101/asadi19a/asadi19a.pdf},
  url = 	 {https://proceedings.mlr.press/v101/asadi19a.html},
  abstract = 	 {Leveraging an equivalence property in the state-space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which naturally defines an equivalence structure in the state-action space. We present equivalence-aware confidence sets for the case where the learner knows the underlying structure in advance. These sets are provably smaller than their corresponding equivalence-oblivious counterparts. In the more challenging case of an unknown equivalence structure, we present an algorithm called ApproxEquivalence that seeks to find an (approximate) equivalence structure, and define confidence sets using the approximate equivalence. To illustrate the efficacy of the presented confidence sets, we present C-UCRL, as a natural modification of UCRL2 for RL in undiscounted MDPs. In the case of a known equivalence structure, we show that C-UCRL improves over UCRL2 in terms of \emph{regret} by a factor of $\sqrt{SA/C}$, in any communicating MDP with $S$ states, $A$ actions, and $C$ classes, which corresponds to a massive improvement when $C\ll SA$. To the best of our knowledge, this is the first work providing regret bounds for RL when an equivalence structure in the MDP is efficiently exploited. In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs.}
}

Endnote

%0 Conference Paper
%T Model-Based Reinforcement Learning Exploiting State-Action Equivalence
%A Mahsa Asadi
%A Mohammad Sadegh Talebi
%A Hippolyte Bourel
%A Odalric-Ambrym Maillard
%B Proceedings of The Eleventh Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Wee Sun Lee
%E Taiji Suzuki
%F pmlr-v101-asadi19a
%I PMLR
%P 204--219
%U https://proceedings.mlr.press/v101/asadi19a.html
%V 101
%X Leveraging an equivalence property in the state-space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which naturally defines an equivalence structure in the state-action space. We present equivalence-aware confidence sets for the case where the learner knows the underlying structure in advance. These sets are provably smaller than their corresponding equivalence-oblivious counterparts. In the more challenging case of an unknown equivalence structure, we present an algorithm called ApproxEquivalence that seeks to find an (approximate) equivalence structure, and define confidence sets using the approximate equivalence. To illustrate the efficacy of the presented confidence sets, we present C-UCRL, as a natural modification of UCRL2 for RL in undiscounted MDPs. In the case of a known equivalence structure, we show that C-UCRL improves over UCRL2 in terms of \emph{regret} by a factor of $\sqrt{SA/C}$, in any communicating MDP with $S$ states, $A$ actions, and $C$ classes, which corresponds to a massive improvement when $C\ll SA$. To the best of our knowledge, this is the first work providing regret bounds for RL when an equivalence structure in the MDP is efficiently exploited. In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs.

APA


Asadi, M., Talebi, M. S., Bourel, H., & Maillard, O. (2019). Model-Based Reinforcement Learning Exploiting State-Action Equivalence. Proceedings of The Eleventh Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 101:204-219. Available from https://proceedings.mlr.press/v101/asadi19a.html.
