EPMC: Every Visit Preference Monte Carlo for Reinforcement Learning

Christian Wirth; Johannes Fürnkranz

Proceedings of the 5th Asian Conference on Machine Learning, PMLR 29:483-497, 2013.

Abstract

Reinforcement learning algorithms are usually hard for non-expert users to apply. Several aspects must be considered, such as the definition of the state, action, and reward spaces as well as the algorithm's hyperparameters. Preference-based approaches try to address these problems by dropping the requirement for exact rewards and replacing them with preferences over solutions. Some algorithms have been proposed within this framework, but they usually require parameterized policies, which is again a hindrance to their application. Monte Carlo-based approaches do not have this restriction and are also model-free. Hence, we present a new preference-based reinforcement learning algorithm that uses Monte Carlo estimates. The main idea is to estimate the relative Q-value of two actions for the same state within an every-visit framework. That is, preferences are used to estimate the Q-value of state-action pairs within a trajectory, based on feedback concerning the complete trajectory. The algorithm is evaluated on three common benchmark problems, namely mountain car, inverted pendulum, and acrobot, showing its advantage over a closely related algorithm that also uses estimates for intermediate states, but based on a probability theorem. In comparison to SARSA(λ), EPMC converges somewhat more slowly, but computes policies that are almost as good or better.
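
The every-visit idea in the abstract can be illustrated with a small sketch. This is not the authors' EPMC update rule, which the abstract does not spell out; it is a simplified counting scheme, assuming trajectories are lists of (state, action) pairs and that a single preference over two complete trajectories is credited to every visit of a shared state. The names `update_preference_counts`, `relative_preference`, and `wins` are hypothetical.

```python
from collections import defaultdict


def update_preference_counts(wins, preferred, other):
    """Credit a trajectory-level preference to every shared state visit.

    Assumption (not from the paper): `preferred` and `other` are lists of
    (state, action) pairs, and `wins[(s, a, b)]` counts how often action a
    was seen in the preferred trajectory while b was seen in the other one.
    """
    # Collect the actions the dispreferred trajectory took in each state.
    other_actions = defaultdict(list)
    for state, action in other:
        other_actions[state].append(action)

    # Every visit counts, not only the first visit of a state.
    for state, action in preferred:
        for rival in other_actions.get(state, []):
            if rival != action:
                wins[(state, action, rival)] += 1


def relative_preference(wins, state, a, b):
    """Estimate how often a was preferred over b in `state` (0.5 if unseen)."""
    won = wins.get((state, a, b), 0)
    lost = wins.get((state, b, a), 0)
    total = won + lost
    return 0.5 if total == 0 else won / total


# Illustrative data only: two short trajectories and one preference.
wins = defaultdict(int)
preferred = [("s0", "left"), ("s1", "right"), ("s0", "left")]
other = [("s0", "right"), ("s1", "right")]
update_preference_counts(wins, preferred, other)
estimate = relative_preference(wins, "s0", "left", "right")
```

The resulting estimate is a relative quantity between two actions in one state, which mirrors the abstract's "relative Q-value" only loosely; a faithful implementation would follow the update defined in the paper itself.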

Cite this Paper

BibTeX


@InProceedings{pmlr-v29-Wirth13,
  title = 	 {EPMC: Every Visit Preference Monte Carlo for Reinforcement Learning},
  author = 	 {Wirth, Christian and Fürnkranz, Johannes},
  booktitle = 	 {Proceedings of the 5th Asian Conference on Machine Learning},
  pages = 	 {483--497},
  year = 	 {2013},
  editor = 	 {Ong, Cheng Soon and Ho, Tu Bao},
  volume = 	 {29},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Australian National University, Canberra, Australia},
  month = 	 {13--15 Nov},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v29/Wirth13.pdf},
  url = 	 {https://proceedings.mlr.press/v29/Wirth13.html},
  abstract = 	 {Reinforcement learning algorithms are usually hard for non-expert users to apply. Several aspects must be considered, such as the definition of the state, action, and reward spaces as well as the algorithm's hyperparameters. Preference-based approaches try to address these problems by dropping the requirement for exact rewards and replacing them with preferences over solutions. Some algorithms have been proposed within this framework, but they usually require parameterized policies, which is again a hindrance to their application. Monte Carlo-based approaches do not have this restriction and are also model-free. Hence, we present a new preference-based reinforcement learning algorithm that uses Monte Carlo estimates. The main idea is to estimate the relative Q-value of two actions for the same state within an every-visit framework. That is, preferences are used to estimate the Q-value of state-action pairs within a trajectory, based on feedback concerning the complete trajectory. The algorithm is evaluated on three common benchmark problems, namely mountain car, inverted pendulum, and acrobot, showing its advantage over a closely related algorithm that also uses estimates for intermediate states, but based on a probability theorem. In comparison to SARSA(λ), EPMC converges somewhat more slowly, but computes policies that are almost as good or better.}
}

Endnote

%0 Conference Paper
%T EPMC: Every Visit Preference Monte Carlo for Reinforcement Learning
%A Christian Wirth
%A Johannes Fürnkranz
%B Proceedings of the 5th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Cheng Soon Ong
%E Tu Bao Ho
%F pmlr-v29-Wirth13
%I PMLR
%P 483--497
%U https://proceedings.mlr.press/v29/Wirth13.html
%V 29
%X Reinforcement learning algorithms are usually hard for non-expert users to apply. Several aspects must be considered, such as the definition of the state, action, and reward spaces as well as the algorithm's hyperparameters. Preference-based approaches try to address these problems by dropping the requirement for exact rewards and replacing them with preferences over solutions. Some algorithms have been proposed within this framework, but they usually require parameterized policies, which is again a hindrance to their application. Monte Carlo-based approaches do not have this restriction and are also model-free. Hence, we present a new preference-based reinforcement learning algorithm that uses Monte Carlo estimates. The main idea is to estimate the relative Q-value of two actions for the same state within an every-visit framework. That is, preferences are used to estimate the Q-value of state-action pairs within a trajectory, based on feedback concerning the complete trajectory. The algorithm is evaluated on three common benchmark problems, namely mountain car, inverted pendulum, and acrobot, showing its advantage over a closely related algorithm that also uses estimates for intermediate states, but based on a probability theorem. In comparison to SARSA(λ), EPMC converges somewhat more slowly, but computes policies that are almost as good or better.

RIS


TY  - CPAPER
TI  - EPMC: Every Visit Preference Monte Carlo for Reinforcement Learning
AU  - Christian Wirth
AU  - Johannes Fürnkranz
BT  - Proceedings of the 5th Asian Conference on Machine Learning
DA  - 2013/10/21
ED  - Cheng Soon Ong
ED  - Tu Bao Ho
ID  - pmlr-v29-Wirth13
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 29
SP  - 483
EP  - 497
L1  - http://proceedings.mlr.press/v29/Wirth13.pdf
UR  - https://proceedings.mlr.press/v29/Wirth13.html
AB  - Reinforcement learning algorithms are usually hard for non-expert users to apply. Several aspects must be considered, such as the definition of the state, action, and reward spaces as well as the algorithm's hyperparameters. Preference-based approaches try to address these problems by dropping the requirement for exact rewards and replacing them with preferences over solutions. Some algorithms have been proposed within this framework, but they usually require parameterized policies, which is again a hindrance to their application. Monte Carlo-based approaches do not have this restriction and are also model-free. Hence, we present a new preference-based reinforcement learning algorithm that uses Monte Carlo estimates. The main idea is to estimate the relative Q-value of two actions for the same state within an every-visit framework. That is, preferences are used to estimate the Q-value of state-action pairs within a trajectory, based on feedback concerning the complete trajectory. The algorithm is evaluated on three common benchmark problems, namely mountain car, inverted pendulum, and acrobot, showing its advantage over a closely related algorithm that also uses estimates for intermediate states, but based on a probability theorem. In comparison to SARSA(λ), EPMC converges somewhat more slowly, but computes policies that are almost as good or better.
ER  -

APA


Wirth, C. & Fürnkranz, J. (2013). EPMC: Every Visit Preference Monte Carlo for Reinforcement Learning. Proceedings of the 5th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 29:483-497. Available from https://proceedings.mlr.press/v29/Wirth13.html.
