EPMC: Every Visit Preference Monte Carlo for Reinforcement Learning

Christian Wirth, Johannes Fürnkranz
Proceedings of the 5th Asian Conference on Machine Learning, PMLR 29:483-497, 2013.

Abstract

Reinforcement learning algorithms are usually hard to use for non-expert users: several aspects such as the definition of the state, action, and reward spaces, as well as the algorithm's hyperparameters, have to be considered. Preference-based approaches try to address these problems by omitting the requirement for exact rewards, replacing them with preferences over solutions. Some algorithms have been proposed within this framework, but they usually require parameterized policies, which is again a hindrance for their application. Monte Carlo based approaches do not have this restriction and are also model-free. Hence, we present a new preference-based reinforcement learning algorithm utilizing Monte Carlo estimates. The main idea is to estimate the relative Q-value of two actions for the same state within an every-visit framework. This means preferences are used to estimate the Q-values of state-action pairs within a trajectory, based on feedback concerning the complete trajectory. The algorithm is evaluated on three common benchmark problems, namely mountain car, inverted pendulum, and acrobot, showing its advantage over a closely related algorithm that also uses estimates for intermediate states, but based on a probability theorem. In comparison to SARSA(λ), EPMC converges somewhat slower, but computes policies that are almost as good or better.
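
The core idea can be pictured roughly as follows. The Python sketch below is a minimal, hypothetical illustration of an every-visit, preference-based Monte Carlo update, not the paper's exact update rule: the class name PrefMC, its methods, and the simple +1/-1 credit signal are assumptions made purely for illustration.

# Hypothetical sketch: every-visit, preference-based Monte Carlo value estimation.
# All names and the +/-1 credit scheme are illustrative assumptions, not EPMC itself.
from collections import defaultdict
import random


class PrefMC:
    def __init__(self, actions, epsilon=0.1):
        self.actions = actions
        self.epsilon = epsilon
        # Running preference-based value estimate and visit count per (state, action).
        self.value = defaultdict(float)
        self.count = defaultdict(int)

    def update_from_preference(self, preferred, dominated):
        """Credit every (state, action) visit in the preferred trajectory
        positively and every visit in the dominated one negatively."""
        for sign, trajectory in ((+1.0, preferred), (-1.0, dominated)):
            for state, action in trajectory:
                key = (state, action)
                self.count[key] += 1
                # Incremental mean of the +/-1 preference signal.
                self.value[key] += (sign - self.value[key]) / self.count[key]

    def greedy_action(self, state):
        """Epsilon-greedy action selection on the relative value estimates."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.value[(state, a)])

In this picture, a preference would be obtained by comparing two complete rollouts on, e.g., the mountain car task, and passing the preferred and dominated trajectories (lists of state-action pairs) to update_from_preference; the policy then acts greedily on the accumulated relative values.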

Cite this Paper


BibTeX
@InProceedings{pmlr-v29-Wirth13,
  title     = {EPMC: Every Visit Preference Monte Carlo for Reinforcement Learning},
  author    = {Wirth, Christian and Fürnkranz, Johannes},
  booktitle = {Proceedings of the 5th Asian Conference on Machine Learning},
  pages     = {483--497},
  year      = {2013},
  editor    = {Ong, Cheng Soon and Ho, Tu Bao},
  volume    = {29},
  series    = {Proceedings of Machine Learning Research},
  address   = {Australian National University, Canberra, Australia},
  month     = {13--15 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v29/Wirth13.pdf},
  url       = {https://proceedings.mlr.press/v29/Wirth13.html}
}
APA
Wirth, C. & Fürnkranz, J. (2013). EPMC: Every Visit Preference Monte Carlo for Reinforcement Learning. Proceedings of the 5th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 29:483-497. Available from https://proceedings.mlr.press/v29/Wirth13.html.
