Semi-Supervised Apprenticeship Learning

Michal Valko, Mohammad Ghavamzadeh, Alessandro Lazaric
Proceedings of the Tenth European Workshop on Reinforcement Learning, PMLR 24:131-142, 2013.

Abstract

In apprenticeship learning we aim to learn a good policy by observing the behavior of an expert or a set of experts. In particular, we consider the case where the expert acts so as to maximize an unknown reward function defined as a linear combination of a set of state features. In this paper, we consider the setting where we observe many sample trajectories (i.e., sequences of states) but only one or a few of them are labeled as experts' trajectories. We investigate the conditions under which the remaining unlabeled trajectories can help in learning a policy with good performance. In particular, we define an extension to the max-margin inverse reinforcement learning proposed by Abbeel and Ng [2004] where, at each iteration, the max-margin optimization step is replaced by a semi-supervised optimization problem which favors classifiers separating clusters of trajectories. Finally, we report empirical results on two grid-world domains showing that the semi-supervised algorithm is able to output a better policy in fewer iterations than the related algorithm that does not take the unlabeled trajectories into account.
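To make the baseline concrete, the following is a minimal sketch of the feature-expectation matching step from the Abbeel and Ng [2004] algorithm that the paper extends (the paper itself replaces the max-margin step with a semi-supervised one, which is not shown here). The function name, the projection-variant update, and the clipping of the step size to [0, 1] are illustrative assumptions, not code from the paper.

```python
import numpy as np

def projection_step(mu_expert, mu_list):
    """Projection variant of max-margin apprenticeship learning.

    Given the expert's feature expectations mu_expert and the feature
    expectations of the candidate policies found so far (mu_list),
    return reward weights w pointing from the candidates' convex
    combination toward the expert, and the margin t = ||w||.
    """
    mu_bar = mu_list[0].astype(float)
    for mu in mu_list[1:]:
        d = mu - mu_bar
        # Move mu_bar toward mu by the projection of (mu_expert - mu_bar)
        # onto d; clipping keeps mu_bar inside the convex hull.
        alpha = float(d @ (mu_expert - mu_bar)) / float(d @ d)
        alpha = np.clip(alpha, 0.0, 1.0)
        mu_bar = mu_bar + alpha * d
    w = mu_expert - mu_bar           # reward weights for the next RL step
    t = float(np.linalg.norm(w))     # margin; terminate when t is small
    return w, t

# Toy usage: two candidate policies' feature expectations vs. an expert's.
w, t = projection_step(np.array([1.0, 0.0]),
                       [np.array([0.0, 0.0]), np.array([1.0, 1.0])])
```

In the full loop, w defines a reward R(s) = w · φ(s), an RL solver computes the optimal policy for that reward, its feature expectations are appended to mu_list, and the step is repeated until the margin t falls below a threshold.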

Cite this Paper


BibTeX
@InProceedings{pmlr-v24-valko12a,
  title     = {Semi-Supervised Apprenticeship Learning},
  author    = {Michal Valko and Mohammad Ghavamzadeh and Alessandro Lazaric},
  booktitle = {Proceedings of the Tenth European Workshop on Reinforcement Learning},
  pages     = {131--142},
  year      = {2013},
  editor    = {Marc Peter Deisenroth and Csaba Szepesvári and Jan Peters},
  volume    = {24},
  series    = {Proceedings of Machine Learning Research},
  address   = {Edinburgh, Scotland},
  month     = {30 Jun--01 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v24/valko12a/valko12a.pdf},
  url       = {http://proceedings.mlr.press/v24/valko12a.html},
  abstract  = {In apprenticeship learning we aim to learn a good policy by observing the behavior of an expert or a set of experts. In particular, we consider the case where the expert acts so as to maximize an unknown reward function defined as a linear combination of a set of state features. In this paper, we consider the setting where we observe many sample trajectories (i.e., sequences of states) but only one or a few of them are labeled as experts' trajectories. We investigate the conditions under which the remaining unlabeled trajectories can help in learning a policy with a good performance. In particular, we define an extension to the max-margin inverse reinforcement learning proposed by Abbeel and Ng [2004] where, at each iteration, the max-margin optimization step is replaced by a semi-supervised optimization problem which favors classifiers separating clusters of trajectories. Finally, we report empirical results on two grid-world domains showing that the semi-supervised algorithm is able to output a better policy in fewer iterations than the related algorithm that does not take the unlabeled trajectories into account.}
}
Endnote
%0 Conference Paper
%T Semi-Supervised Apprenticeship Learning
%A Michal Valko
%A Mohammad Ghavamzadeh
%A Alessandro Lazaric
%B Proceedings of the Tenth European Workshop on Reinforcement Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Marc Peter Deisenroth
%E Csaba Szepesvári
%E Jan Peters
%F pmlr-v24-valko12a
%I PMLR
%J Proceedings of Machine Learning Research
%P 131--142
%U http://proceedings.mlr.press
%V 24
%W PMLR
%X In apprenticeship learning we aim to learn a good policy by observing the behavior of an expert or a set of experts. In particular, we consider the case where the expert acts so as to maximize an unknown reward function defined as a linear combination of a set of state features. In this paper, we consider the setting where we observe many sample trajectories (i.e., sequences of states) but only one or a few of them are labeled as experts' trajectories. We investigate the conditions under which the remaining unlabeled trajectories can help in learning a policy with a good performance. In particular, we define an extension to the max-margin inverse reinforcement learning proposed by Abbeel and Ng [2004] where, at each iteration, the max-margin optimization step is replaced by a semi-supervised optimization problem which favors classifiers separating clusters of trajectories. Finally, we report empirical results on two grid-world domains showing that the semi-supervised algorithm is able to output a better policy in fewer iterations than the related algorithm that does not take the unlabeled trajectories into account.
RIS
TY  - CPAPER
TI  - Semi-Supervised Apprenticeship Learning
AU  - Michal Valko
AU  - Mohammad Ghavamzadeh
AU  - Alessandro Lazaric
BT  - Proceedings of the Tenth European Workshop on Reinforcement Learning
PY  - 2013/01/12
DA  - 2013/01/12
ED  - Marc Peter Deisenroth
ED  - Csaba Szepesvári
ED  - Jan Peters
ID  - pmlr-v24-valko12a
PB  - PMLR
SP  - 131
DP  - PMLR
EP  - 142
L1  - http://proceedings.mlr.press/v24/valko12a/valko12a.pdf
UR  - http://proceedings.mlr.press/v24/valko12a.html
AB  - In apprenticeship learning we aim to learn a good policy by observing the behavior of an expert or a set of experts. In particular, we consider the case where the expert acts so as to maximize an unknown reward function defined as a linear combination of a set of state features. In this paper, we consider the setting where we observe many sample trajectories (i.e., sequences of states) but only one or a few of them are labeled as experts' trajectories. We investigate the conditions under which the remaining unlabeled trajectories can help in learning a policy with a good performance. In particular, we define an extension to the max-margin inverse reinforcement learning proposed by Abbeel and Ng [2004] where, at each iteration, the max-margin optimization step is replaced by a semi-supervised optimization problem which favors classifiers separating clusters of trajectories. Finally, we report empirical results on two grid-world domains showing that the semi-supervised algorithm is able to output a better policy in fewer iterations than the related algorithm that does not take the unlabeled trajectories into account.
ER  -
APA
Valko, M., Ghavamzadeh, M. & Lazaric, A. (2013). Semi-Supervised Apprenticeship Learning. Proceedings of the Tenth European Workshop on Reinforcement Learning, in PMLR 24:131-142.
