Semi-Supervised Apprenticeship Learning

Michal Valko; Mohammad Ghavamzadeh; Alessandro Lazaric

Semi-Supervised Apprenticeship Learning

Michal Valko, Mohammad Ghavamzadeh, Alessandro Lazaric

Proceedings of the Tenth European Workshop on Reinforcement Learning, PMLR 24:131-142, 2013.

Abstract

In apprenticeship learning we aim to learn a good policy by observing the behavior of an expert or a set of experts. In particular, we consider the case where the expert acts so as to maximize an unknown reward function defined as a linear combination of a set of state features. In this paper, we consider the setting where we observe many sample trajectories (i.e., sequences of states) but only one or a few of them are labeled as experts' trajectories. We investigate the conditions under which the remaining unlabeled trajectories can help in learning a policy with a good performance. In particular, we define an extension to the max-margin inverse reinforcement learning proposed by Abbeel and Ng [2004] where, at each iteration, the max-margin optimization step is replaced by a semi-supervised optimiza- tion problem which favors classifiers separating clusters of trajectories. Finally, we report empirical results on two grid-world domains showing that the semi-supervised algorithm is able to output a better policy in fewer iterations than the related algorithm that does not take the unlabeled trajectories into account.

Cite this Paper

BibTeX


@InProceedings{pmlr-v24-valko12a,
  title = 	 {Semi-Supervised Apprenticeship Learning},
  author = 	 {Valko, Michal and Ghavamzadeh, Mohammad and Lazaric, Alessandro},
  booktitle = 	 {Proceedings of the Tenth European Workshop on Reinforcement Learning},
  pages = 	 {131--142},
  year = 	 {2013},
  editor = 	 {Deisenroth, Marc Peter and Szepesvári, Csaba and Peters, Jan},
  volume = 	 {24},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Edinburgh, Scotland},
  month = 	 {30 Jun--01 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v24/valko12a/valko12a.pdf},
  url = 	 {https://proceedings.mlr.press/v24/valko12a.html},
  abstract = 	 {In apprenticeship learning we aim to learn a good policy by observing the behavior of an expert or a set of experts. In particular, we consider the case where the expert acts so as to maximize an unknown reward function defined as a linear combination of a set of state features. In this paper, we consider the setting where we observe many sample trajectories (i.e., sequences of states) but only one or a few of them are labeled as experts' trajectories. We investigate the conditions under which the remaining unlabeled trajectories can help in learning a policy with a good performance. In particular, we define an extension to the max-margin inverse reinforcement learning proposed by Abbeel and Ng [2004] where, at each iteration, the max-margin optimization step is replaced by a semi-supervised optimiza- tion problem which favors classifiers separating clusters of trajectories. Finally, we report empirical results on two grid-world domains showing that the semi-supervised algorithm is able to output a better policy in fewer iterations than the related algorithm that does not take the unlabeled trajectories into account.}
}

Endnote

%0 Conference Paper
%T Semi-Supervised Apprenticeship Learning
%A Michal Valko
%A Mohammad Ghavamzadeh
%A Alessandro Lazaric
%B Proceedings of the Tenth European Workshop on Reinforcement Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Marc Peter Deisenroth
%E Csaba Szepesvári
%E Jan Peters	
%F pmlr-v24-valko12a
%I PMLR
%P 131--142
%U https://proceedings.mlr.press/v24/valko12a.html
%V 24
%X In apprenticeship learning we aim to learn a good policy by observing the behavior of an expert or a set of experts. In particular, we consider the case where the expert acts so as to maximize an unknown reward function defined as a linear combination of a set of state features. In this paper, we consider the setting where we observe many sample trajectories (i.e., sequences of states) but only one or a few of them are labeled as experts' trajectories. We investigate the conditions under which the remaining unlabeled trajectories can help in learning a policy with a good performance. In particular, we define an extension to the max-margin inverse reinforcement learning proposed by Abbeel and Ng [2004] where, at each iteration, the max-margin optimization step is replaced by a semi-supervised optimiza- tion problem which favors classifiers separating clusters of trajectories. Finally, we report empirical results on two grid-world domains showing that the semi-supervised algorithm is able to output a better policy in fewer iterations than the related algorithm that does not take the unlabeled trajectories into account.

RIS


TY  - CPAPER
TI  - Semi-Supervised Apprenticeship Learning
AU  - Michal Valko
AU  - Mohammad Ghavamzadeh
AU  - Alessandro Lazaric
BT  - Proceedings of the Tenth European Workshop on Reinforcement Learning
DA  - 2013/01/12
ED  - Marc Peter Deisenroth
ED  - Csaba Szepesvári
ED  - Jan Peters	
ID  - pmlr-v24-valko12a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 24
SP  - 131
EP  - 142
L1  - http://proceedings.mlr.press/v24/valko12a/valko12a.pdf
UR  - https://proceedings.mlr.press/v24/valko12a.html
AB  - In apprenticeship learning we aim to learn a good policy by observing the behavior of an expert or a set of experts. In particular, we consider the case where the expert acts so as to maximize an unknown reward function defined as a linear combination of a set of state features. In this paper, we consider the setting where we observe many sample trajectories (i.e., sequences of states) but only one or a few of them are labeled as experts' trajectories. We investigate the conditions under which the remaining unlabeled trajectories can help in learning a policy with a good performance. In particular, we define an extension to the max-margin inverse reinforcement learning proposed by Abbeel and Ng [2004] where, at each iteration, the max-margin optimization step is replaced by a semi-supervised optimiza- tion problem which favors classifiers separating clusters of trajectories. Finally, we report empirical results on two grid-world domains showing that the semi-supervised algorithm is able to output a better policy in fewer iterations than the related algorithm that does not take the unlabeled trajectories into account.
ER  -

APA


Valko, M., Ghavamzadeh, M. & Lazaric, A.. (2013). Semi-Supervised Apprenticeship Learning. Proceedings of the Tenth European Workshop on Reinforcement Learning, in Proceedings of Machine Learning Research 24:131-142 Available from https://proceedings.mlr.press/v24/valko12a.html.

Related Material

Download PDF