Learning from a Learner

Alexis Jacq, Matthieu Geist, Ana Paiva, Olivier Pietquin
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:2990-2999, 2019.

Abstract

In this paper, we propose a novel setting for Inverse Reinforcement Learning (IRL), namely "Learning from a Learner" (LfL). Unlike standard IRL, the reward is not learned from observations of an optimal agent but from observations of another learning (and thus sub-optimal) agent. To do so, we leverage the fact that the observed agent’s policy is assumed to improve over time. The ultimate goal of this approach is to recover the actual environment’s reward and to allow the observer to outperform the learner. To recover that reward in practice, we propose methods based on the entropy-regularized policy iteration framework. We discuss different approaches to learning solely from trajectories in the state-action space. We demonstrate the generality of our method by observing agents that implement various reinforcement learning algorithms. Finally, we show that, on both discrete and continuous state/action tasks, the performance of the observer (which optimizes the recovered reward) can surpass that of the observed agent.
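
To give intuition for the entropy-regularized policy iteration framework mentioned above, the following is a minimal sketch of the kind of relation such methods can exploit. It assumes, purely for illustration, that the observed learner performs soft policy iteration with a known temperature \alpha; it is not a restatement of the paper's exact algorithm.

\pi_{k+1}(a \mid s) \;\propto\; \exp\!\big( Q^{\pi_k}_{\alpha}(s,a) / \alpha \big)
\quad\Longrightarrow\quad
\alpha \log \pi_{k+1}(a \mid s) \;=\; Q^{\pi_k}_{\alpha}(s,a) \;-\; \alpha \log Z_k(s),
\qquad
Q^{\pi_k}_{\alpha}(s,a) \;=\; r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\big[ V^{\pi_k}_{\alpha}(s') \big].

Since the normalizer Z_k depends only on the state, r(s,a) is determined by \alpha \log \pi_{k+1}(a \mid s) up to state-only (potential-like) terms, which is why observing consecutive, improving policies of the learner carries enough information for the observer to recover a reward it can then optimize.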

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-jacq19a,
  title     = {Learning from a Learner},
  author    = {Jacq, Alexis and Geist, Matthieu and Paiva, Ana and Pietquin, Olivier},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {2990--2999},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/jacq19a/jacq19a.pdf},
  url       = {https://proceedings.mlr.press/v97/jacq19a.html},
  abstract  = {In this paper, we propose a novel setting for Inverse Reinforcement Learning (IRL), namely "Learning from a Learner" (LfL). As opposed to standard IRL, it does not consist in learning a reward by observing an optimal agent but from observations of another learning (and thus sub-optimal) agent. To do so, we leverage the fact that the observed agent’s policy is assumed to improve over time. The ultimate goal of this approach is to recover the actual environment’s reward and to allow the observer to outperform the learner. To recover that reward in practice, we propose methods based on the entropy-regularized policy iteration framework. We discuss different approaches to learn solely from trajectories in the state-action space. We demonstrate the genericity of our method by observing agents implementing various reinforcement learning algorithms. Finally, we show that, on both discrete and continuous state/action tasks, the observer’s performance (that optimizes the recovered reward) can surpass those of the observed agent.}
}
Endnote
%0 Conference Paper
%T Learning from a Learner
%A Alexis Jacq
%A Matthieu Geist
%A Ana Paiva
%A Olivier Pietquin
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-jacq19a
%I PMLR
%P 2990--2999
%U https://proceedings.mlr.press/v97/jacq19a.html
%V 97
%X In this paper, we propose a novel setting for Inverse Reinforcement Learning (IRL), namely "Learning from a Learner" (LfL). As opposed to standard IRL, it does not consist in learning a reward by observing an optimal agent but from observations of another learning (and thus sub-optimal) agent. To do so, we leverage the fact that the observed agent’s policy is assumed to improve over time. The ultimate goal of this approach is to recover the actual environment’s reward and to allow the observer to outperform the learner. To recover that reward in practice, we propose methods based on the entropy-regularized policy iteration framework. We discuss different approaches to learn solely from trajectories in the state-action space. We demonstrate the genericity of our method by observing agents implementing various reinforcement learning algorithms. Finally, we show that, on both discrete and continuous state/action tasks, the observer’s performance (that optimizes the recovered reward) can surpass those of the observed agent.
APA
Jacq, A., Geist, M., Paiva, A. & Pietquin, O. (2019). Learning from a Learner. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:2990-2999. Available from https://proceedings.mlr.press/v97/jacq19a.html.
