Learning without Knowing: Unobserved Context in Continuous Transfer Reinforcement Learning

Chenyu Liu; Yan Zhang; Yi Shen; Michael M. Zavlanos

Proceedings of the 3rd Conference on Learning for Dynamics and Control, PMLR 144:791-802, 2021.

Abstract

In this paper, we consider a transfer Reinforcement Learning (RL) problem in continuous state and action spaces, under unobserved contextual information. The context here can represent a specific unique mental view of the world that an expert agent has formed through past interactions with this world. We assume that this context is not accessible to a learner agent who can only observe the expert data and does not know how they were generated. Then, our goal is to use the context-aware continuous expert data to learn an optimal context-unaware policy for the learner using only a few new data samples. To date, such problems are typically solved using imitation learning, which assumes that both the expert and learner agents have access to the same information. However, if the learner does not know the expert context, using the expert data alone will result in a biased learner policy and will require many new data samples to improve. To address this challenge, we formulate the learning problem that the learner agent solves as a causal bound-constrained Multi-Armed-Bandit (MAB) problem. The arms of this MAB correspond to a set of basis policy functions that can be initialized in an unsupervised way using the expert data and represent the different expert behaviors affected by the unobserved context. The MAB constraints, in turn, correspond to causal bounds on the accumulated rewards of these basis policy functions, which we also compute from the expert data. The solution to this MAB allows the learner agent to select the best basis policy and improve it online. Moreover, the use of causal bounds reduces the exploration variance and therefore improves the learning rate. We provide numerical experiments on an autonomous driving example that show that our proposed transfer RL method improves the learner’s policy faster than imitation learning methods and enjoys much lower variance during training.
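
The bound-constrained bandit idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it is plain UCB1 whose indices are clipped to per-arm bounds, with arms pruned when their upper bound falls below the best lower bound. The names `clipped_ucb` and `make_bernoulli`, the Bernoulli rewards, and the bound values are all assumptions made for illustration. In the paper's setting each arm would be a basis policy and the bounds would be causal bounds computed from expert data.

```python
import math
import random


def make_bernoulli(probs):
    """Hypothetical reward source: arm k pays 1.0 with probability probs[k]."""
    def pull(k, rng):
        return 1.0 if rng.random() < probs[k] else 0.0
    return pull


def clipped_ucb(pull, lower, upper, horizon, seed=0):
    """UCB1 with each arm's index clipped to [lower[k], upper[k]].

    lower/upper stand in for bounds on expected reward known before learning.
    Returns per-arm pull counts and empirical mean rewards.
    """
    rng = random.Random(seed)
    n_arms = len(lower)
    # An arm whose upper bound is below the best lower bound cannot be optimal.
    best_lower = max(lower)
    active = [k for k in range(n_arms) if upper[k] >= best_lower]
    counts = [0] * n_arms
    means = [0.0] * n_arms

    def index(k, t):
        bonus = math.sqrt(2.0 * math.log(t) / counts[k])
        # Clip the optimistic estimate into the known interval.
        return min(max(means[k] + bonus, lower[k]), upper[k])

    for t in range(1, horizon + 1):
        untried = [k for k in active if counts[k] == 0]
        if untried:
            arm = untried[0]  # pull every surviving arm once first
        else:
            arm = max(active, key=lambda k: index(k, t))
        reward = pull(arm, rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return counts, means


# Illustrative values only: three arms with made-up bounds.
true_probs = [0.2, 0.5, 0.8]
lower = [0.1, 0.3, 0.6]
upper = [0.4, 0.7, 0.9]
counts, means = clipped_ucb(make_bernoulli(true_probs), lower, upper, horizon=2000)
print(counts, means)
```

In this toy run the first arm is never pulled because its upper bound (0.4) is below the best lower bound (0.6), and the clipped indices limit how much optimism the remaining suboptimal arm can receive. This is the sense in which prior bounds can cut exploration.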

Cite this Paper

BibTeX

@InProceedings{pmlr-v144-liu21a,
  title = 	 {Learning without Knowing: Unobserved Context in Continuous Transfer Reinforcement Learning},
  author =       {Liu, Chenyu and Zhang, Yan and Shen, Yi and Zavlanos, Michael M.},
  booktitle = 	 {Proceedings of the 3rd Conference on Learning for Dynamics and Control},
  pages = 	 {791--802},
  year = 	 {2021},
  editor = 	 {Jadbabaie, Ali and Lygeros, John and Pappas, George J. and Parrilo, Pablo A. and Recht, Benjamin and Tomlin, Claire J. and Zeilinger, Melanie N.},
  volume = 	 {144},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {07 -- 08 June},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v144/liu21a/liu21a.pdf},
  url = 	 {https://proceedings.mlr.press/v144/liu21a.html},
  abstract = 	 {In this paper, we consider a transfer Reinforcement Learning (RL) problem in continuous state and action spaces, under unobserved contextual information. The context here can represent a specific unique mental view of the world that an expert agent has formed through past interactions with this world. We assume that this context is not accessible to a learner agent who can only observe the expert data and does not know how they were generated. Then, our goal is to use the context-aware continuous expert data to learn an optimal context-unaware policy for the learner using only a few new data samples. To this date, such problems are typically solved using imitation learning that assumes that both the expert and learner agents have access to the same information. However, if the learner does not know the expert context, using the expert data alone will result in a biased learner policy and will require many new data samples to improve. To address this challenge, in this paper, we formulate the learning problem that the learner agent solves as a causal bound-constrained Multi-Armed-Bandit (MAB) problem. The arms of this MAB correspond to a set of basis policy functions that can be initialized in an unsupervised way using the expert data and represent the different expert behaviors affected by the unobserved context. On the other hand, the MAB constraints correspond to causal bounds on the accumulated rewards of these basis policy functions that we also compute from the expert data. The solution to this MAB allows the learner agent to select the best basis policy and improve it online. And the use of causal bounds reduces the exploration variance and, therefore, improves the learning rate. We provide numerical experiments on an autonomous driving example that show that our proposed transfer RL method improves the learner’s policy faster compared to imitation learning methods and enjoys much lower variance during training.}
}

Endnote

%0 Conference Paper
%T Learning without Knowing: Unobserved Context in Continuous Transfer Reinforcement Learning
%A Chenyu Liu
%A Yan Zhang
%A Yi Shen
%A Michael M. Zavlanos
%B Proceedings of the 3rd Conference on Learning for Dynamics and Control
%C Proceedings of Machine Learning Research
%D 2021
%E Ali Jadbabaie
%E John Lygeros
%E George J. Pappas
%E Pablo A. Parrilo
%E Benjamin Recht
%E Claire J. Tomlin
%E Melanie N. Zeilinger
%F pmlr-v144-liu21a
%I PMLR
%P 791--802
%U https://proceedings.mlr.press/v144/liu21a.html
%V 144
%X In this paper, we consider a transfer Reinforcement Learning (RL) problem in continuous state and action spaces, under unobserved contextual information. The context here can represent a specific unique mental view of the world that an expert agent has formed through past interactions with this world. We assume that this context is not accessible to a learner agent who can only observe the expert data and does not know how they were generated. Then, our goal is to use the context-aware continuous expert data to learn an optimal context-unaware policy for the learner using only a few new data samples. To this date, such problems are typically solved using imitation learning that assumes that both the expert and learner agents have access to the same information. However, if the learner does not know the expert context, using the expert data alone will result in a biased learner policy and will require many new data samples to improve. To address this challenge, in this paper, we formulate the learning problem that the learner agent solves as a causal bound-constrained Multi-Armed-Bandit (MAB) problem. The arms of this MAB correspond to a set of basis policy functions that can be initialized in an unsupervised way using the expert data and represent the different expert behaviors affected by the unobserved context. On the other hand, the MAB constraints correspond to causal bounds on the accumulated rewards of these basis policy functions that we also compute from the expert data. The solution to this MAB allows the learner agent to select the best basis policy and improve it online. And the use of causal bounds reduces the exploration variance and, therefore, improves the learning rate. We provide numerical experiments on an autonomous driving example that show that our proposed transfer RL method improves the learner’s policy faster compared to imitation learning methods and enjoys much lower variance during training.

APA

Liu, C., Zhang, Y., Shen, Y. & Zavlanos, M.M. (2021). Learning without Knowing: Unobserved Context in Continuous Transfer Reinforcement Learning. Proceedings of the 3rd Conference on Learning for Dynamics and Control, in Proceedings of Machine Learning Research 144:791-802. Available from https://proceedings.mlr.press/v144/liu21a.html.
