Robust Asymmetric Learning in POMDPs

Andrew Warrington, Jonathan W Lavington, Adam Scibior, Mark Schmidt, Frank Wood
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11013-11023, 2021.

Abstract

Policies for partially observed Markov decision processes can be efficiently learned by imitating expert policies generated using asymmetric information. Unfortunately, existing approaches for this kind of imitation learning have a serious flaw: the expert does not know what the trainee cannot see, and as a result may encourage actions that are sub-optimal or unsafe under partial information. To address this issue, we derive an update which, when applied iteratively to an expert, maximizes the expected reward of the trainee’s policy. Using this update, we construct a computationally efficient algorithm, adaptive asymmetric DAgger (A2D), that jointly trains the expert and trainee policies. We then show that A2D allows the trainee to safely imitate the modified expert, and outperforms policies learned either by imitating a fixed expert or through direct reinforcement learning.
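
The sketch below is a minimal conceptual illustration of the joint training loop described in the abstract, written for this summary rather than taken from the paper's code. An expert that sees the full state is rolled out in a DAgger-style mixture with a trainee that only sees a partial observation; the expert is then nudged in a direction intended to raise the trainee's return, and the trainee regresses onto the adapted expert. The toy environment, the function names, and the specific update rules are all illustrative assumptions, not the authors' implementation.

# Minimal conceptual sketch of an A2D-style loop (hypothetical names; toy problem).
import numpy as np

rng = np.random.default_rng(0)

def collect_rollouts(expert, trainee, beta, n=32):
    # Roll out a beta-mixture of expert (full state) and trainee (observation only),
    # so updates are computed under the trainee's own visitation distribution.
    states, obs, actions, rewards = [], [], [], []
    s = rng.normal(size=4)
    for _ in range(n):
        o = s[:2] + 0.1 * rng.normal(size=2)           # partial, noisy observation
        a = (expert @ s) if rng.random() < beta else (trainee @ o)
        r = -float(np.abs(a - s.sum()))                 # reward depends on the full state
        states.append(s); obs.append(o); actions.append(a); rewards.append(r)
        s = 0.9 * s + 0.1 * rng.normal(size=4)
    return (np.asarray(states), np.asarray(obs),
            np.asarray(actions), np.asarray(rewards))

def expert_update(expert, states, rewards, lr=1e-2):
    # Crude stand-in for the paper's expert update: move the expert so that
    # imitating it increases the trainee's expected reward.
    grad = (rewards[:, None] * states).mean(axis=0)
    return expert + lr * grad

def trainee_update(trainee, expert, states, obs, lr=1e-1):
    # Supervised imitation: regress the trainee's action (from observations)
    # toward the adapted expert's action (from the full state).
    targets = states @ expert
    preds = obs @ trainee
    grad = ((targets - preds)[:, None] * obs).mean(axis=0)
    return trainee + lr * grad

expert = rng.normal(size=4)     # acts on the full (asymmetric) state
trainee = rng.normal(size=2)    # acts on the partial observation only
for k in range(200):
    beta = 0.5 ** k             # anneal expert control, DAgger-style
    S, O, A, R = collect_rollouts(expert, trainee, beta)
    expert = expert_update(expert, S, R)                 # adapt expert to the trainee's limits
    trainee = trainee_update(trainee, expert, S, O)      # trainee imitates the adapted expert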

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-warrington21a,
  title     = {Robust Asymmetric Learning in POMDPs},
  author    = {Warrington, Andrew and Lavington, Jonathan W and Scibior, Adam and Schmidt, Mark and Wood, Frank},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {11013--11023},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/warrington21a/warrington21a.pdf},
  url       = {https://proceedings.mlr.press/v139/warrington21a.html},
  abstract  = {Policies for partially observed Markov decision processes can be efficiently learned by imitating expert policies generated using asymmetric information. Unfortunately, existing approaches for this kind of imitation learning have a serious flaw: the expert does not know what the trainee cannot see, and as a result may encourage actions that are sub-optimal or unsafe under partial information. To address this issue, we derive an update which, when applied iteratively to an expert, maximizes the expected reward of the trainee's policy. Using this update, we construct a computationally efficient algorithm, adaptive asymmetric DAgger (A2D), that jointly trains the expert and trainee policies. We then show that A2D allows the trainee to safely imitate the modified expert, and outperforms policies learned either by imitating a fixed expert or through direct reinforcement learning.}
}
Endnote
%0 Conference Paper
%T Robust Asymmetric Learning in POMDPs
%A Andrew Warrington
%A Jonathan W Lavington
%A Adam Scibior
%A Mark Schmidt
%A Frank Wood
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-warrington21a
%I PMLR
%P 11013--11023
%U https://proceedings.mlr.press/v139/warrington21a.html
%V 139
%X Policies for partially observed Markov decision processes can be efficiently learned by imitating expert policies generated using asymmetric information. Unfortunately, existing approaches for this kind of imitation learning have a serious flaw: the expert does not know what the trainee cannot see, and as a result may encourage actions that are sub-optimal or unsafe under partial information. To address this issue, we derive an update which, when applied iteratively to an expert, maximizes the expected reward of the trainee's policy. Using this update, we construct a computationally efficient algorithm, adaptive asymmetric DAgger (A2D), that jointly trains the expert and trainee policies. We then show that A2D allows the trainee to safely imitate the modified expert, and outperforms policies learned either by imitating a fixed expert or through direct reinforcement learning.
APA
Warrington, A., Lavington, J.W., Scibior, A., Schmidt, M. & Wood, F. (2021). Robust Asymmetric Learning in POMDPs. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:11013-11023. Available from https://proceedings.mlr.press/v139/warrington21a.html.
