Policy Gradient Bayesian Robust Optimization for Imitation Learning

Zaynah Javed; Daniel S Brown; Satvik Sharma; Jerry Zhu; Ashwin Balakrishna; Marek Petrik; Anca Dragan; Ken Goldberg

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:4785-4796, 2021.

Abstract

The difficulty in specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, there are often many different reward functions that explain the human feedback, leaving agents with uncertainty over what the true reward function is. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy gradient-style robust optimization approach, PG-BROIL, that optimizes a soft-robust objective that balances expected performance and risk. To the best of our knowledge, PG-BROIL is the first policy optimization algorithm robust to a distribution of reward hypotheses which can scale to continuous MDPs. Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator’s reward function.
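As a rough illustration of what a soft-robust objective over a distribution of reward hypotheses can look like, the sketch below blends the expected return with a tail-risk measure. The abstract does not name the risk measure; conditional value at risk (CVaR) is used here as an assumption, and the names `cvar`, `soft_robust`, `lam`, and `alpha` are hypothetical. This is a minimal sketch for a discrete set of hypotheses, not the paper's implementation.

```python
import numpy as np


def cvar(values, probs, alpha):
    """CVaR of a discrete distribution: mean of the worst (1 - alpha) probability mass.

    Assumes 0 <= alpha < 1 and that probs sums to 1.
    """
    order = np.argsort(values)              # worst (lowest) returns first
    v = np.asarray(values, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    tail = 1.0 - alpha
    remaining = tail
    total = 0.0
    for vi, pi in zip(v, p):
        w = min(pi, remaining)              # take only the mass that fits in the tail
        total += w * vi
        remaining -= w
        if remaining <= 1e-12:
            break
    return total / tail


def soft_robust(values, probs, lam, alpha):
    """Blend of expected return and CVaR; lam=1 is risk-neutral, lam=0 is fully risk-averse."""
    expected = float(np.dot(values, probs))
    return lam * expected + (1.0 - lam) * cvar(values, probs, alpha)


# Hypothetical example: a policy's return under four equally likely reward hypotheses.
returns = [1.0, 2.0, 3.0, 4.0]
posterior = [0.25, 0.25, 0.25, 0.25]
for lam in (1.0, 0.5, 0.0):
    print(lam, soft_robust(returns, posterior, lam, alpha=0.5))
```

Sweeping `lam` from 1 to 0 moves the objective from the plain expectation toward the tail average, which is one way to read the "family of behaviors ranging from risk-neutral to risk-averse" described above.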

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-javed21a,
  title = 	 {Policy Gradient Bayesian Robust Optimization for Imitation Learning},
  author =       {Javed, Zaynah and Brown, Daniel S and Sharma, Satvik and Zhu, Jerry and Balakrishna, Ashwin and Petrik, Marek and Dragan, Anca and Goldberg, Ken},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {4785--4796},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/javed21a/javed21a.pdf},
  url = 	 {https://proceedings.mlr.press/v139/javed21a.html},
  abstract = 	 {The difficulty in specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, there are often many different reward functions that explain the human feedback, leaving agents with uncertainty over what the true reward function is. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy gradient-style robust optimization approach, PG-BROIL, that optimizes a soft-robust objective that balances expected performance and risk. To the best of our knowledge, PG-BROIL is the first policy optimization algorithm robust to a distribution of reward hypotheses which can scale to continuous MDPs. Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator’s reward function.}
}

Endnote

%0 Conference Paper
%T Policy Gradient Bayesian Robust Optimization for Imitation Learning
%A Zaynah Javed
%A Daniel S Brown
%A Satvik Sharma
%A Jerry Zhu
%A Ashwin Balakrishna
%A Marek Petrik
%A Anca Dragan
%A Ken Goldberg
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-javed21a
%I PMLR
%P 4785--4796
%U https://proceedings.mlr.press/v139/javed21a.html
%V 139
%X The difficulty in specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, there are often many different reward functions that explain the human feedback, leaving agents with uncertainty over what the true reward function is. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy gradient-style robust optimization approach, PG-BROIL, that optimizes a soft-robust objective that balances expected performance and risk. To the best of our knowledge, PG-BROIL is the first policy optimization algorithm robust to a distribution of reward hypotheses which can scale to continuous MDPs. Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator’s reward function.

APA

Javed, Z., Brown, D.S., Sharma, S., Zhu, J., Balakrishna, A., Petrik, M., Dragan, A. & Goldberg, K. (2021). Policy Gradient Bayesian Robust Optimization for Imitation Learning. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:4785-4796. Available from https://proceedings.mlr.press/v139/javed21a.html.
