Learning from Suboptimal Demonstration via Self-Supervised Reward Regression

Letian Chen; Rohan Paleja; Matthew Gombolay

Proceedings of the 2020 Conference on Robot Learning, PMLR 155:1262-1277, 2021.

Abstract

Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform a task by providing a human demonstration. However, modern LfD techniques, e.g. inverse reinforcement learning (IRL), assume users provide at least stochastically optimal demonstrations. This assumption fails to hold in most real-world scenarios. Recent attempts to learn from suboptimal demonstration leverage pairwise rankings and follow the Luce-Shepard rule. However, we show these approaches make incorrect assumptions and thus suffer from brittle, degraded performance. We overcome these limitations by developing a novel approach that bootstraps off suboptimal demonstrations to synthesize optimality-parameterized data to train an idealized reward function. We empirically validate that we learn an idealized reward function with 0.95 correlation with ground-truth reward versus 0.75 for prior work. We can then train policies achieving 200% improvement over the suboptimal demonstration and 90% improvement over prior work. We present a physical demonstration of teaching a robot a topspin strike in table tennis that achieves 32% faster returns and 40% more topspin than user demonstration.
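
The abstract refers to two quantities: a Luce-Shepard style preference between ranked demonstrations, and the correlation between a learned reward and the ground-truth reward. The sketch below illustrates both under stated assumptions and is not the authors' implementation. It assumes the common exponential (softmax) form of the Luce-Shepard rule over trajectory returns and uses the Pearson correlation. The function names `luce_shepard_preference` and `pearson_correlation` are hypothetical.

```python
import math
from typing import Sequence


def luce_shepard_preference(return_a: float, return_b: float) -> float:
    """Probability that trajectory A is preferred over trajectory B.

    Assumes the exponential (softmax) form of the Luce-Shepard rule:
    P(A > B) = exp(R_A) / (exp(R_A) + exp(R_B)).
    """
    # Subtract the larger return before exponentiating for numerical stability.
    shift = max(return_a, return_b)
    exp_a = math.exp(return_a - shift)
    exp_b = math.exp(return_b - shift)
    return exp_a / (exp_a + exp_b)


def pearson_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences of returns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Under this assumed form, two trajectories with equal returns are each preferred with probability 0.5, and the preference probabilities for a pair always sum to 1. A learned reward that is a positive linear function of the ground-truth reward has a correlation of 1.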

Cite this Paper

BibTeX


@InProceedings{pmlr-v155-chen21b,
  title = {Learning from Suboptimal Demonstration via Self-Supervised Reward Regression},
  author = {Chen, Letian and Paleja, Rohan and Gombolay, Matthew},
  booktitle = {Proceedings of the 2020 Conference on Robot Learning},
  pages = {1262--1277},
  year = {2021},
  editor = {Kober, Jens and Ramos, Fabio and Tomlin, Claire},
  volume = {155},
  series = {Proceedings of Machine Learning Research},
  month = {16--18 Nov},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v155/chen21b/chen21b.pdf},
  url = {https://proceedings.mlr.press/v155/chen21b.html},
  abstract = {Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform a task by providing a human demonstration. However, modern LfD techniques, e.g. inverse reinforcement learning (IRL), assume users provide at least stochastically optimal demonstrations. This assumption fails to hold in most real-world scenarios. Recent attempts to learn from suboptimal demonstration leverage pairwise rankings and follow the Luce-Shepard rule. However, we show these approaches make incorrect assumptions and thus suffer from brittle, degraded performance. We overcome these limitations by developing a novel approach that bootstraps off suboptimal demonstrations to synthesize optimality-parameterized data to train an idealized reward function. We empirically validate that we learn an idealized reward function with 0.95 correlation with ground-truth reward versus 0.75 for prior work. We can then train policies achieving 200% improvement over the suboptimal demonstration and 90% improvement over prior work. We present a physical demonstration of teaching a robot a topspin strike in table tennis that achieves 32% faster returns and 40% more topspin than user demonstration.}
}

Endnote

%0 Conference Paper
%T Learning from Suboptimal Demonstration via Self-Supervised Reward Regression
%A Letian Chen
%A Rohan Paleja
%A Matthew Gombolay
%B Proceedings of the 2020 Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Jens Kober
%E Fabio Ramos
%E Claire Tomlin
%F pmlr-v155-chen21b
%I PMLR
%P 1262--1277
%U https://proceedings.mlr.press/v155/chen21b.html
%V 155
%X Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform a task by providing a human demonstration. However, modern LfD techniques, e.g. inverse reinforcement learning (IRL), assume users provide at least stochastically optimal demonstrations. This assumption fails to hold in most real-world scenarios. Recent attempts to learn from suboptimal demonstration leverage pairwise rankings and follow the Luce-Shepard rule. However, we show these approaches make incorrect assumptions and thus suffer from brittle, degraded performance. We overcome these limitations by developing a novel approach that bootstraps off suboptimal demonstrations to synthesize optimality-parameterized data to train an idealized reward function. We empirically validate that we learn an idealized reward function with 0.95 correlation with ground-truth reward versus 0.75 for prior work. We can then train policies achieving 200% improvement over the suboptimal demonstration and 90% improvement over prior work. We present a physical demonstration of teaching a robot a topspin strike in table tennis that achieves 32% faster returns and 40% more topspin than user demonstration.

APA


Chen, L., Paleja, R. & Gombolay, M. (2021). Learning from Suboptimal Demonstration via Self-Supervised Reward Regression. Proceedings of the 2020 Conference on Robot Learning, in Proceedings of Machine Learning Research 155:1262-1277. Available from https://proceedings.mlr.press/v155/chen21b.html.
