DROID: Learning from Offline Heterogeneous Demonstrations via Reward-Policy Distillation

Sravan Jayanthi, Letian Chen, Nadya Balabanska, Van Duong, Erik Scarlatescu, Ezra Ameperosa, Zulfiqar Haider Zaidi, Daniel Martin, Taylor Keith Del Matto, Masahiro Ono, Matthew Gombolay
Proceedings of The 7th Conference on Robot Learning, PMLR 229:1547-1571, 2023.

Abstract

Offline Learning from Demonstrations (OLfD) is valuable in domains where trial-and-error learning is infeasible or specifying a cost function is difficult, such as robotic surgery, autonomous driving, and path-finding for NASA’s Mars rovers. However, two key problems remain challenging in OLfD: 1) heterogeneity: demonstration data can be generated with diverse preferences and strategies, and 2) generalizability: the learned policy and reward must perform well beyond a limited training regime in unseen test settings. To overcome these challenges, we propose Dual Reward and policy Offline Inverse Distillation (DROID), where the key idea is to leverage diversity to improve generalization performance by decomposing common-task and individual-specific strategies and distilling knowledge in both the reward and policy spaces. We ground DROID in a novel and uniquely challenging Mars rover path-planning problem for NASA’s Mars Curiosity Rover. We also curate a novel dataset spanning 163 Sols (Martian days) and conduct a novel empirical investigation to characterize heterogeneity in the dataset. We find DROID outperforms prior SOTA OLfD techniques, yielding a 26% improvement in modeling expert behaviors and results 92% closer to the task objective of reaching the final destination. We also benchmark DROID on the OpenAI Gym Cartpole environment and find DROID achieves a significant 55% improvement in modeling heterogeneous demonstrations.
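To make the decomposition idea in the abstract concrete, below is a minimal, illustrative Python sketch of a reward model split into a shared "common-task" head and per-demonstrator "individual" heads, with a simple penalty keeping the individual heads close to the common one. The class names, network sizes, and the L2 distillation term are assumptions made for illustration only; they are not DROID's actual formulation, which is given in the paper.

# Minimal sketch (not the authors' code): a shared common-task reward head plus
# per-demonstrator individual heads, with a distillation penalty that pulls each
# individual head toward the shared one. All names and sizes are illustrative.
import torch
import torch.nn as nn

class DecomposedReward(nn.Module):
    def __init__(self, obs_dim: int, n_demonstrators: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.common_head = nn.Linear(hidden, 1)      # shared task reward
        self.individual_heads = nn.ModuleList(       # strategy-specific rewards
            [nn.Linear(hidden, 1) for _ in range(n_demonstrators)]
        )

    def forward(self, obs: torch.Tensor, demonstrator: int) -> torch.Tensor:
        z = self.encoder(obs)
        # Reward for one demonstrator = common reward + individual residual.
        return self.common_head(z) + self.individual_heads[demonstrator](z)

    def distillation_loss(self) -> torch.Tensor:
        # Hypothetical L2 penalty keeping individual heads near the common head.
        return sum(
            ((h.weight - self.common_head.weight) ** 2).sum()
            + ((h.bias - self.common_head.bias) ** 2).sum()
            for h in self.individual_heads
        )

# Usage: score a batch of states for one demonstrator and regularize heterogeneity.
reward_fn = DecomposedReward(obs_dim=8, n_demonstrators=3)
r = reward_fn(torch.randn(16, 8), demonstrator=1)
loss = -r.mean() + 1e-3 * reward_fn.distillation_loss()

An analogous common/individual split can be applied on the policy side; the paper describes how knowledge is distilled in both the reward and policy spaces.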

Cite this Paper


BibTeX
@InProceedings{pmlr-v229-jayanthi23a,
  title     = {DROID: Learning from Offline Heterogeneous Demonstrations via Reward-Policy Distillation},
  author    = {Jayanthi, Sravan and Chen, Letian and Balabanska, Nadya and Duong, Van and Scarlatescu, Erik and Ameperosa, Ezra and Zaidi, Zulfiqar Haider and Martin, Daniel and Matto, Taylor Keith Del and Ono, Masahiro and Gombolay, Matthew},
  booktitle = {Proceedings of The 7th Conference on Robot Learning},
  pages     = {1547--1571},
  year      = {2023},
  editor    = {Tan, Jie and Toussaint, Marc and Darvish, Kourosh},
  volume    = {229},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v229/jayanthi23a/jayanthi23a.pdf},
  url       = {https://proceedings.mlr.press/v229/jayanthi23a.html},
  abstract  = {Offline Learning from Demonstrations (OLfD) is valuable in domains where trial-and-error learning is infeasible or specifying a cost function is difficult, such as robotic surgery, autonomous driving, and path-finding for NASA’s Mars rovers. However, two key problems remain challenging in OLfD: 1) heterogeneity: demonstration data can be generated with diverse preferences and strategies, and 2) generalizability: the learned policy and reward must perform well beyond a limited training regime in unseen test settings. To overcome these challenges, we propose Dual Reward and policy Offline Inverse Distillation (DROID), where the key idea is to leverage diversity to improve generalization performance by decomposing common-task and individual-specific strategies and distilling knowledge in both the reward and policy spaces. We ground DROID in a novel and uniquely challenging Mars rover path-planning problem for NASA’s Mars Curiosity Rover. We also curate a novel dataset along 163 Sols (Martian days) and conduct a novel, empirical investigation to characterize heterogeneity in the dataset. We find DROID outperforms prior SOTA OLfD techniques, leading to a $26\%$ improvement in modeling expert behaviors and $92\%$ closer to the task objective of reaching the final destination. We also benchmark DROID on the OpenAI Gym Cartpole environment and find DROID achieves $55\%$ (significantly) better performance modeling heterogeneous demonstrations.}
}
Endnote
%0 Conference Paper
%T DROID: Learning from Offline Heterogeneous Demonstrations via Reward-Policy Distillation
%A Sravan Jayanthi
%A Letian Chen
%A Nadya Balabanska
%A Van Duong
%A Erik Scarlatescu
%A Ezra Ameperosa
%A Zulfiqar Haider Zaidi
%A Daniel Martin
%A Taylor Keith Del Matto
%A Masahiro Ono
%A Matthew Gombolay
%B Proceedings of The 7th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Jie Tan
%E Marc Toussaint
%E Kourosh Darvish
%F pmlr-v229-jayanthi23a
%I PMLR
%P 1547--1571
%U https://proceedings.mlr.press/v229/jayanthi23a.html
%V 229
%X Offline Learning from Demonstrations (OLfD) is valuable in domains where trial-and-error learning is infeasible or specifying a cost function is difficult, such as robotic surgery, autonomous driving, and path-finding for NASA’s Mars rovers. However, two key problems remain challenging in OLfD: 1) heterogeneity: demonstration data can be generated with diverse preferences and strategies, and 2) generalizability: the learned policy and reward must perform well beyond a limited training regime in unseen test settings. To overcome these challenges, we propose Dual Reward and policy Offline Inverse Distillation (DROID), where the key idea is to leverage diversity to improve generalization performance by decomposing common-task and individual-specific strategies and distilling knowledge in both the reward and policy spaces. We ground DROID in a novel and uniquely challenging Mars rover path-planning problem for NASA’s Mars Curiosity Rover. We also curate a novel dataset along 163 Sols (Martian days) and conduct a novel, empirical investigation to characterize heterogeneity in the dataset. We find DROID outperforms prior SOTA OLfD techniques, leading to a 26% improvement in modeling expert behaviors and 92% closer to the task objective of reaching the final destination. We also benchmark DROID on the OpenAI Gym Cartpole environment and find DROID achieves 55% (significantly) better performance modeling heterogeneous demonstrations.
APA
Jayanthi, S., Chen, L., Balabanska, N., Duong, V., Scarlatescu, E., Ameperosa, E., Zaidi, Z.H., Martin, D., Matto, T.K.D., Ono, M. & Gombolay, M. (2023). DROID: Learning from Offline Heterogeneous Demonstrations via Reward-Policy Distillation. Proceedings of The 7th Conference on Robot Learning, in Proceedings of Machine Learning Research 229:1547-1571. Available from https://proceedings.mlr.press/v229/jayanthi23a.html.
