Enhancing Inverse Reinforcement Learning through Encoding Dynamic Information in Reward Shaping

Simon Sinong Zhan, Philip Wang, Qingyuan Wu, Ruochen Jiao, Yixuan Wang, Chao Huang, Qi Zhu
Proceedings of The 8th Annual Learning for Dynamics and Control Conference, PMLR 331:2171-2206, 2026.

Abstract

Adversarial-based inverse reinforcement learning (IRL) has shown promising results using reward shaping under deterministic settings. However, it struggles in stochastic environments where existing theoretical results no longer apply, leading to degraded performance. To address this issue, we propose a novel maximum causal entropy based off-policy IRL method with transition-aware reward shaping framework. Our method integrates transition model estimation directly to learn stochastic-invariant rewards. We conduct a thorough theoretical analysis, establishing bounds on reward error and performance differences to validate the effectiveness of our method. The experimental results in continuous locomotion tasks (MuJoCo) show that our method can achieve superior performance in stochastic environments and competitive performance in deterministic environments, with significant improvement in sample efficiency, compared to existing baselines. Additionally, we extend our framework to high-dimensional vision-based tasks, where our method shows promising results on multiple stochastic Atari games. These results demonstrate that embedding transition awareness into reward learning is critical for robust IRL in realistic stochastic settings.

Cite this Paper


BibTeX
@InProceedings{pmlr-v331-zhan26a,
  title = {Enhancing Inverse Reinforcement Learning through Encoding Dynamic Information in Reward Shaping},
  author = {Zhan, Simon Sinong and Wang, Philip and Wu, Qingyuan and Jiao, Ruochen and Wang, Yixuan and Huang, Chao and Zhu, Qi},
  booktitle = {Proceedings of The 8th Annual Learning for Dynamics and Control Conference},
  pages = {2171--2206},
  year = {2026},
  editor = {Sukhatme, Gaurav and Lindemann, Lars and Tu, Stephen and Wierman, Adam and Atanasov, Nikolay},
  volume = {331},
  series = {Proceedings of Machine Learning Research},
  month = {17--19 Jun},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v331/main/assets/zhan26a/zhan26a.pdf},
  url = {https://proceedings.mlr.press/v331/zhan26a.html},
  abstract = {Adversarial-based inverse reinforcement learning (IRL) has shown promising results using reward shaping under deterministic settings. However, it struggles in stochastic environments where existing theoretical results no longer apply, leading to degraded performance. To address this issue, we propose a novel maximum causal entropy based off-policy IRL method with transition-aware reward shaping framework. Our method integrates transition model estimation directly to learn stochastic-invariant rewards. We conduct a thorough theoretical analysis, establishing bounds on reward error and performance differences to validate the effectiveness of our method. The experimental results in continuous locomotion tasks (MuJoCo) show that our method can achieve superior performance in stochastic environments and competitive performance in deterministic environments, with significant improvement in sample efficiency, compared to existing baselines. Additionally, we extend our framework to high-dimensional vision-based tasks, where our method shows promising results on multiple stochastic Atari games. These results demonstrate that embedding transition awareness into reward learning is critical for robust IRL in realistic stochastic settings.}
}
Endnote
%0 Conference Paper
%T Enhancing Inverse Reinforcement Learning through Encoding Dynamic Information in Reward Shaping
%A Simon Sinong Zhan
%A Philip Wang
%A Qingyuan Wu
%A Ruochen Jiao
%A Yixuan Wang
%A Chao Huang
%A Qi Zhu
%B Proceedings of The 8th Annual Learning for Dynamics and Control Conference
%C Proceedings of Machine Learning Research
%D 2026
%E Gaurav Sukhatme
%E Lars Lindemann
%E Stephen Tu
%E Adam Wierman
%E Nikolay Atanasov
%F pmlr-v331-zhan26a
%I PMLR
%P 2171--2206
%U https://proceedings.mlr.press/v331/zhan26a.html
%V 331
%X Adversarial-based inverse reinforcement learning (IRL) has shown promising results using reward shaping under deterministic settings. However, it struggles in stochastic environments where existing theoretical results no longer apply, leading to degraded performance. To address this issue, we propose a novel maximum causal entropy based off-policy IRL method with transition-aware reward shaping framework. Our method integrates transition model estimation directly to learn stochastic-invariant rewards. We conduct a thorough theoretical analysis, establishing bounds on reward error and performance differences to validate the effectiveness of our method. The experimental results in continuous locomotion tasks (MuJoCo) show that our method can achieve superior performance in stochastic environments and competitive performance in deterministic environments, with significant improvement in sample efficiency, compared to existing baselines. Additionally, we extend our framework to high-dimensional vision-based tasks, where our method shows promising results on multiple stochastic Atari games. These results demonstrate that embedding transition awareness into reward learning is critical for robust IRL in realistic stochastic settings.
APA
Zhan, S.S., Wang, P., Wu, Q., Jiao, R., Wang, Y., Huang, C. & Zhu, Q. (2026). Enhancing Inverse Reinforcement Learning through Encoding Dynamic Information in Reward Shaping. Proceedings of The 8th Annual Learning for Dynamics and Control Conference, in Proceedings of Machine Learning Research 331:2171-2206. Available from https://proceedings.mlr.press/v331/zhan26a.html.
