SpOiLer: Offline reinforcement learning using scaled penalties

Padmanaba Srinivasan, William J. Knottenbelt
Proceedings of the 6th Annual Learning for Dynamics & Control Conference, PMLR 242:825-838, 2024.

Abstract

Offline Reinforcement Learning (RL) is a variant of off-policy learning where an optimal policy must be learned from a static dataset containing trajectories collected by an unknown behavior policy. In the offline setting, standard off-policy algorithms will overestimate values of out-of-distribution actions and a policy trained naively in this way will perform poorly in the environment due to distribution shift between the implied and real environment; this is especially likely when modelling complex and multi-modal data distributions. We propose Scaled-penalty Offline Learning (SpOiLer), an offline reinforcement learning algorithm that reduces the value of out-of-distribution actions relative to observed actions. The resultant pessimistic value function is a lower bound of the true value function and manipulates the policy towards selecting actions present in the dataset. Our method is a simple augmentation to the standard Bellman backup operator and implementation requires around 15 additional lines of code over soft actor-critic. We provide theoretical insights into how SpOiLer operates under the hood and show empirically that SpOiLer achieves remarkable performance against prior methods on a range of tasks.
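
To make the idea concrete, the sketch below shows what a scaled pessimism penalty of this general flavor might look like in a SAC-style critic update: Q-values of actions sampled from the current policy (likely out-of-distribution) are pushed down relative to Q-values of dataset actions, in the spirit of conservative Q-learning-style penalties. This is an illustrative assumption, not the paper's exact operator; the penalty form, the scale alpha_pen, and the helper names (QNetwork, penalized_critic_loss) are hypothetical.

# Illustrative sketch only: a penalized SAC-style critic loss, not the
# paper's exact SpOiLer backup. The Q-gap penalty and the scale alpha_pen
# are assumptions made here for illustration.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def penalized_critic_loss(q_net, q_target, policy, batch, gamma=0.99, alpha_pen=1.0):
    """Standard TD error plus a scaled penalty that lowers the value of
    policy-sampled (potentially out-of-distribution) actions relative to
    the actions observed in the offline dataset."""
    s, a, r, s_next, done = batch  # tensors from the static dataset; r, done shaped [batch, 1]

    with torch.no_grad():
        a_next = policy(s_next)  # next actions proposed by the current policy
        target = r + gamma * (1.0 - done) * q_target(s_next, a_next)

    td_loss = (q_net(s, a) - target).pow(2).mean()

    # Scaled penalty: push Q down on policy actions, up on dataset actions.
    a_pi = policy(s).detach()  # train only the critic here, not the policy
    penalty = q_net(s, a_pi).mean() - q_net(s, a).mean()

    return td_loss + alpha_pen * penalty

In the paper's setting the penalty is described as an augmentation of the Bellman backup that yields a lower bound on the true value function; the gap-style penalty above is only one simple way such pessimism is commonly realized.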

Cite this Paper

BibTeX
@InProceedings{pmlr-v242-srinivasan24a,
  title     = {{SpOiLer}: {O}ffline reinforcement learning using scaled penalties},
  author    = {Srinivasan, Padmanaba and Knottenbelt, William J.},
  booktitle = {Proceedings of the 6th Annual Learning for Dynamics \& Control Conference},
  pages     = {825--838},
  year      = {2024},
  editor    = {Abate, Alessandro and Cannon, Mark and Margellos, Kostas and Papachristodoulou, Antonis},
  volume    = {242},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--17 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v242/srinivasan24a/srinivasan24a.pdf},
  url       = {https://proceedings.mlr.press/v242/srinivasan24a.html},
  abstract  = {Offline Reinforcement Learning (RL) is a variant of off-policy learning where an optimal policy must be learned from a static dataset containing trajectories collected by an unknown behavior policy. In the offline setting, standard off-policy algorithms will overestimate values of out-of-distribution actions and a policy trained naively in this way will perform poorly in the environment due to distribution shift between the implied and real environment; this is especially likely when modelling complex and multi-modal data distributions. We propose Scaled-penalty Offline Learning (SpOiLer), an offline reinforcement learning algorithm that reduces the value of out-of-distribution actions relative to observed actions. The resultant pessimistic value function is a lower bound of the true value function and manipulates the policy towards selecting actions present in the dataset. Our method is a simple augmentation to the standard Bellman backup operator and implementation requires around 15 additional lines of code over soft actor-critic. We provide theoretical insights into how SpOiLer operates under the hood and show empirically that SpOiLer achieves remarkable performance against prior methods on a range of tasks.}
}
Endnote
%0 Conference Paper
%T SpOiLer: Offline reinforcement learning using scaled penalties
%A Padmanaba Srinivasan
%A William J. Knottenbelt
%B Proceedings of the 6th Annual Learning for Dynamics & Control Conference
%C Proceedings of Machine Learning Research
%D 2024
%E Alessandro Abate
%E Mark Cannon
%E Kostas Margellos
%E Antonis Papachristodoulou
%F pmlr-v242-srinivasan24a
%I PMLR
%P 825--838
%U https://proceedings.mlr.press/v242/srinivasan24a.html
%V 242
%X Offline Reinforcement Learning (RL) is a variant of off-policy learning where an optimal policy must be learned from a static dataset containing trajectories collected by an unknown behavior policy. In the offline setting, standard off-policy algorithms will overestimate values of out-of-distribution actions and a policy trained naively in this way will perform poorly in the environment due to distribution shift between the implied and real environment; this is especially likely when modelling complex and multi-modal data distributions. We propose Scaled-penalty Offline Learning (SpOiLer), an offline reinforcement learning algorithm that reduces the value of out-of-distribution actions relative to observed actions. The resultant pessimistic value function is a lower bound of the true value function and manipulates the policy towards selecting actions present in the dataset. Our method is a simple augmentation to the standard Bellman backup operator and implementation requires around 15 additional lines of code over soft actor-critic. We provide theoretical insights into how SpOiLer operates under the hood and show empirically that SpOiLer achieves remarkable performance against prior methods on a range of tasks.
APA
Srinivasan, P. & Knottenbelt, W.J. (2024). SpOiLer: Offline reinforcement learning using scaled penalties. Proceedings of the 6th Annual Learning for Dynamics & Control Conference, in Proceedings of Machine Learning Research 242:825-838. Available from https://proceedings.mlr.press/v242/srinivasan24a.html.