SpOiLer: Offline reinforcement learning using scaled penalties
Proceedings of the 6th Annual Learning for Dynamics & Control Conference, PMLR 242:825-838, 2024.
Abstract
Offline Reinforcement Learning (RL) is a variant of off-policy learning in which an optimal policy must be learned from a static dataset of trajectories collected by an unknown behavior policy. In the offline setting, standard off-policy algorithms overestimate the values of out-of-distribution actions, and a policy trained naively in this way performs poorly due to the distribution shift between the environment implied by the dataset and the real environment; this is especially likely when modelling complex, multi-modal data distributions. We propose Scaled-penalty Offline Learning (SpOiLer), an offline reinforcement learning algorithm that reduces the values of out-of-distribution actions relative to observed actions. The resulting pessimistic value function is a lower bound on the true value function and steers the policy towards selecting actions present in the dataset. Our method is a simple augmentation of the standard Bellman backup operator, and implementing it requires around 15 additional lines of code on top of soft actor-critic. We provide theoretical insights into how SpOiLer operates under the hood and show empirically that it achieves remarkable performance compared with prior methods on a range of tasks.
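To make the idea of a penalised Bellman backup concrete, the sketch below shows one plausible way a scaled penalty could be subtracted inside a SAC-style critic target. This is a minimal illustration under stated assumptions, not the paper's definition: the penalty form (squared distance between the policy's action and the dataset action), the hyperparameter `penalty_scale`, and the `policy.sample` API are all hypothetical.

```python
# Sketch: pessimistic Bellman target with a scaled penalty on out-of-distribution
# actions, layered on a soft actor-critic style backup. All names and the penalty
# form are illustrative assumptions, not SpOiLer's exact construction.
import torch
import torch.nn as nn


def pessimistic_target(critic_target: nn.Module,
                       policy: nn.Module,
                       reward: torch.Tensor,               # shape (B,)
                       next_obs: torch.Tensor,             # shape (B, obs_dim)
                       done: torch.Tensor,                 # shape (B,), 0/1 terminal flags
                       dataset_next_action: torch.Tensor,  # shape (B, act_dim), action from the dataset
                       gamma: float = 0.99,
                       alpha: float = 0.2,
                       penalty_scale: float = 5.0) -> torch.Tensor:
    """Compute r + gamma * (Q(s', a') - alpha * log pi(a'|s') - penalty),
    where the penalty grows as the policy's action strays from the action
    observed in the dataset (one simple proxy for being out-of-distribution)."""
    with torch.no_grad():
        # Assumed SAC-style API: sample returns (action, log_prob).
        next_action, next_log_prob = policy.sample(next_obs)
        q_next = critic_target(next_obs, next_action).squeeze(-1)

        # Scaled penalty: larger for actions far from those in the dataset,
        # pushing the target value of out-of-distribution actions down.
        ood_gap = (next_action - dataset_next_action).pow(2).sum(dim=-1)
        penalty = penalty_scale * ood_gap

        target = reward + gamma * (1.0 - done) * (q_next - alpha * next_log_prob - penalty)
    return target
```

Because the penalty only lowers the target, the learned critic remains a pessimistic (lower-bound) estimate of the true value, which is the mechanism the abstract describes for discouraging out-of-distribution actions.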