Safe Decision Transformer with Learning-based Constraints

Ruhan Wang, Dongruo Zhou
Proceedings of the 7th Annual Learning for Dynamics & Control Conference, PMLR 283:245-258, 2025.

Abstract

In the field of safe offline reinforcement learning (RL), the objective is to utilize offline data to train a policy that maximizes long-term rewards while adhering to safety constraints. Recent work, such as the Constrained Decision Transformer (CDT) (Liu et al., 2023b), has utilized the Transformer (Vaswani, 2017) architecture to build a safe RL agent that is capable of dynamically adjusting the balance between safety and task rewards. However, it often lacks the stitching ability to output policies that are better than those existing in the offline dataset, similar to other Transformer-based RL agents like the Decision Transformer (DT) (Chen et al., 2021). We introduce the Constrained Q-learning Decision Transformer (CQDT) to address this issue. At the core of our approach is a novel trajectory relabeling scheme that utilizes learned value functions, with careful consideration of the trade-off between safety and cumulative rewards. Experimental results show that our proposed algorithm outperforms several baselines across a variety of safe offline RL benchmarks.
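To make the idea of value-guided trajectory relabeling concrete, the sketch below shows one plausible way a return-to-go and cost-to-go relabeling step could use learned reward and cost critics. This is only an illustrative reading of the abstract, not the authors' CQDT implementation: the critics q_reward and q_cost, the trade-off weight alpha, and the rule for choosing between logged and estimated targets are all assumptions made for exposition.

import numpy as np

def relabel_trajectory(states, actions, rewards, costs, q_reward, q_cost, alpha=1.0):
    """Relabel return-to-go and cost-to-go targets along one logged trajectory.

    q_reward(s, a) and q_cost(s, a) are assumed pretrained critics that estimate
    cumulative reward and cumulative cost from a state-action pair; alpha is a
    hypothetical weight trading off safety against reward when picking targets.
    """
    T = len(rewards)
    returns_to_go = np.zeros(T)
    costs_to_go = np.zeros(T)
    future_return, future_cost = 0.0, 0.0
    for t in reversed(range(T)):
        # Monte Carlo targets computed from the logged trajectory itself.
        future_return += rewards[t]
        future_cost += costs[t]
        # Learned estimates of what could be achieved from (s_t, a_t).
        est_return = q_reward(states[t], actions[t])
        est_cost = q_cost(states[t], actions[t])
        # Keep whichever target is better on the safety-adjusted objective, so the
        # transformer can condition on better-than-dataset targets (stitching).
        if est_return - alpha * est_cost > future_return - alpha * future_cost:
            returns_to_go[t], costs_to_go[t] = est_return, est_cost
        else:
            returns_to_go[t], costs_to_go[t] = future_return, future_cost
    return returns_to_go, costs_to_go

The relabeled (return-to-go, cost-to-go) pairs would then replace the raw Monte Carlo targets when conditioning a CDT-style sequence model; the actual trade-off rule used by CQDT is described in the paper itself.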

Cite this Paper


BibTeX
@InProceedings{pmlr-v283-wang25a,
  title     = {Safe Decision Transformer with Learning-based Constraints},
  author    = {Wang, Ruhan and Zhou, Dongruo},
  booktitle = {Proceedings of the 7th Annual Learning for Dynamics \& Control Conference},
  pages     = {245--258},
  year      = {2025},
  editor    = {Ozay, Necmiye and Balzano, Laura and Panagou, Dimitra and Abate, Alessandro},
  volume    = {283},
  series    = {Proceedings of Machine Learning Research},
  month     = {04--06 Jun},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v283/main/assets/wang25a/wang25a.pdf},
  url       = {https://proceedings.mlr.press/v283/wang25a.html},
  abstract  = {In the field of safe offline reinforcement learning (RL), the objective is to utilize offline data to train a policy that maximizes long-term rewards while adhering to safety constraints. Recent work, such as the Constrained Decision Transformer (CDT) (Liu et al., 2023b), has utilized the Transformer (Vaswani, 2017) architecture to build a safe RL agent that is capable of dynamically adjusting the balance between safety and task rewards. However, it often lacks the stitching ability to output policies that are better than those existing in the offline dataset, similar to other Transformer-based RL agents like the Decision Transformer (DT) (Chen et al., 2021). We introduce the Constrained Q-learning Decision Transformer (CQDT) to address this issue. At the core of our approach is a novel trajectory relabeling scheme that utilizes learned value functions, with careful consideration of the trade-off between safety and cumulative rewards. Experimental results show that our proposed algorithm outperforms several baselines across a variety of safe offline RL benchmarks.}
}
Endnote
%0 Conference Paper
%T Safe Decision Transformer with Learning-based Constraints
%A Ruhan Wang
%A Dongruo Zhou
%B Proceedings of the 7th Annual Learning for Dynamics & Control Conference
%C Proceedings of Machine Learning Research
%D 2025
%E Necmiye Ozay
%E Laura Balzano
%E Dimitra Panagou
%E Alessandro Abate
%F pmlr-v283-wang25a
%I PMLR
%P 245--258
%U https://proceedings.mlr.press/v283/wang25a.html
%V 283
%X In the field of safe offline reinforcement learning (RL), the objective is to utilize offline data to train a policy that maximizes long-term rewards while adhering to safety constraints. Recent work, such as the Constrained Decision Transformer (CDT) (Liu et al., 2023b), has utilized the Transformer (Vaswani, 2017) architecture to build a safe RL agent that is capable of dynamically adjusting the balance between safety and task rewards. However, it often lacks the stitching ability to output policies that are better than those existing in the offline dataset, similar to other Transformer-based RL agents like the Decision Transformer (DT) (Chen et al., 2021). We introduce the Constrained Q-learning Decision Transformer (CQDT) to address this issue. At the core of our approach is a novel trajectory relabeling scheme that utilizes learned value functions, with careful consideration of the trade-off between safety and cumulative rewards. Experimental results show that our proposed algorithm outperforms several baselines across a variety of safe offline RL benchmarks.
APA
Wang, R. & Zhou, D.. (2025). Safe Decision Transformer with Learning-based Constraints. Proceedings of the 7th Annual Learning for Dynamics \& Control Conference, in Proceedings of Machine Learning Research 283:245-258 Available from https://proceedings.mlr.press/v283/wang25a.html.