Safe Decision Transformer with Learning-based Constraints
Proceedings of the 7th Annual Learning for Dynamics & Control Conference, PMLR 283:245-258, 2025.
Abstract
In safe offline reinforcement learning (RL), the objective is to use offline data to train a policy that maximizes long-term rewards while adhering to safety constraints. Recent work, such as the Constrained Decision Transformer (CDT) (Liu et al., 2023b), has leveraged the Transformer architecture (Vaswani et al., 2017) to build a safe RL agent capable of dynamically adjusting the balance between safety and task rewards. However, like other Transformer-based RL agents such as the Decision Transformer (DT) (Chen et al., 2021), CDT often lacks the stitching ability needed to produce policies that improve on those present in the offline dataset. We introduce the Constrained Q-learning Decision Transformer (CQDT) to address this issue. At the core of our approach is a novel trajectory relabeling scheme that uses learned value functions, with careful consideration of the trade-off between safety and cumulative rewards. Experimental results show that our proposed algorithm outperforms several baselines across a variety of safe offline RL benchmarks.
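To give a concrete sense of what value-based trajectory relabeling can look like, the sketch below is a minimal illustrative example only, not the authors' implementation: the names `relabel_returns_to_go`, `v_reward`, and `v_cost` are hypothetical stand-ins for learned value functions, and the optimistic-reward / conservative-cost rule is an assumption about one way the safety-reward trade-off might be handled when rewriting the conditioning targets of a Decision Transformer-style agent.

```python
import numpy as np

def relabel_returns_to_go(states, rewards, costs, v_reward, v_cost, gamma=1.0):
    """Illustrative sketch (assumed, not the paper's algorithm):
    replace Monte-Carlo returns-to-go with value-function estimates so that
    conditioning targets can exceed what the behavior policy achieved (stitching),
    while cost targets are relabeled conservatively to respect safety constraints."""
    T = len(rewards)
    rtg = np.zeros(T)  # reward returns-to-go used to condition the transformer
    ctg = np.zeros(T)  # cost (safety) returns-to-go
    running_r, running_c = 0.0, 0.0
    for t in reversed(range(T)):
        running_r = rewards[t] + gamma * running_r
        running_c = costs[t] + gamma * running_c
        # Optimistic for reward: take the larger of the observed return and the
        # learned value estimate, enabling targets beyond the dataset trajectories.
        rtg[t] = max(running_r, v_reward(states[t]))
        # Conservative for cost: take the larger (worst-case) of the observed cost
        # return and the learned cost-value estimate, so relabeled targets stay safe.
        ctg[t] = max(running_c, v_cost(states[t]))
    return rtg, ctg
```

A usage sketch would pass each dataset trajectory through this relabeling step before training, so the sequence model is conditioned on the relabeled reward and cost returns-to-go rather than the raw Monte-Carlo returns.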