# Designing Optimal Dynamic Treatment Regimes: A Causal Reinforcement Learning Approach

*Proceedings of the 37th International Conference on Machine Learning*, PMLR 119:11012-11022, 2020.

#### Abstract

A dynamic treatment regime (DTR) consists of a sequence of decision rules, one per stage of intervention, that determines how treatments are assigned to patients based on the evolving history of treatments and covariates. These regimes are particularly effective for managing chronic disorders and are arguably one of the critical ingredients underlying more personalized decision-making systems. All reinforcement learning algorithms for finding the optimal DTR in online settings will suffer O(\sqrt{|D_{X, S}|T}) regret on some environments, where T is the number of experiments and D_{X, S} is the domain of treatments X and covariates S. This implies that T = O(|D_{X, S}|) trials are needed to generate an optimal DTR. In many applications, the domains of X and S can be so large that the time required to ensure appropriate learning is unattainable. We show that, if the causal diagram of the underlying environment is provided, one can achieve regret that is exponentially smaller than |D_{X, S}|. In particular, we develop two online algorithms that satisfy such regret bounds by exploiting the causal structure underlying the DTR: one is based on the principle of optimism in the face of uncertainty (OFU-DTR), and the other uses posterior sampling (PS-DTR). Finally, we introduce efficient methods to accelerate these online learning procedures by leveraging abundant, yet biased, observational (non-experimental) data.
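
The step from the regret rate to the number of trials can be sketched as follows. This is a back-of-the-envelope reading of the abstract, not the paper's own argument: the symbols R(T) for cumulative regret, \epsilon for a target average regret per trial, and n for a count of binary variables are assumptions introduced here for illustration, and constants and logarithmic factors are ignored.

```latex
% Assumed notation (not from the abstract): R(T) = cumulative regret after T trials,
% \epsilon = target average regret per trial, n = number of binary variables.
\begin{align*}
  R(T) &\approx \sqrt{|D_{X,S}|\,T}
    && \text{(regret rate stated in the abstract)} \\
  \frac{R(T)}{T} &\approx \sqrt{\frac{|D_{X,S}|}{T}}
    && \text{(average regret per trial)} \\
  \frac{R(T)}{T} \le \epsilon
    &\iff T \ge \frac{|D_{X,S}|}{\epsilon^{2}}
    && \text{(trials grow linearly in } |D_{X,S}| \text{)} \\
  |D_{X,S}| &= 2^{n}
    && \text{(illustrative case: } n \text{ binary treatments and covariates)}
\end{align*}
```

Under these assumptions, the required number of trials scales with the size of the joint domain, which itself grows exponentially in the number of variables. This is why a regret bound that depends on a much smaller quantity than |D_{X, S}| matters in practice.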