Markov Decision Processes with Continuous Side Information

Aditya Modi, Nan Jiang, Satinder Singh, Ambuj Tewari
Proceedings of Algorithmic Learning Theory, PMLR 83:597-618, 2018.

Abstract

We consider a reinforcement learning (RL) setting in which the agent interacts with a sequence of episodic MDPs. At the start of each episode the agent has access to some side-information or context that determines the dynamics of the MDP for that episode. Our setting is motivated by applications in healthcare where baseline measurements of a patient at the start of a treatment episode form the context that may provide information about how the patient might respond to treatment decisions. We propose algorithms for learning in such Contextual Markov Decision Processes (CMDPs) under an assumption that the unobserved MDP parameters vary smoothly with the observed context. We give lower and upper PAC bounds under the smoothness assumption. Because our lower bound has an exponential dependence on the dimension, we also consider a tractable linear setting where the context creates linear combinations of a finite set of MDPs. For the linear setting, we give a PAC learning algorithm based on KWIK learning techniques.
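
To make the two settings concrete (this sketch is illustrative, not code from the paper): in the smooth setting the MDP parameters are assumed to vary smoothly, roughly a Lipschitz condition of the form ||P(.|s,a,x) - P(.|s,a,x')||_1 <= L ||x - x'||; in the linear setting the context x is a weight vector over d base MDPs and the episode's transition kernel is the corresponding combination P(s'|s,a,x) = sum_i x_i P_i(s'|s,a). Names such as n_states and base are hypothetical.

    import numpy as np

    # Illustrative sketch of the linear CMDP setting (assumed construction,
    # not the paper's code): context x weights d base MDPs, and the episode's
    # transition kernel is P(s'|s,a,x) = sum_i x_i * P_i(s'|s,a).

    rng = np.random.default_rng(0)
    n_states, n_actions, d = 4, 2, 3  # hypothetical sizes

    # d base transition kernels, shape (d, S, A, S); each (s, a) row sums to 1.
    base = rng.dirichlet(np.ones(n_states), size=(d, n_states, n_actions))

    def episode_transitions(x):
        """Transition kernel for an episode with observed context x (x >= 0, sums to 1)."""
        return np.tensordot(x, base, axes=1)  # contracts over the d base MDPs -> (S, A, S)

    x = np.array([0.5, 0.3, 0.2])   # context revealed at the start of the episode
    P = episode_transitions(x)
    assert np.allclose(P.sum(axis=-1), 1.0)  # convex combination is still a valid kernel

A convex combination of stochastic kernels is itself stochastic, which is why the linear setting stays well-defined for any context on the simplex; it also satisfies the smoothness condition automatically, since the combination is linear in x.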

Cite this Paper


BibTeX
@InProceedings{pmlr-v83-modi18a,
  title     = {Markov Decision Processes with Continuous Side Information},
  author    = {Modi, Aditya and Jiang, Nan and Singh, Satinder and Tewari, Ambuj},
  booktitle = {Proceedings of Algorithmic Learning Theory},
  pages     = {597--618},
  year      = {2018},
  editor    = {Janoos, Firdaus and Mohri, Mehryar and Sridharan, Karthik},
  volume    = {83},
  series    = {Proceedings of Machine Learning Research},
  month     = {07--09 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v83/modi18a/modi18a.pdf},
  url       = {https://proceedings.mlr.press/v83/modi18a.html}
}
Endnote
%0 Conference Paper
%T Markov Decision Processes with Continuous Side Information
%A Aditya Modi
%A Nan Jiang
%A Satinder Singh
%A Ambuj Tewari
%B Proceedings of Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2018
%E Firdaus Janoos
%E Mehryar Mohri
%E Karthik Sridharan
%F pmlr-v83-modi18a
%I PMLR
%P 597--618
%U https://proceedings.mlr.press/v83/modi18a.html
%V 83
APA
Modi, A., Jiang, N., Singh, S., & Tewari, A. (2018). Markov Decision Processes with Continuous Side Information. Proceedings of Algorithmic Learning Theory, in Proceedings of Machine Learning Research 83:597-618. Available from https://proceedings.mlr.press/v83/modi18a.html.