Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models

Michael Oberst, David Sontag
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:4881-4890, 2019.

Abstract

We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see this as a useful procedure for off-policy “debugging” in high-risk settings (e.g., healthcare); by decomposing the expected difference in reward between the RL and observed policy into specific episodes, we can identify episodes where the counterfactual difference in reward is most dramatic. This in turn can be used to facilitate review of specific episodes by domain experts. We demonstrate the utility of this procedure with a synthetic environment of sepsis management.
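The abstract does not spell out how a Gumbel-Max SCM produces a counterfactual, so the following is only a minimal sketch of the general idea for a single categorical draw, not the authors' implementation. It assumes strictly positive probabilities and uses the standard top-down construction for sampling Gumbel noise conditioned on the observed outcome. The function names `sample_posterior_gumbels` and `counterfactual_sample` are hypothetical.

```python
import numpy as np


def sample_posterior_gumbels(log_p, observed, rng):
    """Sample g_i = log_p[i] + noise_i conditioned on argmax_i g_i == observed.

    Top-down construction: draw the maximum first, then draw every other
    coordinate from a Gumbel truncated to lie below that maximum.
    """
    # The max of independent Gumbel(log_p[i]) variables is Gumbel(logsumexp(log_p)).
    top = rng.gumbel(loc=np.logaddexp.reduce(log_p))
    unconstrained = rng.gumbel(loc=log_p)
    # Truncate each coordinate below `top`: -log(exp(-top) + exp(-G_i)).
    g = -np.logaddexp(-top, -unconstrained)
    g[observed] = top
    return g


def counterfactual_sample(p_obs, p_cf, observed, rng):
    """Draw one counterfactual outcome under p_cf, given `observed` under p_obs."""
    log_p = np.log(p_obs)
    log_q = np.log(p_cf)
    # Recover the shared exogenous noise consistent with the observation.
    noise = sample_posterior_gumbels(log_p, observed, rng) - log_p
    # Replay the same noise through the counterfactual probabilities.
    return int(np.argmax(log_q + noise))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p_obs = np.array([0.2, 0.5, 0.3])  # illustrative probabilities only
    p_cf = np.array([0.6, 0.1, 0.3])
    draws = [counterfactual_sample(p_obs, p_cf, 1, rng) for _ in range(5)]
    print(draws)
```

In this sketch, replaying the recovered noise under the original probabilities always reproduces the observed outcome, and an outcome whose relative probability does not decrease under the counterfactual distribution is never displaced.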

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-oberst19a,
  title = {Counterfactual Off-Policy Evaluation with {G}umbel-Max Structural Causal Models},
  author = {Oberst, Michael and Sontag, David},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages = {4881--4890},
  year = {2019},
  editor = {Kamalika Chaudhuri and Ruslan Salakhutdinov},
  volume = {97},
  series = {Proceedings of Machine Learning Research},
  month = {09--15 Jun},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v97/oberst19a/oberst19a.pdf},
  url = {http://proceedings.mlr.press/v97/oberst19a.html},
  abstract = {We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see this as a useful procedure for off-policy “debugging” in high-risk settings (e.g., healthcare); by decomposing the expected difference in reward between the RL and observed policy into specific episodes, we can identify episodes where the counterfactual difference in reward is most dramatic. This in turn can be used to facilitate review of specific episodes by domain experts. We demonstrate the utility of this procedure with a synthetic environment of sepsis management.}
}
Endnote
%0 Conference Paper
%T Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models
%A Michael Oberst
%A David Sontag
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-oberst19a
%I PMLR
%P 4881--4890
%U http://proceedings.mlr.press/v97/oberst19a.html
%V 97
%X We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see this as a useful procedure for off-policy “debugging” in high-risk settings (e.g., healthcare); by decomposing the expected difference in reward between the RL and observed policy into specific episodes, we can identify episodes where the counterfactual difference in reward is most dramatic. This in turn can be used to facilitate review of specific episodes by domain experts. We demonstrate the utility of this procedure with a synthetic environment of sepsis management.
APA
Oberst, M. & Sontag, D. (2019). Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:4881-4890. Available from http://proceedings.mlr.press/v97/oberst19a.html

Related Material