Counterfactual Data-Fusion for Online Reinforcement Learners

Andrew Forney; Judea Pearl; Elias Bareinboim

Proceedings of the 34th International Conference on Machine Learning, PMLR 70:1156-1164, 2017.

Abstract

The Multi-Armed Bandit problem with Unobserved Confounders (MABUC) considers decision-making settings where unmeasured variables can influence both the agent’s decisions and received rewards (Bareinboim et al., 2015). Recent findings showed that unobserved confounders (UCs) pose a unique challenge to algorithms based on standard randomization (i.e., experimental data); if UCs are naively averaged out, these algorithms behave sub-optimally, possibly incurring infinite regret. In this paper, we show how counterfactual-based decision-making circumvents these problems and leads to a coherent fusion of observational and experimental data. We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support our findings’ efficacy with extensive simulations.
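For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of textbook Beta-Bernoulli Thompson Sampling. It is not the enhanced, counterfactual-based player described in the paper and does nothing about unobserved confounders. The function name `thompson_sampling`, the arm payout rates, and the uniform Beta(1, 1) prior are all illustrative assumptions, not taken from the source.

```python
import random


def thompson_sampling(arm_probs, horizon, seed=0):
    """Textbook Beta-Bernoulli Thompson Sampling (no confounder handling).

    arm_probs: hypothetical true payout rate of each arm (unknown to the agent).
    horizon:   number of rounds to play.
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms

    for _ in range(horizon):
        # Draw one plausible payout rate per arm from its Beta posterior
        # (Beta(1, 1) prior, an assumption made for this sketch).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Play the arm whose sampled rate is highest.
        arm = max(range(n_arms), key=samples.__getitem__)

        # Simulate a Bernoulli reward and update that arm's posterior counts.
        reward = 1 if rng.random() < arm_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

    return pulls


if __name__ == "__main__":
    # Illustrative payout rates; the last arm is the best one.
    print(thompson_sampling([0.2, 0.5, 0.8], horizon=2000, seed=1))
```

This sampler learns only from its own randomized pulls, which corresponds to the purely experimental data the abstract mentions. The paper's contribution concerns settings where such averaging over unobserved confounders is not enough.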

Cite this Paper

BibTeX


@InProceedings{pmlr-v70-forney17a,
  title = {Counterfactual Data-Fusion for Online Reinforcement Learners},
  author = {Andrew Forney and Judea Pearl and Elias Bareinboim},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  pages = {1156--1164},
  year = {2017},
  editor = {Precup, Doina and Teh, Yee Whye},
  volume = {70},
  series = {Proceedings of Machine Learning Research},
  month = {06--11 Aug},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v70/forney17a/forney17a.pdf},
  url = {https://proceedings.mlr.press/v70/forney17a.html},
  abstract = 	 {The Multi-Armed Bandit problem with Unobserved Confounders (MABUC) considers decision-making settings where unmeasured variables can influence both the agent’s decisions and received rewards (Bareinboim et al., 2015). Recent findings showed that unobserved confounders (UCs) pose a unique challenge to algorithms based on standard randomization (i.e., experimental data); if UCs are naively averaged out, these algorithms behave sub-optimally, possibly incurring infinite regret. In this paper, we show how counterfactual-based decision-making circumvents these problems and leads to a coherent fusion of observational and experimental data. We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support our findings’ efficacy with extensive simulations.}
}

Endnote

%0 Conference Paper
%T Counterfactual Data-Fusion for Online Reinforcement Learners
%A Andrew Forney
%A Judea Pearl
%A Elias Bareinboim
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh
%F pmlr-v70-forney17a
%I PMLR
%P 1156--1164
%U https://proceedings.mlr.press/v70/forney17a.html
%V 70
%X The Multi-Armed Bandit problem with Unobserved Confounders (MABUC) considers decision-making settings where unmeasured variables can influence both the agent’s decisions and received rewards (Bareinboim et al., 2015). Recent findings showed that unobserved confounders (UCs) pose a unique challenge to algorithms based on standard randomization (i.e., experimental data); if UCs are naively averaged out, these algorithms behave sub-optimally, possibly incurring infinite regret. In this paper, we show how counterfactual-based decision-making circumvents these problems and leads to a coherent fusion of observational and experimental data. We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support our findings’ efficacy with extensive simulations.

APA


Forney, A., Pearl, J. & Bareinboim, E. (2017). Counterfactual Data-Fusion for Online Reinforcement Learners. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:1156-1164. Available from https://proceedings.mlr.press/v70/forney17a.html.
