Counterfactual Data-Fusion for Online Reinforcement Learners

Andrew Forney, Judea Pearl, Elias Bareinboim
Proceedings of the 34th International Conference on Machine Learning, PMLR 70:1156-1164, 2017.

Abstract

The Multi-Armed Bandit problem with Unobserved Confounders (MABUC) considers decision-making settings where unmeasured variables can influence both the agent’s decisions and received rewards (Bareinboim et al., 2015). Recent findings showed that unobserved confounders (UCs) pose a unique challenge to algorithms based on standard randomization (i.e., experimental data); if UCs are naively averaged out, these algorithms behave sub-optimally, possibly incurring infinite regret. In this paper, we show how counterfactual-based decision-making circumvents these problems and leads to a coherent fusion of observational and experimental data. We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support our findings’ efficacy with extensive simulations.
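The counterfactual strategy the abstract describes can be sketched in code. The following is a minimal, hypothetical illustration (not the authors' implementation): the agent treats its "intent" — the arm it would pull naturally, which is influenced by the UCs — as a context, maintains Beta posteriors over the counterfactual reward of each arm given that intent, and Thompson-samples from those posteriors before acting. All class and variable names here are illustrative assumptions.

```python
import random

class CounterfactualThompsonSampler:
    """Hedged sketch of intent-conditioned Thompson Sampling.

    For each (intent, arm) pair we keep a Beta(s, f) posterior over the
    counterfactual reward E[Y_{X=arm} | intent]; conditioning on intent
    is what lets the agent exploit information carried by the UCs.
    """

    def __init__(self, n_arms):
        self.n_arms = n_arms
        # Beta(1, 1) priors: successes s and failures f per (intent, arm).
        self.s = [[1] * n_arms for _ in range(n_arms)]
        self.f = [[1] * n_arms for _ in range(n_arms)]

    def choose(self, intent):
        # Sample a plausible counterfactual payoff for each arm given the
        # intended arm, then act greedily on the sampled values.
        samples = [random.betavariate(self.s[intent][a], self.f[intent][a])
                   for a in range(self.n_arms)]
        return max(range(self.n_arms), key=lambda a: samples[a])

    def update(self, intent, arm, reward):
        # Binary reward updates the posterior for the (intent, arm) pulled.
        if reward:
            self.s[intent][arm] += 1
        else:
            self.f[intent][arm] += 1
```

Under this sketch, an ordinary Thompson sampler would average over intents and could be misled by the UCs; conditioning each posterior on the intent keeps the observational (intended) and experimental (pulled) regimes distinct.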

Cite this Paper


BibTeX
@InProceedings{pmlr-v70-forney17a,
  title     = {Counterfactual Data-Fusion for Online Reinforcement Learners},
  author    = {Andrew Forney and Judea Pearl and Elias Bareinboim},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  pages     = {1156--1164},
  year      = {2017},
  editor    = {Precup, Doina and Teh, Yee Whye},
  volume    = {70},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--11 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v70/forney17a/forney17a.pdf},
  url       = {https://proceedings.mlr.press/v70/forney17a.html},
  abstract  = {The Multi-Armed Bandit problem with Unobserved Confounders (MABUC) considers decision-making settings where unmeasured variables can influence both the agent’s decisions and received rewards (Bareinboim et al., 2015). Recent findings showed that unobserved confounders (UCs) pose a unique challenge to algorithms based on standard randomization (i.e., experimental data); if UCs are naively averaged out, these algorithms behave sub-optimally, possibly incurring infinite regret. In this paper, we show how counterfactual-based decision-making circumvents these problems and leads to a coherent fusion of observational and experimental data. We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support our findings’ efficacy with extensive simulations.}
}
Endnote
%0 Conference Paper
%T Counterfactual Data-Fusion for Online Reinforcement Learners
%A Andrew Forney
%A Judea Pearl
%A Elias Bareinboim
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh
%F pmlr-v70-forney17a
%I PMLR
%P 1156--1164
%U https://proceedings.mlr.press/v70/forney17a.html
%V 70
%X The Multi-Armed Bandit problem with Unobserved Confounders (MABUC) considers decision-making settings where unmeasured variables can influence both the agent’s decisions and received rewards (Bareinboim et al., 2015). Recent findings showed that unobserved confounders (UCs) pose a unique challenge to algorithms based on standard randomization (i.e., experimental data); if UCs are naively averaged out, these algorithms behave sub-optimally, possibly incurring infinite regret. In this paper, we show how counterfactual-based decision-making circumvents these problems and leads to a coherent fusion of observational and experimental data. We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support our findings’ efficacy with extensive simulations.
APA
Forney, A., Pearl, J. & Bareinboim, E. (2017). Counterfactual Data-Fusion for Online Reinforcement Learners. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:1156-1164. Available from https://proceedings.mlr.press/v70/forney17a.html.