Distributionally Robust Policy Evaluation and Learning in Offline Contextual Bandits

Nian Si, Fan Zhang, Zhengyuan Zhou, Jose Blanchet
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:8884-8894, 2020.

Abstract

Policy learning using historical observational data is an important problem that has found widespread applications. However, existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that has generated the data, an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy with bandit observational data. We propose a novel learning algorithm that is able to learn a policy robust to adversarial perturbations and unknown covariate shifts. We first present a policy evaluation procedure in the ambiguous environment and also give a heuristic algorithm to solve the distributionally robust policy learning problems efficiently. Additionally, we provide extensive simulations to demonstrate the robustness of our policy.
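To illustrate the flavor of distributionally robust evaluation described above, the sketch below computes a worst-case mean reward over a KL-divergence ball around the empirical distribution, using the standard dual representation inf_{P: KL(P||P0) <= delta} E_P[R] = sup_{a>0} { -a log E_{P0}[exp(-R/a)] - a*delta }. This is a generic illustration of KL-robust evaluation, not the paper's exact estimator; the function name, reward values, and the grid search over the dual variable are all assumptions made for the example.

```python
import math

def robust_value(rewards, delta, alphas=None):
    """Worst-case mean of `rewards` over a KL ball of radius `delta`,
    via the dual:  sup_{a>0}  -a*log(mean(exp(-r/a))) - a*delta.
    (Illustrative sketch; the dual variable is optimized by grid search.)"""
    n = len(rewards)
    if alphas is None:
        # log-spaced grid over the dual variable a in [1e-3, 1e3]
        alphas = [10 ** (k / 10) for k in range(-30, 31)]
    best = -float("inf")
    for a in alphas:
        # numerically stable log-mean-exp of -r/a
        m = max(-r / a for r in rewards)
        lme = m + math.log(sum(math.exp(-r / a - m) for r in rewards) / n)
        best = max(best, -a * lme - a * delta)
    return best

# Hypothetical per-decision rewards collected under some logging policy.
rewards = [1.0, 0.5, 0.8, 0.2, 0.9]
print(robust_value(rewards, 0.1))  # lies below the empirical mean of 0.68
```

As the radius delta shrinks to zero the robust value recovers the empirical mean, and as it grows the value moves toward the worst observed reward, which is the qualitative behavior a robust evaluation procedure is designed to capture.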

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-si20a,
  title     = {Distributionally Robust Policy Evaluation and Learning in Offline Contextual Bandits},
  author    = {Si, Nian and Zhang, Fan and Zhou, Zhengyuan and Blanchet, Jose},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {8884--8894},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/si20a/si20a.pdf},
  url       = {http://proceedings.mlr.press/v119/si20a.html},
  abstract  = {Policy learning using historical observational data is an important problem that has found widespread applications. However, existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that has generated the data{–}an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy with bandit observational data. We propose a novel learning algorithm that is able to learn a robust policy to adversarial perturbations and unknown covariate shifts. We first present a policy evaluation procedure in the ambiguous environment and also give a heuristic algorithm to solve the distributionally robust policy learning problems efficiently. Additionally, we provide extensive simulations to demonstrate the robustness of our policy.}
}
Endnote
%0 Conference Paper
%T Distributionally Robust Policy Evaluation and Learning in Offline Contextual Bandits
%A Nian Si
%A Fan Zhang
%A Zhengyuan Zhou
%A Jose Blanchet
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-si20a
%I PMLR
%P 8884--8894
%U http://proceedings.mlr.press/v119/si20a.html
%V 119
%X Policy learning using historical observational data is an important problem that has found widespread applications. However, existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that has generated the data{–}an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy with bandit observational data. We propose a novel learning algorithm that is able to learn a robust policy to adversarial perturbations and unknown covariate shifts. We first present a policy evaluation procedure in the ambiguous environment and also give a heuristic algorithm to solve the distributionally robust policy learning problems efficiently. Additionally, we provide extensive simulations to demonstrate the robustness of our policy.
APA
Si, N., Zhang, F., Zhou, Z. & Blanchet, J. (2020). Distributionally Robust Policy Evaluation and Learning in Offline Contextual Bandits. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:8884-8894. Available from http://proceedings.mlr.press/v119/si20a.html.