Distributionally Robust Policy Gradient for Offline Contextual Bandits

Zhouhao Yang, Yihong Guo, Pan Xu, Anqi Liu, Animashree Anandkumar
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:6443-6462, 2023.

Abstract

Learning an optimal policy from offline data is notoriously challenging, as it requires evaluating the learning policy using data pre-collected from a static logging policy. We study the policy optimization problem in offline contextual bandits using policy gradient methods. We employ a distributionally robust policy gradient method, DROPO, to account for the distributional shift between the static logging policy and the learning policy in policy gradient. Our approach conservatively estimates the conditional reward distribution and updates the policy accordingly. We show that our algorithm converges to a stationary point with rate $O(1/T)$, where $T$ is the number of time steps. We conduct experiments on real-world datasets under various scenarios of logging policies to compare our proposed algorithm with baseline methods in offline contextual bandits. We also propose a variant of our algorithm, DROPO-exp, to further improve the performance when a limited amount of online interaction is allowed. Our results demonstrate the effectiveness and robustness of the proposed algorithms, especially under heavily biased offline data.
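To make the offline setting above concrete, the following is a minimal, self-contained sketch of a distributionally robust policy-gradient step for an offline contextual bandit. It is not the authors' DROPO implementation: a linear softmax policy is updated from logged (context, action, reward, propensity) tuples, and a conservative policy value is obtained by taking the worst case over a KL ball around the empirical data distribution via its standard dual form. The policy class, the KL-ball uncertainty set, the radius `rho`, and the grid search over the dual variable `beta` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def log_mean_exp(z):
    # Numerically stable log(mean(exp(z))).
    m = z.max()
    return m + np.log(np.mean(np.exp(z - m)))

def kl_dual_worst_case(values, rho, betas=np.logspace(-3, 2, 200)):
    # Worst-case (pessimistic) mean of `values` over all distributions within a
    # KL ball of radius rho around the empirical distribution, via the dual
    #   sup_{beta > 0}  -beta * log E[exp(-v / beta)] - beta * rho,
    # solved here by a coarse grid search over beta.
    best_obj, best_beta = -np.inf, betas[-1]
    for b in betas:
        obj = -b * log_mean_exp(-values / b) - b * rho
        if obj > best_obj:
            best_obj, best_beta = obj, b
    # Adversarial sample weights q_i ∝ exp(-v_i / beta*) (shifted for stability).
    q = np.exp(-(values - values.min()) / best_beta)
    q /= q.sum()
    return best_obj, q

def robust_pg_step(theta, X, a_log, r, p_log, rho=0.1, lr=0.05):
    # One robust policy-gradient step for a linear softmax policy.
    # X: (n, d) contexts; a_log: (n,) logged actions; r: (n,) rewards;
    # p_log: (n,) logging propensities of the logged actions.
    n = X.shape[0]
    pi = softmax(X @ theta)                      # (n, K) action probabilities
    pi_a = pi[np.arange(n), a_log]
    w = pi_a / p_log                             # importance weights
    v = w * r                                    # per-sample IS estimate of policy value
    robust_value, q = kl_dual_worst_case(v, rho)

    # Score-function gradient of the robust objective, holding the adversarial
    # weights q fixed at their optimum (Danskin's theorem):
    #   grad = sum_i q_i * v_i * grad_theta log pi(a_i | x_i),
    # where grad_theta log pi(a|x) = x (e_a - pi(x))^T for a softmax policy.
    one_hot = np.zeros_like(pi)
    one_hot[np.arange(n), a_log] = 1.0
    grad = X.T @ ((one_hot - pi) * (q * v)[:, None])
    return theta + lr * grad, robust_value

# Tiny synthetic example: 3 actions, 5-dimensional contexts, uniform logging policy.
rng = np.random.default_rng(0)
n, d, K = 2000, 5, 3
X = rng.normal(size=(n, d))
a_log = rng.integers(0, K, size=n)
p_log = np.full(n, 1.0 / K)
r = (X[:, 0] > 0).astype(float) * (a_log == 0) + 0.1 * rng.normal(size=n)
theta = np.zeros((d, K))
for _ in range(200):
    theta, val = robust_pg_step(theta, X, a_log, r, p_log)
```

In this sketch, pessimism enters only through the KL-dual reweighting of the importance-sampled values; the paper's method instead conservatively estimates the conditional reward distribution itself, so this should be read as a generic distributionally robust policy-gradient template rather than DROPO.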

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-yang23f,
  title     = {Distributionally Robust Policy Gradient for Offline Contextual Bandits},
  author    = {Yang, Zhouhao and Guo, Yihong and Xu, Pan and Liu, Anqi and Anandkumar, Animashree},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {6443--6462},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/yang23f/yang23f.pdf},
  url       = {https://proceedings.mlr.press/v206/yang23f.html},
  abstract  = {Learning an optimal policy from offline data is notoriously challenging, as it requires evaluating the learning policy using data pre-collected from a static logging policy. We study the policy optimization problem in offline contextual bandits using policy gradient methods. We employ a distributionally robust policy gradient method, DROPO, to account for the distributional shift between the static logging policy and the learning policy in policy gradient. Our approach conservatively estimates the conditional reward distribution and updates the policy accordingly. We show that our algorithm converges to a stationary point with rate $O(1/T)$, where $T$ is the number of time steps. We conduct experiments on real-world datasets under various scenarios of logging policies to compare our proposed algorithm with baseline methods in offline contextual bandits. We also propose a variant of our algorithm, DROPO-exp, to further improve the performance when a limited amount of online interaction is allowed. Our results demonstrate the effectiveness and robustness of the proposed algorithms, especially under heavily biased offline data.}
}
Endnote
%0 Conference Paper
%T Distributionally Robust Policy Gradient for Offline Contextual Bandits
%A Zhouhao Yang
%A Yihong Guo
%A Pan Xu
%A Anqi Liu
%A Animashree Anandkumar
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-yang23f
%I PMLR
%P 6443--6462
%U https://proceedings.mlr.press/v206/yang23f.html
%V 206
%X Learning an optimal policy from offline data is notoriously challenging, as it requires evaluating the learning policy using data pre-collected from a static logging policy. We study the policy optimization problem in offline contextual bandits using policy gradient methods. We employ a distributionally robust policy gradient method, DROPO, to account for the distributional shift between the static logging policy and the learning policy in policy gradient. Our approach conservatively estimates the conditional reward distribution and updates the policy accordingly. We show that our algorithm converges to a stationary point with rate $O(1/T)$, where $T$ is the number of time steps. We conduct experiments on real-world datasets under various scenarios of logging policies to compare our proposed algorithm with baseline methods in offline contextual bandits. We also propose a variant of our algorithm, DROPO-exp, to further improve the performance when a limited amount of online interaction is allowed. Our results demonstrate the effectiveness and robustness of the proposed algorithms, especially under heavily biased offline data.
APA
Yang, Z., Guo, Y., Xu, P., Liu, A. & Anandkumar, A. (2023). Distributionally Robust Policy Gradient for Offline Contextual Bandits. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:6443-6462. Available from https://proceedings.mlr.press/v206/yang23f.html.
