Non-Stationary Off-Policy Optimization

Joey Hong; Branislav Kveton; Manzil Zaheer; Yinlam Chow; Amr Ahmed

Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:2494-2502, 2021.

Abstract

Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary contextual bandits. Our proposed solution has two phases. In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state. In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance. This approach is practical and analyzable, and we provide guarantees on both the quality of off-policy optimization and the regret during online deployment. To show the effectiveness of our approach, we compare it to state-of-the-art baselines on both synthetic and real-world datasets. Our approach outperforms methods that act only on observed context.
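
The online deployment phase described above, switching between already-learned sub-policies based on their performance, can be illustrated with a minimal sketch. This is not the switching rule from the paper: the class name `SubPolicySwitcher`, the `window` parameter, and the sliding-window average rule are all assumptions chosen for brevity. The sketch only shows the general idea that recent rewards decide which sub-policy acts next.

```python
from collections import deque


class SubPolicySwitcher:
    """Pick among pre-learned sub-policies by sliding-window average reward.

    Hypothetical illustration: the windowed-mean rule is an assumption of
    this sketch, not the method analyzed in the paper.
    """

    def __init__(self, num_sub_policies, window=20):
        # One bounded reward history per sub-policy (one per latent state).
        self.history = [deque(maxlen=window) for _ in range(num_sub_policies)]

    def select(self):
        # Try every sub-policy once before comparing averages.
        for index, rewards in enumerate(self.history):
            if not rewards:
                return index
        means = [sum(rewards) / len(rewards) for rewards in self.history]
        # Return the index of the sub-policy with the best recent average.
        return max(range(len(means)), key=means.__getitem__)

    def update(self, index, reward):
        # Old rewards fall out of the window, so the choice can track change.
        self.history[index].append(reward)
```

In each round a caller would invoke `select()`, act with the chosen sub-policy on the observed context, and pass the observed reward to `update()`. Because the rule is purely greedy, it only revisits a sub-policy once the current one degrades; a practical method would need explicit exploration or a model of the latent state.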

Cite this Paper

BibTeX


@InProceedings{pmlr-v130-hong21a,
  title = {Non-Stationary Off-Policy Optimization},
  author = {Hong, Joey and Kveton, Branislav and Zaheer, Manzil and Chow, Yinlam and Ahmed, Amr},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages = {2494--2502},
  year = {2021},
  editor = {Banerjee, Arindam and Fukumizu, Kenji},
  volume = {130},
  series = {Proceedings of Machine Learning Research},
  month = {13--15 Apr},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v130/hong21a/hong21a.pdf},
  url = {https://proceedings.mlr.press/v130/hong21a.html},
  abstract = {Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary contextual bandits. Our proposed solution has two phases. In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state. In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance. This approach is practical and analyzable, and we provide guarantees on both the quality of off-policy optimization and the regret during online deployment. To show the effectiveness of our approach, we compare it to state-of-the-art baselines on both synthetic and real-world datasets. Our approach outperforms methods that act only on observed context.}
}

Endnote

%0 Conference Paper
%T Non-Stationary Off-Policy Optimization
%A Joey Hong
%A Branislav Kveton
%A Manzil Zaheer
%A Yinlam Chow
%A Amr Ahmed
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu
%F pmlr-v130-hong21a
%I PMLR
%P 2494--2502
%U https://proceedings.mlr.press/v130/hong21a.html
%V 130
%X Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary contextual bandits. Our proposed solution has two phases. In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state. In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance. This approach is practical and analyzable, and we provide guarantees on both the quality of off-policy optimization and the regret during online deployment. To show the effectiveness of our approach, we compare it to state-of-the-art baselines on both synthetic and real-world datasets. Our approach outperforms methods that act only on observed context.

APA


Hong, J., Kveton, B., Zaheer, M., Chow, Y. & Ahmed, A. (2021). Non-Stationary Off-Policy Optimization. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:2494-2502. Available from https://proceedings.mlr.press/v130/hong21a.html.
