Offline Policy Selection under Uncertainty

Mengjiao Yang, Bo Dai, Ofir Nachum, George Tucker, Dale Schuurmans
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:4376-4396, 2022.

Abstract

The presence of uncertainty in policy evaluation significantly complicates the process of policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their expected values or high-confidence intervals, access to the full distribution over one’s belief of the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics. We propose a Bayesian approach for estimating this belief distribution in terms of posteriors of distribution correction ratios derived from stochastic constraints. Empirically, despite being Bayesian, the credible intervals obtained are competitive with state-of-the-art frequentist approaches in confidence interval estimation. More importantly, we show how the belief distribution may be used to rank policies with respect to arbitrary downstream policy selection metrics, and empirically demonstrate that this selection procedure significantly outperforms existing approaches, such as ranking policies according to mean or high-confidence lower bound value estimates.
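The contrast the abstract draws between point-estimate ranking and distribution-aware selection can be illustrated with a small sketch. The snippet below is not the paper's algorithm; it simply assumes access to posterior samples of each candidate policy's value (which in the paper would be derived from posteriors over distribution correction ratios) and, using simulated samples, ranks policies by posterior mean, by a lower credible bound, and by expected top-1 regret as one example of a downstream selection metric. All variable names and the simulated posterior are illustrative assumptions.

import numpy as np

# Hypothetical posterior samples of policy values: one row per posterior draw,
# one column per candidate policy. Simulated here for illustration only.
rng = np.random.default_rng(0)
num_draws, num_policies = 1000, 5
value_samples = rng.normal(loc=rng.uniform(0.0, 1.0, num_policies),
                           scale=0.1, size=(num_draws, num_policies))

# Point-estimate selection: rank policies by posterior mean value.
rank_by_mean = np.argsort(-value_samples.mean(axis=0))

# High-confidence selection: rank by a lower credible bound (10th percentile).
rank_by_lcb = np.argsort(-np.percentile(value_samples, 10, axis=0))

# Distribution-aware selection for an arbitrary downstream metric, e.g. expected
# top-1 regret: the expected gap between the best policy's value and each
# candidate's value under the posterior. Lower expected regret is better.
best_per_draw = value_samples.max(axis=1)
expected_regret = (best_per_draw[:, None] - value_samples).mean(axis=0)
rank_by_regret = np.argsort(expected_regret)

print("mean ranking:   ", rank_by_mean)
print("LCB ranking:    ", rank_by_lcb)
print("regret ranking: ", rank_by_regret)

The three rankings can disagree when value posteriors overlap or have different spreads, which is exactly the regime in which the abstract argues that access to the full belief distribution, rather than a single point or interval estimate, matters for selection.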

Cite this Paper


BibTeX
@InProceedings{pmlr-v151-yang22a,
  title     = {Offline Policy Selection under Uncertainty},
  author    = {Yang, Mengjiao and Dai, Bo and Nachum, Ofir and Tucker, George and Schuurmans, Dale},
  booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages     = {4376--4396},
  year      = {2022},
  editor    = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume    = {151},
  series    = {Proceedings of Machine Learning Research},
  month     = {28--30 Mar},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v151/yang22a/yang22a.pdf},
  url       = {https://proceedings.mlr.press/v151/yang22a.html}
}
Endnote
%0 Conference Paper
%T Offline Policy Selection under Uncertainty
%A Mengjiao Yang
%A Bo Dai
%A Ofir Nachum
%A George Tucker
%A Dale Schuurmans
%B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2022
%E Gustau Camps-Valls
%E Francisco J. R. Ruiz
%E Isabel Valera
%F pmlr-v151-yang22a
%I PMLR
%P 4376--4396
%U https://proceedings.mlr.press/v151/yang22a.html
%V 151
APA
Yang, M., Dai, B., Nachum, O., Tucker, G., & Schuurmans, D. (2022). Offline Policy Selection under Uncertainty. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research, 151:4376-4396. Available from https://proceedings.mlr.press/v151/yang22a.html.