Off-Policy Confidence Sequences

Nikos Karampatziakis; Paul Mineiro; Aaditya Ramdas

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5301-5310, 2021.

Abstract

We develop confidence bounds that hold uniformly over time for off-policy evaluation in the contextual bandit setting. These confidence sequences are based on recent ideas from martingale analysis and are non-asymptotic, non-parametric, and valid at arbitrary stopping times. We provide algorithms for computing these confidence sequences that strike a good balance between computational and statistical efficiency. We empirically demonstrate the tightness of our approach in terms of failure probability and width and apply it to the “gated deployment” problem of safely upgrading a production contextual bandit system.

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-karampatziakis21a,
  title     = {Off-Policy Confidence Sequences},
  author    = {Karampatziakis, Nikos and Mineiro, Paul and Ramdas, Aaditya},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {5301--5310},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/karampatziakis21a/karampatziakis21a.pdf},
  url       = {https://proceedings.mlr.press/v139/karampatziakis21a.html},
  abstract  = {We develop confidence bounds that hold uniformly over time for off-policy evaluation in the contextual bandit setting. These confidence sequences are based on recent ideas from martingale analysis and are non-asymptotic, non-parametric, and valid at arbitrary stopping times. We provide algorithms for computing these confidence sequences that strike a good balance between computational and statistical efficiency. We empirically demonstrate the tightness of our approach in terms of failure probability and width and apply it to the “gated deployment” problem of safely upgrading a production contextual bandit system.}
}

Endnote

%0 Conference Paper
%T Off-Policy Confidence Sequences
%A Nikos Karampatziakis
%A Paul Mineiro
%A Aaditya Ramdas
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-karampatziakis21a
%I PMLR
%P 5301--5310
%U https://proceedings.mlr.press/v139/karampatziakis21a.html
%V 139
%X We develop confidence bounds that hold uniformly over time for off-policy evaluation in the contextual bandit setting. These confidence sequences are based on recent ideas from martingale analysis and are non-asymptotic, non-parametric, and valid at arbitrary stopping times. We provide algorithms for computing these confidence sequences that strike a good balance between computational and statistical efficiency. We empirically demonstrate the tightness of our approach in terms of failure probability and width and apply it to the “gated deployment” problem of safely upgrading a production contextual bandit system.

APA

Karampatziakis, N., Mineiro, P. & Ramdas, A. (2021). Off-Policy Confidence Sequences. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5301-5310. Available from https://proceedings.mlr.press/v139/karampatziakis21a.html.
