Finite-Sample Regret Bound for Distributionally Robust Offline Tabular Reinforcement Learning

Zhengqing Zhou, Zhengyuan Zhou, Qinxun Bai, Linhai Qiu, Jose Blanchet, Peter Glynn
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:3331-3339, 2021.

Abstract

While reinforcement learning has recently witnessed tremendous success in a wide range of domains, robustness (or the lack thereof) remains an important issue that has not been adequately addressed. In this paper, we provide a distributionally robust formulation of offline policy learning in tabular RL that aims to learn, from historical data collected by some other behavior policy, a policy that is robust to a future environment arising as a perturbation of the training environment. We first develop a novel policy evaluation scheme that accurately estimates the robust value of any given policy (i.e., its worst-case performance in a perturbed environment) and establish its finite-sample estimation error. Building on this, we then develop a novel and minimax-optimal distributionally robust learning algorithm that achieves $O_P\left(1/\sqrt{n}\right)$ regret, meaning that with high probability, the policy learned from $n$ training data points will be $O\left(1/\sqrt{n}\right)$ close to the optimal distributionally robust policy. Finally, our simulation results demonstrate the superiority of our distributionally robust approach over non-robust RL algorithms.
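The robust value referred to above is a worst-case value over an uncertainty set of transition kernels centered at the nominal (training) model; a common choice in the distributionally robust MDP literature is a Kullback-Leibler ball around the empirical kernel. As a minimal sketch of the kind of computation this entails (not the authors' algorithm or code), the snippet below runs tabular robust value iteration with (s, a)-rectangular KL uncertainty sets of an assumed radius delta around an empirical kernel P_hat, using the standard scalar dual of the KL-constrained worst case; the function names, the grid search over the dual variable, and all parameter values are illustrative assumptions.

```python
import numpy as np

def kl_worst_case_mean(p, v, delta, n_grid=200):
    # Worst-case expectation  inf { E_q[v] : KL(q || p) <= delta }  for a finite
    # nominal distribution p, computed through its scalar dual
    #     sup_{a > 0}  -a * log E_p[ exp(-v / a) ] - a * delta,
    # approximated here by a coarse log-spaced grid search over the dual variable a.
    support = p > 0
    ps, vs = p[support], v[support]
    m = vs.min()                      # shift v for numerical stability
    best = m                          # a -> 0 limit: minimum of v on the support of p
    for a in np.logspace(-3, 3, n_grid):
        val = -a * np.log(np.dot(ps, np.exp(-(vs - m) / a))) + m - a * delta
        best = max(best, val)
    return best

def robust_value_iteration(P_hat, R, delta, gamma=0.9, n_iter=300):
    # Tabular distributionally robust value iteration with (s, a)-rectangular KL balls
    # of radius delta around the empirical kernel P_hat (shape S x A x S); R is S x A.
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                Q[s, a] = R[s, a] + gamma * kl_worst_case_mean(P_hat[s, a], V, delta)
        V = Q.max(axis=1)
    policy = Q.argmax(axis=1)
    return policy, V
```

Under these assumptions, robust_value_iteration(P_hat, R, delta=0.1) returns a policy that is greedy with respect to an estimated robust Q-function; the paper's contribution is the finite-sample analysis of how close the robust value of the policy learned from n offline samples comes to the optimal robust value.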

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-zhou21d,
  title     = {Finite-Sample Regret Bound for Distributionally Robust Offline Tabular Reinforcement Learning},
  author    = {Zhou, Zhengqing and Zhou, Zhengyuan and Bai, Qinxun and Qiu, Linhai and Blanchet, Jose and Glynn, Peter},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages     = {3331--3339},
  year      = {2021},
  editor    = {Banerjee, Arindam and Fukumizu, Kenji},
  volume    = {130},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--15 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v130/zhou21d/zhou21d.pdf},
  url       = {https://proceedings.mlr.press/v130/zhou21d.html}
}
Endnote
%0 Conference Paper
%T Finite-Sample Regret Bound for Distributionally Robust Offline Tabular Reinforcement Learning
%A Zhengqing Zhou
%A Zhengyuan Zhou
%A Qinxun Bai
%A Linhai Qiu
%A Jose Blanchet
%A Peter Glynn
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu
%F pmlr-v130-zhou21d
%I PMLR
%P 3331--3339
%U https://proceedings.mlr.press/v130/zhou21d.html
%V 130
APA
Zhou, Z., Zhou, Z., Bai, Q., Qiu, L., Blanchet, J. & Glynn, P. (2021). Finite-Sample Regret Bound for Distributionally Robust Offline Tabular Reinforcement Learning. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:3331-3339. Available from https://proceedings.mlr.press/v130/zhou21d.html.
