Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback

Tal Lancewicki, Aviv Rosenberg, Dmitry Sotnikov
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:18482-18534, 2023.

Abstract

Policy Optimization (PO) is one of the most popular methods in Reinforcement Learning (RL). Thus, theoretical guarantees for PO algorithms have become especially important to the RL community. In this paper, we study PO in adversarial MDPs with a challenge that arises in almost every real-world application: delayed bandit feedback. We give the first near-optimal regret bounds for PO in tabular MDPs, which may even surpass the state-of-the-art (obtained with less efficient methods). Our novel Delay-Adapted PO (DAPO) is easy to implement and to generalize, allowing us to extend our algorithm to: (i) infinite state space under the assumption of linear $Q$-function, proving the first regret bounds for delayed feedback with function approximation; (ii) deep RL, demonstrating its effectiveness in experiments on MuJoCo domains.

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-lancewicki23a,
  title     = {Delay-Adapted Policy Optimization and Improved Regret for Adversarial {MDP} with Delayed Bandit Feedback},
  author    = {Lancewicki, Tal and Rosenberg, Aviv and Sotnikov, Dmitry},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {18482--18534},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/lancewicki23a/lancewicki23a.pdf},
  url       = {https://proceedings.mlr.press/v202/lancewicki23a.html},
  abstract  = {Policy Optimization (PO) is one of the most popular methods in Reinforcement Learning (RL). Thus, theoretical guarantees for PO algorithms have become especially important to the RL community. In this paper, we study PO in adversarial MDPs with a challenge that arises in almost every real-world application – delayed bandit feedback. We give the first near-optimal regret bounds for PO in tabular MDPs, and may even surpass state-of-the-art (which uses less efficient methods). Our novel Delay-Adapted PO (DAPO) is easy to implement and to generalize, allowing us to extend our algorithm to: (i) infinite state space under the assumption of linear $Q$-function, proving the first regret bounds for delayed feedback with function approximation. (ii) deep RL, demonstrating its effectiveness in experiments on MuJoCo domains.}
}
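To use this entry in a LaTeX document, save it to a bibliography file and cite it by its key. A minimal sketch (the filename `refs.bib` and the surrounding document are assumptions for illustration):

```latex
% refs.bib contains the @InProceedings entry above,
% keyed pmlr-v202-lancewicki23a.
\documentclass{article}
\begin{document}
Delay-adapted policy optimization~\cite{pmlr-v202-lancewicki23a}
gives near-optimal regret bounds under delayed bandit feedback.
\bibliographystyle{plain}   % or any style matching your venue
\bibliography{refs}         % loads refs.bib
\end{document}
```

Compiling with `pdflatex`, then `bibtex`, then `pdflatex` twice resolves the citation and reference list.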
Endnote
%0 Conference Paper
%T Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback
%A Tal Lancewicki
%A Aviv Rosenberg
%A Dmitry Sotnikov
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-lancewicki23a
%I PMLR
%P 18482--18534
%U https://proceedings.mlr.press/v202/lancewicki23a.html
%V 202
%X Policy Optimization (PO) is one of the most popular methods in Reinforcement Learning (RL). Thus, theoretical guarantees for PO algorithms have become especially important to the RL community. In this paper, we study PO in adversarial MDPs with a challenge that arises in almost every real-world application – delayed bandit feedback. We give the first near-optimal regret bounds for PO in tabular MDPs, and may even surpass state-of-the-art (which uses less efficient methods). Our novel Delay-Adapted PO (DAPO) is easy to implement and to generalize, allowing us to extend our algorithm to: (i) infinite state space under the assumption of linear $Q$-function, proving the first regret bounds for delayed feedback with function approximation. (ii) deep RL, demonstrating its effectiveness in experiments on MuJoCo domains.
APA
Lancewicki, T., Rosenberg, A. &amp; Sotnikov, D. (2023). Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:18482-18534. Available from https://proceedings.mlr.press/v202/lancewicki23a.html.