Optimism and Delays in Episodic Reinforcement Learning

Benjamin Howson, Ciara Pike-Burke, Sarah Filippi
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:6061-6094, 2023.

Abstract

There are many algorithms for regret minimisation in episodic reinforcement learning. This problem is well-understood from a theoretical perspective, provided that the sequences of states, actions and rewards associated with each episode are available to the algorithm updating the policy immediately after every interaction with the environment. However, feedback is almost always delayed in practice. In this paper, we study the impact of delayed feedback in episodic reinforcement learning from a theoretical perspective and propose two general-purpose approaches to handling the delays. The first involves updating as soon as new information becomes available, whereas the second waits before using newly observed information to update the policy. For the class of optimistic algorithms and either approach, we show that the regret increases by an additive term involving the number of states, actions, episode length, the expected delay and an algorithm-dependent constant. We empirically investigate the impact of various delay distributions on the regret of optimistic algorithms to validate our theoretical results.
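
The two delay-handling schemes described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: OptimisticAgent, run_episode and the uniform-random delay model are hypothetical placeholders, and the snippet only shows the structural difference between updating as soon as delayed feedback arrives and waiting to update in batches.

    import random
    from collections import defaultdict

    class OptimisticAgent:
        """Stand-in for any optimistic algorithm; a real one would rebuild
        optimistic value estimates from all observed transitions on update."""
        def __init__(self):
            self.data = []

        def update(self, trajectories):
            self.data.extend(trajectories)

        def act(self, state):
            return random.choice([0, 1])  # placeholder policy

    def run_episode(agent, horizon=5):
        # Placeholder environment interaction returning one trajectory of
        # (state, action, reward) tuples.
        return [(0, agent.act(0), random.random()) for _ in range(horizon)]

    def run_with_delays(num_episodes=100, wait_before_update=False, batch=10):
        agent = OptimisticAgent()
        pending = defaultdict(list)  # arrival episode -> delayed trajectories
        buffer = []                  # used only by the "wait" scheme
        for k in range(num_episodes):
            # Feedback generated in episode k only arrives `delay` episodes later.
            delay = random.randint(0, 20)
            pending[k + delay].append(run_episode(agent))

            arrived = pending.pop(k, [])
            if wait_before_update:
                # Scheme 2: buffer arrivals, update only every `batch` episodes.
                buffer.extend(arrived)
                if (k + 1) % batch == 0 and buffer:
                    agent.update(buffer)
                    buffer = []
            elif arrived:
                # Scheme 1: update as soon as new information becomes available.
                agent.update(arrived)
        return agent

Under either scheme (run_with_delays() or run_with_delays(wait_before_update=True)), the policy is trained on stale data for roughly as many episodes as the expected delay, which is the intuition behind the additive regret term in the paper's bounds.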

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-howson23a,
  title     = {Optimism and Delays in Episodic Reinforcement Learning},
  author    = {Howson, Benjamin and Pike-Burke, Ciara and Filippi, Sarah},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {6061--6094},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/howson23a/howson23a.pdf},
  url       = {https://proceedings.mlr.press/v206/howson23a.html},
  abstract  = {There are many algorithms for regret minimisation in episodic reinforcement learning. This problem is well-understood from a theoretical perspective, providing that the sequences of states, actions and rewards associated with each episode are available to the algorithm updating the policy immediately after every interaction with the environment. However, feedback is almost always delayed in practice. In this paper, we study the impact of delayed feedback in episodic reinforcement learning from a theoretical perspective and propose two general-purpose approaches to handling the delays. The first involves updating as soon as new information becomes available, whereas the second waits before using newly observed information to update the policy. For the class of optimistic algorithms and either approach, we show that the regret increases by an additive term involving the number of states, actions, episode length, the expected delay and an algorithm-dependent constant. We empirically investigate the impact of various delay distributions on the regret of optimistic algorithms to validate our theoretical results.}
}
Endnote
%0 Conference Paper
%T Optimism and Delays in Episodic Reinforcement Learning
%A Benjamin Howson
%A Ciara Pike-Burke
%A Sarah Filippi
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-howson23a
%I PMLR
%P 6061--6094
%U https://proceedings.mlr.press/v206/howson23a.html
%V 206
%X There are many algorithms for regret minimisation in episodic reinforcement learning. This problem is well-understood from a theoretical perspective, providing that the sequences of states, actions and rewards associated with each episode are available to the algorithm updating the policy immediately after every interaction with the environment. However, feedback is almost always delayed in practice. In this paper, we study the impact of delayed feedback in episodic reinforcement learning from a theoretical perspective and propose two general-purpose approaches to handling the delays. The first involves updating as soon as new information becomes available, whereas the second waits before using newly observed information to update the policy. For the class of optimistic algorithms and either approach, we show that the regret increases by an additive term involving the number of states, actions, episode length, the expected delay and an algorithm-dependent constant. We empirically investigate the impact of various delay distributions on the regret of optimistic algorithms to validate our theoretical results.
APA
Howson, B., Pike-Burke, C. & Filippi, S. (2023). Optimism and Delays in Episodic Reinforcement Learning. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:6061-6094. Available from https://proceedings.mlr.press/v206/howson23a.html.
