The Regret of Exploration and the Control of Bad Episodes in Reinforcement Learning

Victor Boone, Bruno Gaujal
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:2824-2856, 2023.

Abstract

The first contribution of this paper is the introduction of a new performance measure of an RL algorithm, called the regret of exploration, that is more discriminating than the regret and measures the asymptotic cost of exploration. The second contribution is a new performance test (PT) to end episodes in optimistic RL algorithms. This test is based on the performance of the current policy with respect to the best policy over the current confidence set, in contrast with all existing RL algorithms, whose episode lengths are based only on the number of visits to the states. This modification does not harm the regret and brings an additional property: we show that, while all current episodic RL algorithms have a linear regret of exploration, our method has an $O(\log{T})$ regret of exploration for non-degenerate deterministic MDPs.
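
To make the contrast concrete, the following minimal Python sketch compares the two kinds of episode-ending rules discussed in the abstract: the classical visit-count (doubling) criterion used by optimistic algorithms such as UCRL2, and a performance-test-style criterion that stops an episode once the played policy is no longer near-optimal within the current confidence set. The function names, the gap threshold epsilon, and the exact form of the test are illustrative assumptions; the paper specifies the actual test (PT) and its guarantees.

    import numpy as np

    def doubling_stop(episode_visits, prior_visits):
        # Classical rule (UCRL2-style): end the episode as soon as the number of
        # visits to some state-action pair within the episode reaches the count
        # it had accumulated before the episode started.
        return np.any(episode_visits >= np.maximum(1, prior_visits))

    def performance_test_stop(gain_of_played_policy, best_gain_over_confidence_set, epsilon):
        # Performance-test-style rule (sketch): end the episode once the gain of
        # the policy currently played falls more than epsilon below the best gain
        # achievable over the current confidence set of plausible MDPs.
        return gain_of_played_policy < best_gain_over_confidence_set - epsilon

    # Example: the doubling rule fires for pair 0 (3 >= 3), while the performance
    # test keeps the episode alive as long as the played policy stays epsilon-optimal.
    print(doubling_stop(np.array([3, 1]), np.array([3, 8])))      # True
    print(performance_test_stop(0.48, 0.50, epsilon=0.05))        # False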

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-boone23a,
  title     = {The Regret of Exploration and the Control of Bad Episodes in Reinforcement Learning},
  author    = {Boone, Victor and Gaujal, Bruno},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {2824--2856},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/boone23a/boone23a.pdf},
  url       = {https://proceedings.mlr.press/v202/boone23a.html},
  abstract  = {The first contribution of this paper is the introduction of a new performance measure of a RL algorithm that is more discriminating than the regret, that we call the regret of exploration that measures the asymptotic cost of exploration. The second contribution is a new performance test (PT) to end episodes in RL optimistic algorithms. This test is based on the performance of the current policy with respect to the best policy over the current confidence set. This is in contrast with all existing RL algorithms whose episode lengths are only based on the number of visits to the states. This modification does not harm the regret and brings an additional property. We show that while all current episodic RL algorithms have a linear regret of exploration, our method has a $O(\log{T})$ regret of exploration for non-degenerate deterministic MDPs.}
}
Endnote
%0 Conference Paper
%T The Regret of Exploration and the Control of Bad Episodes in Reinforcement Learning
%A Victor Boone
%A Bruno Gaujal
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-boone23a
%I PMLR
%P 2824--2856
%U https://proceedings.mlr.press/v202/boone23a.html
%V 202
%X The first contribution of this paper is the introduction of a new performance measure of a RL algorithm that is more discriminating than the regret, that we call the regret of exploration that measures the asymptotic cost of exploration. The second contribution is a new performance test (PT) to end episodes in RL optimistic algorithms. This test is based on the performance of the current policy with respect to the best policy over the current confidence set. This is in contrast with all existing RL algorithms whose episode lengths are only based on the number of visits to the states. This modification does not harm the regret and brings an additional property. We show that while all current episodic RL algorithms have a linear regret of exploration, our method has a $O(\log{T})$ regret of exploration for non-degenerate deterministic MDPs.
APA
Boone, V. & Gaujal, B. (2023). The Regret of Exploration and the Control of Bad Episodes in Reinforcement Learning. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:2824-2856. Available from https://proceedings.mlr.press/v202/boone23a.html.

Related Material