No-Regret Reinforcement Learning in Smooth MDPs

Davide Maran, Alberto Maria Metelli, Matteo Papini, Marcello Restelli
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:34760-34789, 2024.

Abstract

Obtaining no-regret guarantees for reinforcement learning (RL) in the case of problems with continuous state and/or action spaces is still one of the major open challenges in the field. Recently, a variety of solutions have been proposed, but besides very specific settings, the general problem remains unsolved. In this paper, we introduce a novel structural assumption on the Markov decision processes (MDPs), namely $\nu-$smoothness, that generalizes most of the settings proposed so far (e.g., linear MDPs and Lipschitz MDPs). To face this challenging scenario, we propose two algorithms for regret minimization in $\nu-$smooth MDPs. Both algorithms build upon the idea of constructing an MDP representation through an orthogonal feature map based on Legendre polynomials. The first algorithm, Legendre-Eleanor, archives the no-regret property under weaker assumptions but is computationally inefficient, whereas the second one, Legendre-LSVI, runs in polynomial time, although for a smaller class of problems. After analyzing their regret properties, we compare our results with state-of-the-art ones from RL theory, showing that our algorithms achieve the best guarantees.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-maran24a, title = {No-Regret Reinforcement Learning in Smooth {MDP}s}, author = {Maran, Davide and Metelli, Alberto Maria and Papini, Matteo and Restelli, Marcello}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {34760--34789}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/maran24a/maran24a.pdf}, url = {https://proceedings.mlr.press/v235/maran24a.html}, abstract = {Obtaining no-regret guarantees for reinforcement learning (RL) in the case of problems with continuous state and/or action spaces is still one of the major open challenges in the field. Recently, a variety of solutions have been proposed, but besides very specific settings, the general problem remains unsolved. In this paper, we introduce a novel structural assumption on the Markov decision processes (MDPs), namely $\nu-$smoothness, that generalizes most of the settings proposed so far (e.g., linear MDPs and Lipschitz MDPs). To face this challenging scenario, we propose two algorithms for regret minimization in $\nu-$smooth MDPs. Both algorithms build upon the idea of constructing an MDP representation through an orthogonal feature map based on Legendre polynomials. The first algorithm, Legendre-Eleanor, archives the no-regret property under weaker assumptions but is computationally inefficient, whereas the second one, Legendre-LSVI, runs in polynomial time, although for a smaller class of problems. After analyzing their regret properties, we compare our results with state-of-the-art ones from RL theory, showing that our algorithms achieve the best guarantees.} }
Endnote
%0 Conference Paper %T No-Regret Reinforcement Learning in Smooth MDPs %A Davide Maran %A Alberto Maria Metelli %A Matteo Papini %A Marcello Restelli %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-maran24a %I PMLR %P 34760--34789 %U https://proceedings.mlr.press/v235/maran24a.html %V 235 %X Obtaining no-regret guarantees for reinforcement learning (RL) in the case of problems with continuous state and/or action spaces is still one of the major open challenges in the field. Recently, a variety of solutions have been proposed, but besides very specific settings, the general problem remains unsolved. In this paper, we introduce a novel structural assumption on the Markov decision processes (MDPs), namely $\nu-$smoothness, that generalizes most of the settings proposed so far (e.g., linear MDPs and Lipschitz MDPs). To face this challenging scenario, we propose two algorithms for regret minimization in $\nu-$smooth MDPs. Both algorithms build upon the idea of constructing an MDP representation through an orthogonal feature map based on Legendre polynomials. The first algorithm, Legendre-Eleanor, archives the no-regret property under weaker assumptions but is computationally inefficient, whereas the second one, Legendre-LSVI, runs in polynomial time, although for a smaller class of problems. After analyzing their regret properties, we compare our results with state-of-the-art ones from RL theory, showing that our algorithms achieve the best guarantees.
APA
Maran, D., Metelli, A.M., Papini, M. & Restelli, M.. (2024). No-Regret Reinforcement Learning in Smooth MDPs. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:34760-34789 Available from https://proceedings.mlr.press/v235/maran24a.html.

Related Material