Efficient Model-Based Concave Utility Reinforcement Learning through Greedy Mirror Descent

Bianca M. Moreno, Margaux Bregere, Pierre Gaillard, Nadia Oudjane
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:2206-2214, 2024.

Abstract

Many machine learning tasks can be solved by minimizing a convex function of an occupancy measure over the policies that generate them. These tasks include reinforcement learning and imitation learning, among others. This more general paradigm is called the Concave Utility Reinforcement Learning problem (CURL). Since CURL invalidates classical Bellman equations, it requires new algorithms. We introduce MD-CURL, a new algorithm for CURL in a finite horizon Markov decision process. MD-CURL is inspired by mirror descent and uses a non-standard regularization to achieve convergence guarantees and a simple closed-form solution, eliminating the need for computationally expensive projection steps typically found in mirror descent approaches. We then extend CURL to an online learning scenario and present Greedy MD-CURL, a new method adapting MD-CURL to an online, episode-based setting with partially unknown dynamics. Like MD-CURL, the online version Greedy MD-CURL benefits from low computational complexity, while guaranteeing sub-linear or even logarithmic regret, depending on the level of information available on the underlying dynamics.
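To illustrate the kind of closed-form update the abstract alludes to, the sketch below shows a generic mirror-descent step with an entropy (KL) regularizer on the probability simplex. This is not the paper's MD-CURL algorithm or its non-standard regularization; it is only a minimal, assumed example of why an entropy-type regularizer removes the explicit projection step: the Bregman projection becomes a multiplicative update followed by renormalization.

```python
import math

def mirror_descent_step(p, grad, eta):
    """One exponentiated-gradient (KL mirror descent) step on the simplex.

    The update p_i <- p_i * exp(-eta * grad_i), renormalized, is the
    closed-form solution of the regularized subproblem -- no separate
    Euclidean projection is needed.
    """
    q = [pi * math.exp(-eta * gi) for pi, gi in zip(p, grad)]
    s = sum(q)
    return [qi / s for qi in q]

# Toy convex objective over a distribution: F(p) = ||p - target||^2 / 2,
# whose gradient is p - target. (Purely illustrative choice of F.)
target = [0.7, 0.2, 0.1]
p = [1 / 3, 1 / 3, 1 / 3]
for _ in range(200):
    grad = [pi - ti for pi, ti in zip(p, target)]
    p = mirror_descent_step(p, grad, eta=0.5)
```

After the loop, `p` stays a valid distribution at every step and approaches the minimizer `target`, which is the behavior that makes projection-free mirror updates attractive computationally.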

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-moreno24a,
  title     = {Efficient Model-Based Concave Utility Reinforcement Learning through Greedy Mirror Descent},
  author    = {Moreno, Bianca M. and Bregere, Margaux and Gaillard, Pierre and Oudjane, Nadia},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {2206--2214},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/moreno24a/moreno24a.pdf},
  url       = {https://proceedings.mlr.press/v238/moreno24a.html},
  abstract  = {Many machine learning tasks can be solved by minimizing a convex function of an occupancy measure over the policies that generate them. These include reinforcement learning, imitation learning, among others. This more general paradigm is called the Concave Utility Reinforcement Learning problem (CURL). Since CURL invalidates classical Bellman equations, it requires new algorithms. We introduce MD-CURL, a new algorithm for CURL in a finite horizon Markov decision process. MD-CURL is inspired by mirror descent and uses a non-standard regularization to achieve convergence guarantees and a simple closed-form solution, eliminating the need for computationally expensive projection steps typically found in mirror descent approaches. We then extend CURL to an online learning scenario and present Greedy MD-CURL, a new method adapting MD-CURL to an online, episode-based setting with partially unknown dynamics. Like MD-CURL, the online version Greedy MD-CURL benefits from low computational complexity, while guaranteeing sub-linear or even logarithmic regret, depending on the level of information available on the underlying dynamics.}
}
Endnote
%0 Conference Paper
%T Efficient Model-Based Concave Utility Reinforcement Learning through Greedy Mirror Descent
%A Bianca M. Moreno
%A Margaux Bregere
%A Pierre Gaillard
%A Nadia Oudjane
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-moreno24a
%I PMLR
%P 2206--2214
%U https://proceedings.mlr.press/v238/moreno24a.html
%V 238
%X Many machine learning tasks can be solved by minimizing a convex function of an occupancy measure over the policies that generate them. These include reinforcement learning, imitation learning, among others. This more general paradigm is called the Concave Utility Reinforcement Learning problem (CURL). Since CURL invalidates classical Bellman equations, it requires new algorithms. We introduce MD-CURL, a new algorithm for CURL in a finite horizon Markov decision process. MD-CURL is inspired by mirror descent and uses a non-standard regularization to achieve convergence guarantees and a simple closed-form solution, eliminating the need for computationally expensive projection steps typically found in mirror descent approaches. We then extend CURL to an online learning scenario and present Greedy MD-CURL, a new method adapting MD-CURL to an online, episode-based setting with partially unknown dynamics. Like MD-CURL, the online version Greedy MD-CURL benefits from low computational complexity, while guaranteeing sub-linear or even logarithmic regret, depending on the level of information available on the underlying dynamics.
APA
Moreno, B.M., Bregere, M., Gaillard, P. & Oudjane, N. (2024). Efficient Model-Based Concave Utility Reinforcement Learning through Greedy Mirror Descent. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:2206-2214. Available from https://proceedings.mlr.press/v238/moreno24a.html.