Online Episodic Convex Reinforcement Learning

Bianca Marin Moreno; Khaled Eldowa; Pierre Gaillard; Margaux Brégère; Nadia Oudjane

Online Episodic Convex Reinforcement Learning

Bianca Marin Moreno, Khaled Eldowa, Pierre Gaillard, Margaux Brégère, Nadia Oudjane

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:44775-44824, 2025.

Abstract

We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent’s policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge on the transition function. To achieve this, we use a novel online mirror descent algorithm with variable constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent’s policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-moreno25a,
  title = 	 {Online Episodic Convex Reinforcement Learning},
  author =       {Moreno, Bianca Marin and Eldowa, Khaled and Gaillard, Pierre and Br\'{e}g\`{e}re, Margaux and Oudjane, Nadia},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {44775--44824},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/moreno25a/moreno25a.pdf},
  url = 	 {https://proceedings.mlr.press/v267/moreno25a.html},
  abstract = 	 {We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent’s policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge on the transition function. To achieve this, we use a novel online mirror descent algorithm with variable constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent’s policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.}
}

Endnote

%0 Conference Paper
%T Online Episodic Convex Reinforcement Learning
%A Bianca Marin Moreno
%A Khaled Eldowa
%A Pierre Gaillard
%A Margaux Brégère
%A Nadia Oudjane
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu	
%F pmlr-v267-moreno25a
%I PMLR
%P 44775--44824
%U https://proceedings.mlr.press/v267/moreno25a.html
%V 267
%X We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent’s policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge on the transition function. To achieve this, we use a novel online mirror descent algorithm with variable constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent’s policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.

APA

Moreno, B.M., Eldowa, K., Gaillard, P., Brégère, M. & Oudjane, N.. (2025). Online Episodic Convex Reinforcement Learning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:44775-44824 Available from https://proceedings.mlr.press/v267/moreno25a.html.

Online Episodic Convex Reinforcement Learning

Abstract

Cite this Paper

Related Material