Online Learning in CMDPs: Handling Stochastic and Adversarial Constraints

Francesco Emanuele Stradi, Jacopo Germano, Gianmarco Genalti, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:46692-46721, 2024.

Abstract

We study online learning in episodic constrained Markov decision processes (CMDPs), where the learner aims to collect as much reward as possible over the episodes while satisfying some long-term constraints during the learning process. Rewards and constraints can be selected either stochastically or adversarially, and the transition function is not known to the learner. While online learning in classical (unconstrained) MDPs has received considerable attention in recent years, the setting of CMDPs is still largely unexplored. This is surprising, since in real-world applications, such as autonomous driving, automated bidding, and recommender systems, there are usually additional constraints and specifications that an agent has to obey during the learning process. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with long-term constraints, in the flavor of Balseiro et al. (2023). Our algorithm can handle settings in which rewards and constraints are selected either stochastically or adversarially, without requiring any knowledge of the underlying process. Moreover, our algorithm matches state-of-the-art regret and constraint violation bounds in settings where constraints are selected stochastically, and it is the first to provide guarantees when they are chosen adversarially.
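
To make the guarantees concrete: performance in this line of work is typically measured by cumulative regret and cumulative constraint violation. Below is a minimal sketch of these standard metrics, assuming an episodic formulation with episodes t = 1, ..., T, learner policies \pi_t, reward functions r_t, and m constraint cost functions g_{t,i}; the exact baselines and normalizations may differ from the paper's own definitions.

    R_T = \max_{\pi \in \Pi_{\mathrm{safe}}} \sum_{t=1}^{T} r_t(\pi) \;-\; \sum_{t=1}^{T} r_t(\pi_t),
    \qquad
    V_T = \max_{i \in [m]} \Big[ \sum_{t=1}^{T} g_{t,i}(\pi_t) \Big]^{+},

where \Pi_{\mathrm{safe}} denotes the set of policies satisfying the constraints (in expectation, in the stochastic case) and [\cdot]^{+} is the positive part. A best-of-both-worlds guarantee then asks that both R_T and V_T grow sublinearly in T, with rates that hold whether rewards and constraints are generated stochastically or adversarially.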

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-stradi24a,
  title     = {Online Learning in {CMDP}s: Handling Stochastic and Adversarial Constraints},
  author    = {Stradi, Francesco Emanuele and Germano, Jacopo and Genalti, Gianmarco and Castiglioni, Matteo and Marchesi, Alberto and Gatti, Nicola},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {46692--46721},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/stradi24a/stradi24a.pdf},
  url       = {https://proceedings.mlr.press/v235/stradi24a.html}
}
Endnote
%0 Conference Paper
%T Online Learning in CMDPs: Handling Stochastic and Adversarial Constraints
%A Francesco Emanuele Stradi
%A Jacopo Germano
%A Gianmarco Genalti
%A Matteo Castiglioni
%A Alberto Marchesi
%A Nicola Gatti
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-stradi24a
%I PMLR
%P 46692--46721
%U https://proceedings.mlr.press/v235/stradi24a.html
%V 235
APA
Stradi, F. E., Germano, J., Genalti, G., Castiglioni, M., Marchesi, A. & Gatti, N. (2024). Online Learning in CMDPs: Handling Stochastic and Adversarial Constraints. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:46692-46721. Available from https://proceedings.mlr.press/v235/stradi24a.html.
