Policy Optimization for CMDPs with Bandit Feedback: Learning Stochastic and Adversarial Constraints

Francesco Emanuele Stradi, Anna Lunghi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:56952-56989, 2025.

Abstract

We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Second, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, while, when they are adversarial, it attains $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.
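For context, the bounds above refer to the standard online-CMDP notions of cumulative regret and cumulative constraint violation; the sketch below uses illustrative notation ($f_t$, $g_{t,i}$, $\alpha_i$, $\pi^\star$) and may differ in minor details from the paper's exact definitions:

$$R_T = \sum_{t=1}^{T} \big( f_t(\pi^\star) - f_t(\pi_t) \big), \qquad V_T = \max_{i} \left[ \sum_{t=1}^{T} \big( g_{t,i}(\pi_t) - \alpha_i \big) \right]^{+},$$

where $\pi_t$ is the policy played at episode $t$, $\pi^\star$ is a best policy satisfying the constraints, $f_t(\pi)$ is the expected reward of policy $\pi$ at episode $t$, $g_{t,i}(\pi)$ is the expected value of the $i$-th constraint cost, and $\alpha_i$ is the corresponding threshold. Under this reading, the guarantee stated in the abstract asks for both $R_T$ and $V_T$ to be $\widetilde{\mathcal{O}}(\sqrt{T})$ when constraints are stochastic, while under adversarial constraints the violation remains $\widetilde{\mathcal{O}}(\sqrt{T})$ and the reward benchmark is relaxed to a tight fraction of the optimal reward.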

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-stradi25b, title = {Policy Optimization for {CMDP}s with Bandit Feedback: Learning Stochastic and Adversarial Constraints}, author = {Stradi, Francesco Emanuele and Lunghi, Anna and Castiglioni, Matteo and Marchesi, Alberto and Gatti, Nicola}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {56952--56989}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/stradi25b/stradi25b.pdf}, url = {https://proceedings.mlr.press/v267/stradi25b.html}, abstract = {We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Second, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, while, when they are adversarial, it attains $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.} }
Endnote
%0 Conference Paper %T Policy Optimization for CMDPs with Bandit Feedback: Learning Stochastic and Adversarial Constraints %A Francesco Emanuele Stradi %A Anna Lunghi %A Matteo Castiglioni %A Alberto Marchesi %A Nicola Gatti %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-stradi25b %I PMLR %P 56952--56989 %U https://proceedings.mlr.press/v267/stradi25b.html %V 267 %X We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Second, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, while, when they are adversarial, it attains $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.
APA
Stradi, F.E., Lunghi, A., Castiglioni, M., Marchesi, A. & Gatti, N. (2025). Policy Optimization for CMDPs with Bandit Feedback: Learning Stochastic and Adversarial Constraints. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:56952-56989. Available from https://proceedings.mlr.press/v267/stradi25b.html.