Learning Adversarial MDPs with Stochastic Hard Constraints

Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:56920-56951, 2025.

Abstract

We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints, under bandit feedback. We consider three scenarios. In the first one, we address general CMDPs, where we design an algorithm attaining sublinear regret and sublinear cumulative positive constraint violation. In the second scenario, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that the constraints are satisfied at every episode with high probability. In the last scenario, we only assume the existence of a strictly feasible policy, which is not known to the learner, and we design an algorithm attaining sublinear regret and constant cumulative positive constraint violation. Finally, we show that in the last two scenarios, a dependence on Slater's parameter is unavoidable. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. Thus, our algorithms can deal with general non-stationary environments subject to requirements much stricter than those manageable with existing ones, enabling their adoption in a much wider range of applications.
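To make the performance metrics concrete, here is a minimal sketch of the standard definitions of regret and cumulative positive constraint violation over T episodes; the notation (losses ℓ_t, constraint costs g_i, feasible set Π, Slater's parameter ρ) is generic and the paper's exact definitions may differ:

\[
R_T \;=\; \sum_{t=1}^{T} \ell_t(\pi_t) \;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} \ell_t(\pi),
\qquad
V_T \;=\; \sum_{t=1}^{T} \sum_{i} \big[\, g_i(\pi_t) \,\big]_+ ,
\]

where \(\ell_t(\pi)\) is the expected (adversarial) loss of policy \(\pi\) at episode \(t\), \(g_i(\pi)\) is the expected cost of constraint \(i\) minus its threshold, \([x]_+ = \max\{x, 0\}\) (so only violations, not slack, accumulate), and \(\Pi\) is the set of feasible policies. Under this notation, Slater's parameter \(\rho > 0\) is the margin by which a strictly feasible policy \(\bar{\pi}\) satisfies every constraint, i.e., \(g_i(\bar{\pi}) \le -\rho\) for all \(i\).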

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-stradi25a,
  title     = {Learning Adversarial {MDP}s with Stochastic Hard Constraints},
  author    = {Stradi, Francesco Emanuele and Castiglioni, Matteo and Marchesi, Alberto and Gatti, Nicola},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {56920--56951},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/stradi25a/stradi25a.pdf},
  url       = {https://proceedings.mlr.press/v267/stradi25a.html}
}
Endnote
%0 Conference Paper
%T Learning Adversarial MDPs with Stochastic Hard Constraints
%A Francesco Emanuele Stradi
%A Matteo Castiglioni
%A Alberto Marchesi
%A Nicola Gatti
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-stradi25a
%I PMLR
%P 56920--56951
%U https://proceedings.mlr.press/v267/stradi25a.html
%V 267
APA
Stradi, F. E., Castiglioni, M., Marchesi, A., & Gatti, N. (2025). Learning Adversarial MDPs with Stochastic Hard Constraints. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:56920-56951. Available from https://proceedings.mlr.press/v267/stradi25a.html.