Towards Achieving Sub-linear Regret and Hard Constraint Violation in Model-free RL

Arnob Ghosh; Xingyu Zhou; Ness Shroff

Towards Achieving Sub-linear Regret and Hard Constraint Violation in Model-free RL

Arnob Ghosh, Xingyu Zhou, Ness Shroff

Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:1054-1062, 2024.

Abstract

We study the constrained Markov decision processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. Existing approaches have primarily focused on \emph{soft} constraint violation, which allows compensation across episodes, making it easier to satisfy the constraints. In contrast, we consider a stronger \emph{hard} constraint violation metric, where only positive constraint violations are accumulated. Our main result is the development of the \emph{first model-free}, \emph{simulator-free} algorithm that achieves a sub-linear regret and a sub-linear hard constraint violation simultaneously, even in \emph{large-scale} systems. In particular, we show that

$\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ regret and

$\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ hard constraint violation bounds can be achieved, where

$K$ is the number of episodes,

$d$ is the dimension of the feature mapping,

$H$ is the length of the episode. Our results are achieved via novel adaptations of the primal-dual LSVI-UCB algorithm, i.e., it searches for the dual variable that balances between regret and constraint violation within every episode, rather than updating it at the end of each episode. This turns out to be crucial for our theoretical guarantees when dealing with hard constraint violations.

Cite this Paper

BibTeX

@InProceedings{pmlr-v238-ghosh24a,
  title = 	 {Towards Achieving Sub-linear Regret and Hard Constraint Violation in Model-free {RL}},
  author =       {Ghosh, Arnob and Zhou, Xingyu and Shroff, Ness},
  booktitle = 	 {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {1054--1062},
  year = 	 {2024},
  editor = 	 {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume = 	 {238},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {02--04 May},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v238/ghosh24a/ghosh24a.pdf},
  url = 	 {https://proceedings.mlr.press/v238/ghosh24a.html},
  abstract = 	 {We study the constrained Markov decision processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. Existing approaches have primarily focused on \emph{soft} constraint violation, which allows compensation across episodes, making it easier to satisfy the constraints. In contrast, we consider a stronger \emph{hard} constraint violation metric, where only positive constraint violations are accumulated. Our main result is the development of the \emph{first model-free}, \emph{simulator-free} algorithm that achieves a sub-linear regret and a sub-linear hard constraint violation simultaneously, even in \emph{large-scale} systems. In particular, we show that $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ hard constraint violation bounds can be achieved, where $K$ is the number of episodes, $d$ is the dimension of the feature mapping, $H$ is the length of the episode. Our results are achieved via novel adaptations of the primal-dual LSVI-UCB algorithm, i.e., it searches for the dual variable that balances between regret and constraint violation within every episode, rather than updating it at the end of each episode. This turns out to be crucial for our theoretical guarantees when dealing with hard constraint violations.}
}

Endnote

%0 Conference Paper
%T Towards Achieving Sub-linear Regret and Hard Constraint Violation in Model-free RL
%A Arnob Ghosh
%A Xingyu Zhou
%A Ness Shroff
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li	
%F pmlr-v238-ghosh24a
%I PMLR
%P 1054--1062
%U https://proceedings.mlr.press/v238/ghosh24a.html
%V 238
%X We study the constrained Markov decision processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. Existing approaches have primarily focused on \emph{soft} constraint violation, which allows compensation across episodes, making it easier to satisfy the constraints. In contrast, we consider a stronger \emph{hard} constraint violation metric, where only positive constraint violations are accumulated. Our main result is the development of the \emph{first model-free}, \emph{simulator-free} algorithm that achieves a sub-linear regret and a sub-linear hard constraint violation simultaneously, even in \emph{large-scale} systems. In particular, we show that $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ hard constraint violation bounds can be achieved, where $K$ is the number of episodes, $d$ is the dimension of the feature mapping, $H$ is the length of the episode. Our results are achieved via novel adaptations of the primal-dual LSVI-UCB algorithm, i.e., it searches for the dual variable that balances between regret and constraint violation within every episode, rather than updating it at the end of each episode. This turns out to be crucial for our theoretical guarantees when dealing with hard constraint violations.

APA

Ghosh, A., Zhou, X. & Shroff, N.. (2024). Towards Achieving Sub-linear Regret and Hard Constraint Violation in Model-free RL. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:1054-1062 Available from https://proceedings.mlr.press/v238/ghosh24a.html.

Towards Achieving Sub-linear Regret and Hard Constraint Violation in Model-free RL

Abstract

Cite this Paper

Related Material