Rectified Pessimistic-Optimistic Learning for Stochastic Continuum-armed Bandit with Constraints

Hengquan Guo, Zhu Qi, Xin Liu
Proceedings of The 5th Annual Learning for Dynamics and Control Conference, PMLR 211:1333-1344, 2023.

Abstract

This paper studies the problem of stochastic continuum-armed bandit with constraints (SCBwC), where we optimize a black-box reward function $f(x)$ subject to a black-box constraint function $g(x)\leq 0$ over a continuous space $\mathcal X$. We model the reward and constraint functions via Gaussian processes (GPs) and propose a Rectified Pessimistic-Optimistic Learning (RPOL) framework, a penalty-based method incorporating optimistic and pessimistic GP bandit learning for the reward and constraint functions, respectively. We consider the metric of cumulative constraint violation $\sum_{t=1}^T(g(x_t))^{+},$ which is strictly stronger than the traditional long-term constraint violation $\sum_{t=1}^Tg(x_t).$ The rectified design of the penalty update and the pessimistic learning for the constraint function in RPOL guarantee that the cumulative constraint violation remains small. RPOL achieves sublinear regret and cumulative constraint violation for SCBwC and its variants (e.g., under delayed feedback). These theoretical results match their unconstrained counterparts. Our experiments demonstrate that RPOL outperforms several existing baseline algorithms.
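To make the penalty-based, pessimistic-optimistic idea concrete, below is a minimal sketch of one plausible round-by-round loop, assuming GP-UCB-style confidence bounds, a uniform grid discretization of $\mathcal X$, and a simple rectified virtual-queue penalty. The toy reward/constraint functions, the confidence width `beta`, and the penalty step `eta` are illustrative assumptions, not the exact algorithm or constants from the paper.

```python
# Sketch only: optimistic UCB for the reward, pessimistic (upper) bound for the
# constraint, and a rectified penalty that grows only with positive violation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Toy black-box reward f and constraint g (g(x) <= 0 is feasible); noisy bandit feedback.
f = lambda x: np.sin(3 * x)          # reward to maximize
g = lambda x: x - 0.6                # constraint: feasible iff x <= 0.6
grid = np.linspace(0.0, 1.0, 201).reshape(-1, 1)

T, beta, eta = 60, 2.0, 1.0          # horizon, confidence width, penalty step (assumed values)
X_hist, f_obs, g_obs = [], [], []
Q = 0.0                              # rectified penalty (virtual queue)

gp_f = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-2)
gp_g = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-2)

for t in range(T):
    if X_hist:
        gp_f.fit(np.array(X_hist), np.array(f_obs))
        gp_g.fit(np.array(X_hist), np.array(g_obs))
        mu_f, sd_f = gp_f.predict(grid, return_std=True)
        mu_g, sd_g = gp_g.predict(grid, return_std=True)
    else:
        mu_f = mu_g = np.zeros(len(grid))
        sd_f = sd_g = np.ones(len(grid))

    ucb_f = mu_f + beta * sd_f       # optimistic estimate of the reward
    ucb_g = mu_g + beta * sd_g       # pessimistic (conservative) estimate of the constraint
    score = ucb_f - Q * np.maximum(ucb_g, 0.0)   # penalize rectified (positive-part) violation
    x_t = grid[np.argmax(score)]

    f_t = f(x_t[0]) + 0.05 * rng.standard_normal()   # noisy reward feedback
    g_t = g(x_t[0]) + 0.05 * rng.standard_normal()   # noisy constraint feedback
    X_hist.append(x_t); f_obs.append(f_t); g_obs.append(g_t)

    # Rectified penalty update: only positive observed violation inflates the penalty.
    Q = max(Q + eta * max(g_t, 0.0), 0.0)

# Cumulative constraint violation metric sum_t (g(x_t))^+ from the observed feedback.
violation = sum(max(v, 0.0) for v in g_obs)
print(f"final x = {X_hist[-1][0]:.3f}, cumulative violation = {violation:.3f}")
```

In this sketch, using the upper confidence bound `ucb_g` for the constraint (rather than its lower bound) is what makes the constraint learning pessimistic: the algorithm only treats a point as safe when the entire confidence interval of $g(x)$ lies below zero, which keeps the rectified violation $\sum_t (g(x_t))^+$ small.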

Cite this Paper


BibTeX
@InProceedings{pmlr-v211-guo23a,
  title     = {Rectified Pessimistic-Optimistic Learning for Stochastic Continuum-armed Bandit with Constraints},
  author    = {Guo, Hengquan and Qi, Zhu and Liu, Xin},
  booktitle = {Proceedings of The 5th Annual Learning for Dynamics and Control Conference},
  pages     = {1333--1344},
  year      = {2023},
  editor    = {Matni, Nikolai and Morari, Manfred and Pappas, George J.},
  volume    = {211},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--16 Jun},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v211/guo23a/guo23a.pdf},
  url       = {https://proceedings.mlr.press/v211/guo23a.html},
  abstract  = {This paper studies the problem of stochastic continuum-armed bandit with constraints (SCBwC), where we optimize a black-box reward function $f(x)$ subject to a black-box constraint function $g(x)\leq 0$ over a continuous space $\mathcal X$. We model reward and constraint functions via Gaussian processes (GPs) and propose a Rectified Pessimistic-Optimistic Learning (RPOL) framework, a penalty-based method incorporating optimistic and pessimistic GP bandit learning for reward and constraint functions, respectively. We consider the metric of cumulative constraint violation $\sum_{t=1}^T(g(x_t))^{+},$ which is strictly stronger than the traditional long-term constraint violation $\sum_{t=1}^Tg(x_t).$ The rectified design for the penalty update and the pessimistic learning for the constraint function in RPOL guarantee the cumulative constraint violation is minimal. RPOL can achieve sublinear regret and cumulative constraint violation for SCBwC and its variants (e.g., under delayed feedback). These theoretical results match their unconstrained counterparts. Our experiments justify RPOL outperforms several existing baseline algorithms.}
}
Endnote
%0 Conference Paper
%T Rectified Pessimistic-Optimistic Learning for Stochastic Continuum-armed Bandit with Constraints
%A Hengquan Guo
%A Zhu Qi
%A Xin Liu
%B Proceedings of The 5th Annual Learning for Dynamics and Control Conference
%C Proceedings of Machine Learning Research
%D 2023
%E Nikolai Matni
%E Manfred Morari
%E George J. Pappas
%F pmlr-v211-guo23a
%I PMLR
%P 1333--1344
%U https://proceedings.mlr.press/v211/guo23a.html
%V 211
%X This paper studies the problem of stochastic continuum-armed bandit with constraints (SCBwC), where we optimize a black-box reward function $f(x)$ subject to a black-box constraint function $g(x)\leq 0$ over a continuous space $\mathcal X$. We model reward and constraint functions via Gaussian processes (GPs) and propose a Rectified Pessimistic-Optimistic Learning (RPOL) framework, a penalty-based method incorporating optimistic and pessimistic GP bandit learning for reward and constraint functions, respectively. We consider the metric of cumulative constraint violation $\sum_{t=1}^T(g(x_t))^{+},$ which is strictly stronger than the traditional long-term constraint violation $\sum_{t=1}^Tg(x_t).$ The rectified design for the penalty update and the pessimistic learning for the constraint function in RPOL guarantee the cumulative constraint violation is minimal. RPOL can achieve sublinear regret and cumulative constraint violation for SCBwC and its variants (e.g., under delayed feedback). These theoretical results match their unconstrained counterparts. Our experiments justify RPOL outperforms several existing baseline algorithms.
APA
Guo, H., Qi, Z. & Liu, X. (2023). Rectified Pessimistic-Optimistic Learning for Stochastic Continuum-armed Bandit with Constraints. Proceedings of The 5th Annual Learning for Dynamics and Control Conference, in Proceedings of Machine Learning Research 211:1333-1344. Available from https://proceedings.mlr.press/v211/guo23a.html.
