Planning and Learning in Risk-Aware Restless Multi-Arm Bandits

Nima Akbarzadeh, Yossiri Adulyasak, Erick Delage
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:991-999, 2025.

Abstract

In restless multi-arm bandits, a central agent is tasked with optimally distributing limited resources across several bandits (arms), with each arm being a Markov decision process. In this work, we generalize the traditional restless multi-arm bandit problem with a risk-neutral objective by incorporating risk-awareness. We establish indexability conditions for the case of a risk-aware objective and provide a solution based on Whittle index. In addition, we address the learning problem when the true transition probabilities are unknown by proposing a Thompson sampling approach and show that it achieves bounded regret that scales sublinearly with the number of episodes and quadratically with the number of arms. The efficacy of our method in reducing risk exposure in restless multi-arm bandits is illustrated through a set of numerical experiments in the contexts of machine replacement and patient scheduling applications under both planning and learning setups.
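The Whittle index approach described above reduces arm selection to a priority rule: compute an index per arm and activate the highest-index arms within the resource budget. The sketch below illustrates that selection step only, with placeholder index values; it is not the paper's risk-aware index computation, which requires the indexability conditions and MDP structure established in the text.

```python
import numpy as np

def select_arms(indices, budget):
    """Activate the `budget` arms with the largest current index values
    (a Whittle-index-style priority rule for restless bandits)."""
    order = np.argsort(indices)[::-1]          # arms sorted by descending index
    active = np.zeros(len(indices), dtype=bool)
    active[order[:budget]] = True              # top-`budget` arms get the resource
    return active

# Toy example: 5 arms, budget of 2 activations per step.
# (Index values here are illustrative, not computed from any model.)
indices = np.array([0.3, 0.9, 0.1, 0.7, 0.5])
print(select_arms(indices, budget=2))  # arms 1 and 3 are activated
```

In the planning setting the indices would come from the risk-aware Whittle computation; in the learning setting, from a model sampled via Thompson sampling at the start of each episode.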

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-akbarzadeh25a,
  title     = {Planning and Learning in Risk-Aware Restless Multi-Arm Bandits},
  author    = {Akbarzadeh, Nima and Adulyasak, Yossiri and Delage, Erick},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages     = {991--999},
  year      = {2025},
  editor    = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume    = {258},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--05 May},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/akbarzadeh25a/akbarzadeh25a.pdf},
  url       = {https://proceedings.mlr.press/v258/akbarzadeh25a.html},
  abstract  = {In restless multi-arm bandits, a central agent is tasked with optimally distributing limited resources across several bandits (arms), with each arm being a Markov decision process. In this work, we generalize the traditional restless multi-arm bandit problem with a risk-neutral objective by incorporating risk-awareness. We establish indexability conditions for the case of a risk-aware objective and provide a solution based on Whittle index. In addition, we address the learning problem when the true transition probabilities are unknown by proposing a Thompson sampling approach and show that it achieves bounded regret that scales sublinearly with the number of episodes and quadratically with the number of arms. The efficacy of our method in reducing risk exposure in restless multi-arm bandits is illustrated through a set of numerical experiments in the contexts of machine replacement and patient scheduling applications under both planning and learning setups.}
}
Endnote
%0 Conference Paper
%T Planning and Learning in Risk-Aware Restless Multi-Arm Bandits
%A Nima Akbarzadeh
%A Yossiri Adulyasak
%A Erick Delage
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-akbarzadeh25a
%I PMLR
%P 991--999
%U https://proceedings.mlr.press/v258/akbarzadeh25a.html
%V 258
%X In restless multi-arm bandits, a central agent is tasked with optimally distributing limited resources across several bandits (arms), with each arm being a Markov decision process. In this work, we generalize the traditional restless multi-arm bandit problem with a risk-neutral objective by incorporating risk-awareness. We establish indexability conditions for the case of a risk-aware objective and provide a solution based on Whittle index. In addition, we address the learning problem when the true transition probabilities are unknown by proposing a Thompson sampling approach and show that it achieves bounded regret that scales sublinearly with the number of episodes and quadratically with the number of arms. The efficacy of our method in reducing risk exposure in restless multi-arm bandits is illustrated through a set of numerical experiments in the contexts of machine replacement and patient scheduling applications under both planning and learning setups.
APA
Akbarzadeh, N., Adulyasak, Y. &amp; Delage, E. (2025). Planning and Learning in Risk-Aware Restless Multi-Arm Bandits. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:991-999. Available from https://proceedings.mlr.press/v258/akbarzadeh25a.html.