Scalable Safe Policy Improvement for Factored Multi-Agent MDPs

Federico Bianchi, Edoardo Zorzi, Alberto Castellini, Thiago D. Simão, Matthijs T. J. Spaan, Alessandro Farinelli
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:3952-3973, 2024.

Abstract

In this work, we focus on safe policy improvement in multi-agent domains where current state-of-the-art methods cannot be effectively applied because of large state and action spaces. We consider recent results using Monte Carlo Tree Search for Safe Policy Improvement with Baseline Bootstrapping and propose a novel algorithm that scales this approach to multi-agent domains, exploiting the factorization of the transition model and value function. Given a centralized behavior policy and a dataset of trajectories, our algorithm generates an improved policy by selecting joint actions using a novel extension of Max-Plus (or Variable Elimination) that constrains local actions to guarantee safety criteria. An empirical evaluation on multi-agent SysAdmin and multi-UAV Delivery shows that the approach scales to very large domains where state-of-the-art methods cannot work.
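To make the two ideas in the abstract concrete, here is a minimal, self-contained Python sketch (not the authors' implementation): a SPIBB-style bootstrapping criterion that lets each agent deviate from the baseline policy only on (state, action) pairs with enough support in the dataset, and Max-Plus message passing on a coordination graph that selects a joint action under those local constraints. All names here (N_WEDGE, local_q, pairwise_q, allowed, and so on) are illustrative assumptions, not the paper's API.

```python
from collections import defaultdict

N_WEDGE = 10  # assumed bootstrapping threshold: below this count, follow the baseline

def allowed_actions(i, s, actions, counts, baseline):
    """SPIBB-style criterion per agent: local actions whose (agent, state, action)
    count in the dataset reaches N_WEDGE; otherwise fall back to the baseline action."""
    safe = [a for a in actions if counts[(i, s, a)] >= N_WEDGE]
    return safe or [baseline(i, s)]

def constrained_max_plus(agents, edges, local_q, pairwise_q, allowed, iters=10):
    """Approximately maximize sum_i local_q[i][a_i] + sum_(i,j) pairwise_q[(i,j)][(a_i,a_j)]
    over joint actions, with each a_i restricted to allowed[i].
    Exact on tree-structured coordination graphs, approximate with cycles."""
    nbrs = defaultdict(set)
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)

    def pay(i, j, a_i, a_j):
        # pairwise payoff, accepting either key order for the edge
        if (i, j) in pairwise_q:
            return pairwise_q[(i, j)][(a_i, a_j)]
        return pairwise_q[(j, i)][(a_j, a_i)]

    msg = defaultdict(float)  # msg[(i, j, a_j)]: message from agent i to j about j's action
    for _ in range(iters):
        new = {}
        for i in agents:
            for j in nbrs[i]:
                for a_j in allowed[j]:
                    new[(i, j, a_j)] = max(
                        local_q[i][a_i] + pay(i, j, a_i, a_j)
                        + sum(msg[(k, i, a_i)] for k in nbrs[i] - {j})
                        for a_i in allowed[i])
        msg = defaultdict(float, new)

    # each agent reads off its best allowed local action from its incoming messages
    return {i: max(allowed[i],
                   key=lambda a_i: local_q[i][a_i]
                   + sum(msg[(k, i, a_i)] for k in nbrs[i]))
            for i in agents}

# toy usage: two agents, one edge; agent 1's data only supports the baseline action 'b'
agents = [0, 1]
edges = [(0, 1)]
local_q = {0: {'a': 1.0, 'b': 0.0}, 1: {'a': 0.5, 'b': 0.2}}
pairwise_q = {(0, 1): {('a', 'a'): 2.0, ('a', 'b'): 0.0,
                       ('b', 'a'): 0.0, ('b', 'b'): 0.5}}
allowed = {0: ['a', 'b'], 1: ['b']}  # as would be returned by allowed_actions per agent
print(constrained_max_plus(agents, edges, local_q, pairwise_q, allowed))  # {0: 'a', 1: 'b'}
```

On a chain of agents this coordination step can equivalently be done exactly by variable elimination; on graphs with cycles a small fixed number of Max-Plus passes is the usual heuristic, which is what the fixed iters above stands in for.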

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-bianchi24b,
  title     = {Scalable Safe Policy Improvement for Factored Multi-Agent {MDP}s},
  author    = {Bianchi, Federico and Zorzi, Edoardo and Castellini, Alberto and Sim\~{a}o, Thiago D. and Spaan, Matthijs T. J. and Farinelli, Alessandro},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {3952--3973},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/bianchi24b/bianchi24b.pdf},
  url       = {https://proceedings.mlr.press/v235/bianchi24b.html},
  abstract  = {In this work, we focus on safe policy improvement in multi-agent domains where current state-of-the-art methods cannot be effectively applied because of large state and action spaces. We consider recent results using Monte Carlo Tree Search for Safe Policy Improvement with Baseline Bootstrapping and propose a novel algorithm that scales this approach to multi-agent domains, exploiting the factorization of the transition model and value function. Given a centralized behavior policy and a dataset of trajectories, our algorithm generates an improved policy by selecting joint actions using a novel extension of Max-Plus (or Variable Elimination) that constrains local actions to guarantee safety criteria. An empirical evaluation on multi-agent SysAdmin and multi-UAV Delivery shows that the approach scales to very large domains where state-of-the-art methods cannot work.}
}
Endnote
%0 Conference Paper
%T Scalable Safe Policy Improvement for Factored Multi-Agent MDPs
%A Federico Bianchi
%A Edoardo Zorzi
%A Alberto Castellini
%A Thiago D. Simão
%A Matthijs T. J. Spaan
%A Alessandro Farinelli
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-bianchi24b
%I PMLR
%P 3952--3973
%U https://proceedings.mlr.press/v235/bianchi24b.html
%V 235
%X In this work, we focus on safe policy improvement in multi-agent domains where current state-of-the-art methods cannot be effectively applied because of large state and action spaces. We consider recent results using Monte Carlo Tree Search for Safe Policy Improvement with Baseline Bootstrapping and propose a novel algorithm that scales this approach to multi-agent domains, exploiting the factorization of the transition model and value function. Given a centralized behavior policy and a dataset of trajectories, our algorithm generates an improved policy by selecting joint actions using a novel extension of Max-Plus (or Variable Elimination) that constrains local actions to guarantee safety criteria. An empirical evaluation on multi-agent SysAdmin and multi-UAV Delivery shows that the approach scales to very large domains where state-of-the-art methods cannot work.
APA
Bianchi, F., Zorzi, E., Castellini, A., Simão, T.D., Spaan, M.T.J. & Farinelli, A. (2024). Scalable Safe Policy Improvement for Factored Multi-Agent MDPs. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:3952-3973. Available from https://proceedings.mlr.press/v235/bianchi24b.html.
