Safe Cooperative Multi-Agent Reinforcement Learning with Function Approximation

Hao-Lun Hsu, Miroslav Pajic
Proceedings of the 7th Annual Learning for Dynamics & Control Conference, PMLR 283:1353-1364, 2025.

Abstract

Cooperative multi-agent reinforcement learning (MARL) has shown significant promise in dynamic control environments, where effective communication and tailored exploration strategies facilitate collaboration. However, ensuring safe exploration remains challenging, as even a single unsafe action from any agent can lead to severe consequences. To mitigate this risk, we introduce Scoop-LSVI, a UCB-based cooperative parallel RL framework that achieves low cumulative regret with minimal communication demands while adhering to safety constraints. This framework enables multiple agents to concurrently solve isolated Markov Decision Processes (MDPs) and share information to enhance learning efficiency. Scoop-LSVI attains a regret of $\tilde{O}(\kappa d^{3/2} H^2 \sqrt{MK})$, where $d$ is the feature dimension, $H$ is the horizon length, $M$ is the number of agents, $K$ is the number of episodes for each agent, and $\kappa$ represents safety constraints. This result aligns with state-of-the-art findings for unsafe cooperative MARL and also matches the regret bounds of UCB-based single-agent RL algorithms ($M = 1$), highlighting the potential of Scoop-LSVI to support safe and efficient learning in cooperative MARL applications.
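
For readers unfamiliar with the backbone of such UCB-based methods, the sketch below shows a generic single-step least-squares value iteration (LSVI) backup with an optimism bonus, applied to transition data pooled from all M agents. This is an illustrative sketch under assumed names and shapes, not the paper's Scoop-LSVI algorithm: it omits the safety constraint (the kappa factor) and the communication-efficiency mechanism, and the bonus scale beta and regularizer lam are placeholder hyperparameters.

import numpy as np

def optimistic_lsvi_backup(phi, rewards, v_next, phi_query, lam=1.0, beta=1.0, H=10):
    """Ridge regression on Bellman targets plus an elliptical exploration bonus.

    Illustrative only (not Scoop-LSVI); names and shapes are assumptions.
      phi       : (n, d) features of (state, action) pairs pooled from all M agents
      rewards   : (n,)   observed rewards for those pairs
      v_next    : (n,)   current value estimates at the successor states
      phi_query : (m, d) features at which to evaluate the optimistic Q-values
    """
    n, d = phi.shape
    # Regularized Gram matrix: Lambda = sum_i phi_i phi_i^T + lam * I
    Lambda = phi.T @ phi + lam * np.eye(d)
    Lambda_inv = np.linalg.inv(Lambda)
    # Least-squares weights fitting the Bellman targets r + V(s')
    w = Lambda_inv @ (phi.T @ (rewards + v_next))
    # UCB-style bonus: beta * ||phi(s,a)||_{Lambda^{-1}} at the query points
    bonus = beta * np.sqrt(np.einsum('ij,jk,ik->i', phi_query, Lambda_inv, phi_query))
    # Optimistic Q-estimate, truncated at the horizon as in episodic analyses
    return np.minimum(phi_query @ w + bonus, H)

In an episodic run this backup would be applied backwards over steps h = H, ..., 1 each time the agents synchronize their data; roughly speaking, pooling the MK episodes of experience across agents is what makes a sqrt(MK)-scale regret attainable, and a safe variant would additionally restrict the greedy maximization to actions certified to satisfy the constraint.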

Cite this Paper


BibTeX
@InProceedings{pmlr-v283-hsu25a, title = {Safe Cooperative Multi-Agent Reinforcement Learning with Function Approximation}, author = {Hsu, Hao-Lun and Pajic, Miroslav}, booktitle = {Proceedings of the 7th Annual Learning for Dynamics \& Control Conference}, pages = {1353--1364}, year = {2025}, editor = {Ozay, Necmiye and Balzano, Laura and Panagou, Dimitra and Abate, Alessandro}, volume = {283}, series = {Proceedings of Machine Learning Research}, month = {04--06 Jun}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v283/main/assets/hsu25a/hsu25a.pdf}, url = {https://proceedings.mlr.press/v283/hsu25a.html}, abstract = {Cooperative multi-agent reinforcement learning (MARL) has shown significant promise in dynamic control environments, where effective communication and tailored exploration strategies facilitate collaboration. However, ensuring safe exploration remains challenging, as even a single unsafe action from any agent can lead to severe consequences. To mitigate this risk, we introduce Scoop-LSVI, a UCB-based cooperative parallel RL framework that achieves low cumulative regret with minimal communication demands while adhering to safety constraints. This framework enables multiple agents to concurrently solve isolated Markov Decision Processes (MDPs) and share information to enhance learning efficiency. Scoop-LSVI attains a regret of $\Tilde{O}(\kappa d^{3/2} H^2 \sqrt{MK})$, where $d$ is the feature dimension, $H$ is the horizon length, $M$ is the number of agents, $K$ is the number of episodes for each agent, and $\kappa$ represents safety constraints. This result aligns with state-of-the-art findings for unsafe cooperative MARL and also matches the regret bounds of UCB-based single-agent RL algorithms ($M = 1$), highlighting the potential of Scoop-LSVI to support safe and efficient learning in cooperative MARL applications.} }
Endnote
%0 Conference Paper %T Safe Cooperative Multi-Agent Reinforcement Learning with Function Approximation %A Hao-Lun Hsu %A Miroslav Pajic %B Proceedings of the 7th Annual Learning for Dynamics & Control Conference %C Proceedings of Machine Learning Research %D 2025 %E Necmiye Ozay %E Laura Balzano %E Dimitra Panagou %E Alessandro Abate %F pmlr-v283-hsu25a %I PMLR %P 1353--1364 %U https://proceedings.mlr.press/v283/hsu25a.html %V 283 %X Cooperative multi-agent reinforcement learning (MARL) has shown significant promise in dynamic control environments, where effective communication and tailored exploration strategies facilitate collaboration. However, ensuring safe exploration remains challenging, as even a single unsafe action from any agent can lead to severe consequences. To mitigate this risk, we introduce Scoop-LSVI, a UCB-based cooperative parallel RL framework that achieves low cumulative regret with minimal communication demands while adhering to safety constraints. This framework enables multiple agents to concurrently solve isolated Markov Decision Processes (MDPs) and share information to enhance learning efficiency. Scoop-LSVI attains a regret of $\tilde{O}(\kappa d^{3/2} H^2 \sqrt{MK})$, where $d$ is the feature dimension, $H$ is the horizon length, $M$ is the number of agents, $K$ is the number of episodes for each agent, and $\kappa$ represents safety constraints. This result aligns with state-of-the-art findings for unsafe cooperative MARL and also matches the regret bounds of UCB-based single-agent RL algorithms ($M = 1$), highlighting the potential of Scoop-LSVI to support safe and efficient learning in cooperative MARL applications.
APA
Hsu, H. & Pajic, M. (2025). Safe Cooperative Multi-Agent Reinforcement Learning with Function Approximation. Proceedings of the 7th Annual Learning for Dynamics & Control Conference, in Proceedings of Machine Learning Research 283:1353-1364. Available from https://proceedings.mlr.press/v283/hsu25a.html.
