Safe Cooperative Multi-Agent Reinforcement Learning with Function Approximation
Proceedings of the 7th Annual Learning for Dynamics & Control Conference, PMLR 283:1353-1364, 2025.
Abstract
Cooperative multi-agent reinforcement learning (MARL) has shown significant promise in dynamic control environments, where effective communication and tailored exploration strategies facilitate collaboration. However, ensuring safe exploration remains challenging, as even a single unsafe action from any agent can lead to severe consequences. To mitigate this risk, we introduce Scoop-LSVI, a UCB-based cooperative parallel RL framework that achieves low cumulative regret with minimal communication demands while adhering to safety constraints. This framework enables multiple agents to concurrently solve isolated Markov Decision Processes (MDPs) and share information to enhance learning efficiency. Scoop-LSVI attains a regret of $\tilde{O}(\kappa d^{3/2} H^2 \sqrt{MK})$, where $d$ is the feature dimension, $H$ is the horizon length, $M$ is the number of agents, $K$ is the number of episodes per agent, and $\kappa$ is a factor capturing the safety constraints. This result matches state-of-the-art bounds for cooperative MARL without safety constraints and also matches the regret bounds of UCB-based single-agent RL algorithms ($M = 1$), highlighting the potential of Scoop-LSVI to support safe and efficient learning in cooperative MARL applications.
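To make the "UCB-based" ingredient concrete, the sketch below shows a generic least-squares value-iteration backup with an exploration bonus, the standard LSVI-UCB building block for linear function approximation that frameworks of this kind are built on. It is a minimal illustration only: the function name, parameters, and structure are our assumptions, and the safety filtering and inter-agent communication that are specific to Scoop-LSVI are not shown.

```python
import numpy as np

def lsvi_ucb_backup(phis, targets, d, beta, lam=1.0, H=10):
    """One least-squares value-iteration step with a UCB bonus.

    Generic LSVI-UCB backup for linear MDPs (illustrative sketch, not the
    paper's algorithm). `phis` holds the d-dimensional features of observed
    (state, action) pairs, and `targets` the regression targets
    r + max_a Q_{h+1}(s', a) from the next stage.
    """
    Lambda = lam * np.eye(d) + phis.T @ phis           # regularized covariance matrix
    w = np.linalg.solve(Lambda, phis.T @ targets)      # ridge-regression weight vector

    def q_value(phi):
        # UCB exploration bonus, proportional to the feature's uncertainty
        bonus = beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))
        return min(phi @ w + bonus, H)                 # truncate at the horizon

    return q_value
```

Under this view, each of the $M$ agents would run such backups on its own episodes and occasionally exchange sufficient statistics (e.g., the covariance matrix and feature-weighted targets), which is the kind of information sharing the abstract refers to when it credits low communication demands alongside the stated regret.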