Multi-Agent Reinforcement Learning with Reward Delays

Yuyang Zhang, Runyu Zhang, Yuantao Gu, Na Li
Proceedings of The 5th Annual Learning for Dynamics and Control Conference, PMLR 211:692-704, 2023.

Abstract

This paper considers multi-agent reinforcement learning (MARL) where rewards are received after delays and the delay time varies across agents and across time steps. Based on the V-learning framework, this paper proposes MARL algorithms that efficiently deal with reward delays. When the delays are finite, our algorithm reaches a coarse correlated equilibrium (CCE) at rate $\tilde{\mathcal{O}}(\frac{H^3\sqrt{S\mathcal{T}_K}}{K}+\frac{H^3\sqrt{SA}}{\sqrt{K}})$, where $K$ is the number of episodes, $H$ is the planning horizon, $S$ is the size of the state space, $A$ is the size of the largest action space, and $\mathcal{T}_K$ is a measure of total delay formally defined in the paper. Moreover, the algorithm is extended to cases with infinite delays through a reward-skipping scheme, achieving a convergence rate similar to that of the finite-delay case.
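
To make the delayed-reward setup concrete, below is a minimal illustrative sketch (in Python) of how rewards that arrive with episode-level delays could be buffered until they become observable, with overly long delays dropped in the spirit of the reward-skipping scheme mentioned above. This is not the paper's V-learning algorithm; the class name DelayedRewardBuffer, the max_delay threshold, and the method names are assumptions made purely for illustration.

```python
# Illustrative sketch only: buffering rewards that arrive with episode-level delays.
# Delays may differ across agents and time steps; rewards whose delay exceeds a
# threshold are skipped, loosely mirroring a reward-skipping idea for unbounded delays.
from collections import defaultdict


class DelayedRewardBuffer:
    """Holds rewards until the episode in which they become observable."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        # arrival episode -> list of (origin episode, reward)
        self.pending = defaultdict(list)

    def add(self, origin_episode, reward, delay):
        """Register a reward generated at `origin_episode` arriving `delay` episodes later."""
        if delay > self.max_delay:
            return  # skip rewards with excessively long delays
        self.pending[origin_episode + delay].append((origin_episode, reward))

    def pop_arrived(self, current_episode):
        """Return all rewards that become observable at `current_episode`."""
        return self.pending.pop(current_episode, [])


if __name__ == "__main__":
    buf = DelayedRewardBuffer(max_delay=3)
    buf.add(origin_episode=0, reward=1.0, delay=2)  # observable at episode 2
    buf.add(origin_episode=1, reward=0.5, delay=5)  # delay too long: skipped
    print(buf.pop_arrived(2))  # [(0, 1.0)]
```

In an episodic learner, the rewards returned by `pop_arrived` would be credited back to the episodes that generated them before performing the value update; how that credit assignment interacts with the learning rate is exactly what determines the dependence on the total delay $\mathcal{T}_K$ in the rate above.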

Cite this Paper


BibTeX
@InProceedings{pmlr-v211-zhang23c,
  title     = {Multi-Agent Reinforcement Learning with Reward Delays},
  author    = {Zhang, Yuyang and Zhang, Runyu and Gu, Yuantao and Li, Na},
  booktitle = {Proceedings of The 5th Annual Learning for Dynamics and Control Conference},
  pages     = {692--704},
  year      = {2023},
  editor    = {Matni, Nikolai and Morari, Manfred and Pappas, George J.},
  volume    = {211},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--16 Jun},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v211/zhang23c/zhang23c.pdf},
  url       = {https://proceedings.mlr.press/v211/zhang23c.html},
  abstract  = {This paper considers multi-agent reinforcement learning (MARL) where the rewards are received after delays and the delay time varies across agents and across time steps. Based on the V-learning framework, this paper proposes MARL algorithms that efficiently deal with reward delays. When the delays are finite, our algorithm reaches a coarse correlated equilibrium (CCE) with rate $\tilde{\mathcal{O}}(\frac{H^3\sqrt{S\mathcal{T}_K}}{K}+\frac{H^3\sqrt{SA}}{\sqrt{K}})$ where $K$ is the number of episodes, $H$ is the planning horizon, $S$ is the size of the state space, $A$ is the size of the largest action space, and $\mathcal{T}_K$ is the measure of total delay formally defined in the paper. Moreover, our algorithm is extended to cases with infinite delays through a reward skipping scheme. It achieves convergence rate similar to the finite delay case.}
}
Endnote
%0 Conference Paper
%T Multi-Agent Reinforcement Learning with Reward Delays
%A Yuyang Zhang
%A Runyu Zhang
%A Yuantao Gu
%A Na Li
%B Proceedings of The 5th Annual Learning for Dynamics and Control Conference
%C Proceedings of Machine Learning Research
%D 2023
%E Nikolai Matni
%E Manfred Morari
%E George J. Pappas
%F pmlr-v211-zhang23c
%I PMLR
%P 692--704
%U https://proceedings.mlr.press/v211/zhang23c.html
%V 211
%X This paper considers multi-agent reinforcement learning (MARL) where the rewards are received after delays and the delay time varies across agents and across time steps. Based on the V-learning framework, this paper proposes MARL algorithms that efficiently deal with reward delays. When the delays are finite, our algorithm reaches a coarse correlated equilibrium (CCE) with rate $\tilde{\mathcal{O}}(\frac{H^3\sqrt{S\mathcal{T}_K}}{K}+\frac{H^3\sqrt{SA}}{\sqrt{K}})$ where $K$ is the number of episodes, $H$ is the planning horizon, $S$ is the size of the state space, $A$ is the size of the largest action space, and $\mathcal{T}_K$ is the measure of total delay formally defined in the paper. Moreover, our algorithm is extended to cases with infinite delays through a reward skipping scheme. It achieves convergence rate similar to the finite delay case.
APA
Zhang, Y., Zhang, R., Gu, Y., & Li, N. (2023). Multi-Agent Reinforcement Learning with Reward Delays. Proceedings of The 5th Annual Learning for Dynamics and Control Conference, in Proceedings of Machine Learning Research 211:692-704. Available from https://proceedings.mlr.press/v211/zhang23c.html.