Best Possible Q-Learning

Jiechuan Jiang, Zongqing Lu
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:1895-1908, 2025.

Abstract

Fully decentralized learning, where global information, *i.e.*, the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary while all agents update their policies simultaneously. To tackle this challenge, we propose the *best possible operator*, a novel decentralized operator, and prove that, when there is only one optimal joint policy, the policies of cooperative agents converge to it if each agent independently updates its individual state-action value with the operator. Further, to make the update more efficient and practical, we simplify the operator and prove that convergence and optimality still hold with the simplified one. Instantiating the simplified operator yields a fully decentralized algorithm, *best possible Q-learning* (BQL), which does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.
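
To give a concrete sense of how a fully decentralized update can sidestep non-stationarity, the sketch below shows one possible tabular reading of the "best possible" idea. It is only an illustration under our own assumptions (the class name `BQLAgentSketch`, a per-transition update, and a monotone max update standing in for the simplified operator), not the algorithm as specified in the paper; see the paper's PDF for the exact operator and its convergence proof. Each agent would run its own copy of this class, acting on its local view and never observing the other agents' actions.

```python
import random
from collections import defaultdict


class BQLAgentSketch:
    """Minimal tabular sketch of a 'best possible' style update for one agent.

    Illustrative only: the simplified operator is approximated here by keeping,
    for each (state, own-action) pair, the largest backup value evaluated so
    far, which implicitly optimizes over the unobserved behavior of the other
    agents instead of conditioning on their actions.
    """

    def __init__(self, actions, gamma=0.99, alpha=0.1, epsilon=0.1):
        self.actions = list(actions)      # this agent's own action set only
        self.gamma = gamma                # discount factor
        self.alpha = alpha                # step size for the evaluation estimate
        self.epsilon = epsilon            # exploration rate
        self.q_eval = defaultdict(float)  # running evaluation of sampled transitions
        self.q_best = defaultdict(float)  # monotonically improved "best possible" value

    def act(self, state):
        # Epsilon-greedy on the best-possible value, using only local information.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q_best[(state, a)])

    def update(self, state, action, reward, next_state):
        # One-step backup against the best-possible value of the next state.
        target = reward + self.gamma * max(
            self.q_best[(next_state, a)] for a in self.actions
        )
        self.q_eval[(state, action)] += self.alpha * (target - self.q_eval[(state, action)])
        # Best-possible update: the stored value never drops below any
        # evaluation obtained so far; other agents' actions never appear.
        self.q_best[(state, action)] = max(
            self.q_best[(state, action)], self.q_eval[(state, action)]
        )
```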

Cite this Paper


BibTeX
@InProceedings{pmlr-v286-jiang25c,
  title     = {Best Possible Q-Learning},
  author    = {Jiang, Jiechuan and Lu, Zongqing},
  booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence},
  pages     = {1895--1908},
  year      = {2025},
  editor    = {Chiappa, Silvia and Magliacane, Sara},
  volume    = {286},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--25 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v286/main/assets/jiang25c/jiang25c.pdf},
  url       = {https://proceedings.mlr.press/v286/jiang25c.html},
  abstract  = {Fully decentralized learning, where the global information, \textit{i.e.}, the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents are updating policies simultaneously. To tackle this challenge, we propose \textit{best possible operator}, a novel decentralized operator, and prove that the policies of cooperative agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator when there is only one optimal joint policy. Further, to make the update more efficient and practical, we simplify the operator and prove that the convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, \textit{best possible Q-learning} (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.}
}
Endnote
%0 Conference Paper
%T Best Possible Q-Learning
%A Jiechuan Jiang
%A Zongqing Lu
%B Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2025
%E Silvia Chiappa
%E Sara Magliacane
%F pmlr-v286-jiang25c
%I PMLR
%P 1895--1908
%U https://proceedings.mlr.press/v286/jiang25c.html
%V 286
%X Fully decentralized learning, where the global information, i.e., the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents are updating policies simultaneously. To tackle this challenge, we propose best possible operator, a novel decentralized operator, and prove that the policies of cooperative agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator when there is only one optimal joint policy. Further, to make the update more efficient and practical, we simplify the operator and prove that the convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, best possible Q-learning (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.
APA
Jiang, J. & Lu, Z. (2025). Best Possible Q-Learning. Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 286:1895-1908. Available from https://proceedings.mlr.press/v286/jiang25c.html.