Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints

Dan Qiao; Yu-Xiang Wang

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:41430-41455, 2024.

Abstract

We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints, a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zero-sum Markov Games, we design a (policy) elimination-based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$, while the batch complexity is only $O(H+\log\log K)$. In the above, $S$ denotes the number of states, $A,B$ are the numbers of actions for the two players respectively, $H$ is the horizon and $K$ is the number of episodes. Furthermore, we prove a batch complexity lower bound of $\Omega(\frac{H}{\log_{A}K}+\log\log K)$ for all algorithms with a $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near-optimal batch complexity. To the best of our knowledge, this is the first line of results towards understanding MARL with low adaptivity.
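
As a rough illustration of why the two batch-complexity bounds agree up to logarithmic factors (a sketch that ignores the constants hidden in $O(\cdot)$ and $\Omega(\cdot)$, and assumes $K \ge A \ge 2$ so that $\log_A K \ge 1$):

$$
H + \log\log K \;=\; \log_A K \cdot \frac{H}{\log_A K} + \log\log K \;\le\; \log_A K \cdot \Big( \frac{H}{\log_A K} + \log\log K \Big).
$$

Under these assumptions, the upper bound exceeds the lower bound by at most a factor of $\log_A K$.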

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-qiao24b,
  title = 	 {Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints},
  author =       {Qiao, Dan and Wang, Yu-Xiang},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {41430--41455},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/qiao24b/qiao24b.pdf},
  url = 	 {https://proceedings.mlr.press/v235/qiao24b.html},
  abstract = 	 {We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints — a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zero-sum Markov Games, we design a (policy) elimination based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$, while the batch complexity is only $O(H+\log\log K)$. In the above, $S$ denotes the number of states, $A,B$ are the number of actions for the two players respectively, $H$ is the horizon and $K$ is the number of episodes. Furthermore, we prove a batch complexity lower bound $\Omega(\frac{H}{\log_{A}K}+\log\log K)$ for all algorithms with $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near optimal batch complexity. To the best of our knowledge, these are the first line of results towards understanding MARL with low adaptivity.}
}

Endnote

%0 Conference Paper
%T Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints
%A Dan Qiao
%A Yu-Xiang Wang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-qiao24b
%I PMLR
%P 41430--41455
%U https://proceedings.mlr.press/v235/qiao24b.html
%V 235
%X We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints — a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zero-sum Markov Games, we design a (policy) elimination based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$, while the batch complexity is only $O(H+\log\log K)$. In the above, $S$ denotes the number of states, $A,B$ are the number of actions for the two players respectively, $H$ is the horizon and $K$ is the number of episodes. Furthermore, we prove a batch complexity lower bound $\Omega(\frac{H}{\log_{A}K}+\log\log K)$ for all algorithms with $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near optimal batch complexity. To the best of our knowledge, these are the first line of results towards understanding MARL with low adaptivity.

APA


Qiao, D. & Wang, Y. (2024). Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:41430-41455. Available from https://proceedings.mlr.press/v235/qiao24b.html.
