Cooperative Online Learning in Stochastic and Adversarial MDPs

Tal Lancewicki, Aviv Rosenberg, Yishay Mansour
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:11918-11968, 2022.

Abstract

We study cooperative online learning in stochastic and adversarial Markov decision processes (MDPs). That is, in each episode, $m$ agents interact with an MDP simultaneously and share information in order to minimize their individual regret. We consider environments with two types of randomness: fresh, where each agent's trajectory is sampled i.i.d., and non-fresh, where the realization is shared by all agents (but each agent's trajectory is also affected by its own actions). More precisely, with non-fresh randomness the realization of every cost and transition is fixed at the start of each episode, and agents that take the same action in the same state at the same time observe the same cost and next state. We thoroughly analyze all relevant settings, highlight the challenges and differences between the models, and prove nearly-matching regret lower and upper bounds. To our knowledge, we are the first to consider cooperative reinforcement learning (RL) with either non-fresh randomness or in adversarial MDPs.
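To make the distinction between the two randomness models concrete, below is a minimal Python sketch of one episode with $m$ agents in a small tabular MDP. It is not taken from the paper: the MDP, the uniform-random placeholder policies, and all variable names (S, A, H, m, etc.) are illustrative assumptions. Under fresh randomness every agent draws its own costs and transitions i.i.d.; under non-fresh randomness one realization of every cost and transition is fixed at the start of the episode and shared by all agents, although each agent's trajectory still depends on its own actions.

```python
# Illustrative sketch only (assumed setup, not the paper's algorithm or notation):
# contrasts "fresh" vs. "non-fresh" randomness for m agents in one episode.
import numpy as np

rng = np.random.default_rng(0)

S, A, H, m = 4, 2, 5, 3                         # states, actions, horizon, agents
P = rng.dirichlet(np.ones(S), size=(S, A))      # transition kernel P[s, a] -> distribution over S
cost_mean = rng.uniform(size=(S, A))            # mean cost of each (state, action)


def run_episode(fresh: bool):
    """Roll out one episode for all m agents under uniformly random policies."""
    if not fresh:
        # Non-fresh: fix one realization of every transition and cost up front.
        # Agents visiting the same (step, state, action) observe the same outcome.
        next_state = np.array([[[rng.choice(S, p=P[s, a]) for a in range(A)]
                                for s in range(S)] for _ in range(H)])
        cost_real = (rng.uniform(size=(H, S, A)) < cost_mean).astype(float)

    states = np.zeros(m, dtype=int)             # all agents start in state 0
    total_cost = np.zeros(m)
    for h in range(H):
        actions = rng.integers(A, size=m)       # placeholder policy: uniform actions
        for i in range(m):
            s, a = states[i], actions[i]
            if fresh:
                # Fresh: each agent draws its own cost and next state i.i.d.
                total_cost[i] += float(rng.uniform() < cost_mean[s, a])
                states[i] = rng.choice(S, p=P[s, a])
            else:
                # Non-fresh: look up the realization fixed at the episode start.
                total_cost[i] += cost_real[h, s, a]
                states[i] = next_state[h, s, a]
    return total_cost


print("fresh    :", run_episode(fresh=True))
print("non-fresh:", run_episode(fresh=False))
```

The sketch highlights why the models differ for cooperative learning: with non-fresh randomness the agents' observations within an episode are correlated, so pooling their feedback does not yield the independent samples that fresh randomness provides.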

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-lancewicki22a,
  title     = {Cooperative Online Learning in Stochastic and Adversarial {MDP}s},
  author    = {Lancewicki, Tal and Rosenberg, Aviv and Mansour, Yishay},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {11918--11968},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/lancewicki22a/lancewicki22a.pdf},
  url       = {https://proceedings.mlr.press/v162/lancewicki22a.html}
}
APA
Lancewicki, T., Rosenberg, A. & Mansour, Y. (2022). Cooperative Online Learning in Stochastic and Adversarial MDPs. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:11918-11968. Available from https://proceedings.mlr.press/v162/lancewicki22a.html.