Gradient-free Online Learning in Continuous Games with Delayed Rewards

Amélie Héliou, Panayotis Mertikopoulos, Zhengyuan Zhou
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:4172-4181, 2020.

Abstract

Motivated by applications to online advertising and recommender systems, we consider a game-theoretic model with delayed rewards and asynchronous, payoff-based feedback. In contrast to previous work on delayed multi-armed bandits, we focus on games with continuous action spaces, and we examine the long-run behavior of strategic agents that follow a no-regret learning policy (but are otherwise oblivious to the game being played, the objectives of their opponents, etc.). To account for the lack of a consistent stream of information (for instance, rewards can arrive out of order and with an a priori unbounded delay), we introduce a gradient-free learning policy where payoff information is placed in a priority queue as it arrives. Somewhat surprisingly, we find that under a standard diagonal concavity assumption, the induced sequence of play converges to Nash Equilibrium (NE) with probability 1, even if the delay between choosing an action and receiving the corresponding reward is unbounded.
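To make the feedback model concrete, the sketch below shows one way a single learner could buffer delayed, out-of-order payoffs in a priority queue keyed by the round in which the corresponding action was played, and then take a generic one-point (SPSA-style) payoff-based step once a payoff is consumed. This is an illustrative assumption, not the paper's actual policy: the class and method names (DelayedBanditLearner, act, receive) are invented here, and the one-point estimator merely stands in for "some gradient-free update driven by received payoffs."

```python
import heapq
import numpy as np

class DelayedBanditLearner:
    """Hypothetical sketch (not the authors' algorithm): a single agent on a
    compact action interval [lo, hi] whose payoffs arrive with arbitrary,
    possibly out-of-order delays. Arriving payoffs go into a min-heap keyed
    by the round in which the action was played, so they are consumed in
    play order; each consumed payoff triggers a generic one-point,
    payoff-based (gradient-free) step."""

    def __init__(self, lo=0.0, hi=1.0, x0=0.5, seed=0):
        self.lo, self.hi = lo, hi
        self.x = x0                 # current base action
        self.t = 0                  # rounds played so far
        self.queue = []             # min-heap of (play_round, payoff)
        self.pending = {}           # play_round -> (direction, radius)
        self.rng = np.random.default_rng(seed)

    def act(self):
        """Play a randomly perturbed action and remember the perturbation."""
        self.t += 1
        delta = 1.0 / np.sqrt(self.t + 1)     # shrinking sampling radius
        z = self.rng.choice([-1.0, 1.0])      # perturbation direction
        self.pending[self.t] = (z, delta)
        action = float(np.clip(self.x + delta * z, self.lo, self.hi))
        return self.t, action

    def receive(self, play_round, payoff):
        """A payoff for round `play_round` arrives, possibly much later."""
        heapq.heappush(self.queue, (play_round, payoff))
        self._drain()

    def _drain(self):
        """Consume queued payoffs in play order; one gradient-free step each."""
        while self.queue:
            play_round, payoff = heapq.heappop(self.queue)
            z, delta = self.pending.pop(play_round)   # assumes one payoff per played round
            grad_est = (payoff / delta) * z           # one-point payoff-based estimate
            step = 1.0 / (play_round + 1)             # vanishing step size
            self.x = float(np.clip(self.x + step * grad_est, self.lo, self.hi))
```

In use, the environment would call act() every round and call receive(play_round, payoff) whenever the matching reward finally arrives, potentially many rounds later and out of order; the heap simply guarantees that whatever payoffs have arrived are processed in the order their actions were played.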

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-heliou20a,
  title     = {Gradient-free Online Learning in Continuous Games with Delayed Rewards},
  author    = {H{\'e}liou, Am{\'e}lie and Mertikopoulos, Panayotis and Zhou, Zhengyuan},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {4172--4181},
  year      = {2020},
  editor    = {III, Hal Daum{\'e} and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/heliou20a/heliou20a.pdf},
  url       = {https://proceedings.mlr.press/v119/heliou20a.html}
}
Endnote
%0 Conference Paper
%T Gradient-free Online Learning in Continuous Games with Delayed Rewards
%A Amélie Héliou
%A Panayotis Mertikopoulos
%A Zhengyuan Zhou
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-heliou20a
%I PMLR
%P 4172--4181
%U https://proceedings.mlr.press/v119/heliou20a.html
%V 119
APA
Héliou, A., Mertikopoulos, P. & Zhou, Z. (2020). Gradient-free Online Learning in Continuous Games with Delayed Rewards. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:4172-4181. Available from https://proceedings.mlr.press/v119/heliou20a.html.