A Last Switch Dependent Analysis of Satiation and Seasonality in Bandits

Pierre Laforgue, Giulia Clerici, Nicolò Cesa-Bianchi, Ran Gilad-Bachrach
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:971-990, 2022.

Abstract

Motivated by the fact that humans like some level of unpredictability or novelty, and might therefore get quickly bored when interacting with a stationary policy, we introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions. Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function. This enables the modeling of phenomena such as progressive satiation and periodic behaviours. Building upon the Combinatorial Semi-Bandits (CSB) framework, we design an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy (which is NP-hard to compute). Similarly to previous works, our regret analysis is based on defining and solving an appropriate trade-off between approximation and estimation. Preliminary experiments confirm the superiority of our algorithm over both the oracle greedy approach and a vanilla CSB solver.

Cite this Paper


BibTeX
@InProceedings{pmlr-v151-laforgue22a, title = { A Last Switch Dependent Analysis of Satiation and Seasonality in Bandits }, author = {Laforgue, Pierre and Clerici, Giulia and Cesa-Bianchi, Nicol\`o and Gilad-Bachrach, Ran}, booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics}, pages = {971--990}, year = {2022}, editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel}, volume = {151}, series = {Proceedings of Machine Learning Research}, month = {28--30 Mar}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v151/laforgue22a/laforgue22a.pdf}, url = {https://proceedings.mlr.press/v151/laforgue22a.html}, abstract = { Motivated by the fact that humans like some level of unpredictability or novelty, and might therefore get quickly bored when interacting with a stationary policy, we introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions. Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function. This enables the modeling of phenomena such as progressive satiation and periodic behaviours. Building upon the Combinatorial Semi-Bandits (CSB) framework, we design an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy (which is NP-hard to compute). Similarly to previous works, our regret analysis is based on defining and solving an appropriate trade-off between approximation and estimation. Preliminary experiments confirm the superiority of our algorithm over both the oracle greedy approach and a vanilla CSB solver. } }
Endnote
%0 Conference Paper %T A Last Switch Dependent Analysis of Satiation and Seasonality in Bandits %A Pierre Laforgue %A Giulia Clerici %A Nicolò Cesa-Bianchi %A Ran Gilad-Bachrach %B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2022 %E Gustau Camps-Valls %E Francisco J. R. Ruiz %E Isabel Valera %F pmlr-v151-laforgue22a %I PMLR %P 971--990 %U https://proceedings.mlr.press/v151/laforgue22a.html %V 151 %X Motivated by the fact that humans like some level of unpredictability or novelty, and might therefore get quickly bored when interacting with a stationary policy, we introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions. Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function. This enables the modeling of phenomena such as progressive satiation and periodic behaviours. Building upon the Combinatorial Semi-Bandits (CSB) framework, we design an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy (which is NP-hard to compute). Similarly to previous works, our regret analysis is based on defining and solving an appropriate trade-off between approximation and estimation. Preliminary experiments confirm the superiority of our algorithm over both the oracle greedy approach and a vanilla CSB solver.
APA
Laforgue, P., Clerici, G., Cesa-Bianchi, N. & Gilad-Bachrach, R.. (2022). A Last Switch Dependent Analysis of Satiation and Seasonality in Bandits . Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:971-990 Available from https://proceedings.mlr.press/v151/laforgue22a.html.

Related Material