A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs

Dirk van der Hoeven; Lukas Zierahn; Tal Lancewicki; Aviv Rosenberg; Nicolò Cesa-Bianchi

A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs

Dirk van der Hoeven, Lukas Zierahn, Tal Lancewicki, Aviv Rosenberg, Nicolò Cesa-Bianchi

Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:1285-1321, 2023.

Abstract

We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, our analysis allows us to obtain new results in three important settings. On the one hand, we derive the first optimal (up to logarithmic factors) regret bounds for combinatorial semi-bandits with delay and adversarial Markov decision processes with delay (and known transition functions). On the other hand, we use our analysis to derive an efficient algorithm for linear bandits with delay achieving near-optimal regret bounds. Our novel regret decomposition shows that FTRL remains stable across multiple rounds under mild assumptions on the Hessian of the regularizer.

Cite this Paper

BibTeX

@InProceedings{pmlr-v195-hoeven23a,
  title = 	 {A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs},
  author =       {van der Hoeven, Dirk and Zierahn, Lukas and Lancewicki, Tal and Rosenberg, Aviv and Cesa-Bianchi, Nicol{\`o}},
  booktitle = 	 {Proceedings of Thirty Sixth Conference on Learning Theory},
  pages = 	 {1285--1321},
  year = 	 {2023},
  editor = 	 {Neu, Gergely and Rosasco, Lorenzo},
  volume = 	 {195},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {12--15 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v195/hoeven23a/hoeven23a.pdf},
  url = 	 {https://proceedings.mlr.press/v195/hoeven23a.html},
  abstract = 	 {We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, our analysis allows us to obtain new results in three important settings. On the one hand, we derive the first optimal (up to logarithmic factors) regret bounds for combinatorial semi-bandits with delay and adversarial Markov decision processes with delay (and known transition functions).   On the other hand, we use our analysis to derive an efficient algorithm for linear bandits with delay achieving near-optimal regret bounds. Our novel regret decomposition shows that FTRL remains stable across multiple rounds under mild assumptions on the Hessian of the regularizer.}
}

Endnote

%0 Conference Paper
%T A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs
%A Dirk van der Hoeven
%A Lukas Zierahn
%A Tal Lancewicki
%A Aviv Rosenberg
%A Nicolò Cesa-Bianchi
%B Proceedings of Thirty Sixth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Gergely Neu
%E Lorenzo Rosasco	
%F pmlr-v195-hoeven23a
%I PMLR
%P 1285--1321
%U https://proceedings.mlr.press/v195/hoeven23a.html
%V 195
%X We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, our analysis allows us to obtain new results in three important settings. On the one hand, we derive the first optimal (up to logarithmic factors) regret bounds for combinatorial semi-bandits with delay and adversarial Markov decision processes with delay (and known transition functions).   On the other hand, we use our analysis to derive an efficient algorithm for linear bandits with delay achieving near-optimal regret bounds. Our novel regret decomposition shows that FTRL remains stable across multiple rounds under mild assumptions on the Hessian of the regularizer.

APA

van der Hoeven, D., Zierahn, L., Lancewicki, T., Rosenberg, A. & Cesa-Bianchi, N.. (2023). A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs. Proceedings of Thirty Sixth Conference on Learning Theory, in Proceedings of Machine Learning Research 195:1285-1321 Available from https://proceedings.mlr.press/v195/hoeven23a.html.

Related Material

Download PDF