Adaptive Bandits: Towards the best history-dependent strategy

Odalric-Ambrym Maillard, Rémi Munos
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR 15:570-578, 2011.

Abstract

We consider multi-armed bandit games with possibly adaptive opponents. We introduce models $\Theta$ of constraints based on equivalence classes on the common history (the information shared by the player and the opponent), which define two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes defined by some model $\theta^* \in \Theta$. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in $\Theta$. This allows us to model opponents (case 1) or strategies (case 2) that capture finite memory, periodicity, standard stochastic bandits, and other situations. When $\Theta = \{\theta\}$, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time $T$) bounded by $\tilde{O}(\sqrt{TAC})$, where $A$ is the number of actions and $C$ is the number of classes of $\theta$. When many models are available, however, all known algorithms achieving a regret of order $O(\sqrt{T})$ are intractable and scale poorly with the number of models $|\Theta|$. Our contribution here is to provide tractable algorithms with regret bounded by $T^{2/3}C^{1/3}\log(|\Theta|)^{1/2}$.
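
For intuition only (this page states the bounds, not the algorithms): in the single-model case $\Theta = \{\theta\}$, one construction consistent with the $\tilde{O}(\sqrt{TAC})$ bound is to run an independent adversarial-bandit learner, such as Exp3, on each of the $C$ equivalence classes, updating at each round only the learner of the class the current history falls into. If class $c$ is visited $T_c$ times, Exp3 suffers regret $O(\sqrt{T_c A \log A})$ there, and Cauchy-Schwarz gives $\sum_c \sqrt{T_c A \log A} \le \sqrt{TCA \log A}$, i.e. $\tilde{O}(\sqrt{TAC})$. The Python sketch below is illustrative and not taken from the paper; class_of_history and reward_fn are hypothetical stand-ins for the model's class map and the (possibly adaptive) opponent.

import math
import random

class Exp3:
    # Exp3 for adversarial bandits (Auer et al., 2002); rewards in [0, 1].
    def __init__(self, n_arms, gamma):
        self.n_arms = n_arms
        self.gamma = gamma                    # exploration rate in (0, 1]
        self.weights = [1.0] * n_arms

    def probs(self):
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        # Sample an arm from the current distribution; return (arm, its prob).
        probs = self.probs()
        r, acc = random.random(), 0.0
        for arm, p in enumerate(probs):
            acc += p
            if r <= acc:
                return arm, p
        return self.n_arms - 1, probs[-1]     # guard against rounding error

    def update(self, arm, prob, reward):
        # Importance-weighted estimate of the chosen arm's reward.
        self.weights[arm] *= math.exp(self.gamma * (reward / prob) / self.n_arms)

def play(T, n_arms, n_classes, class_of_history, reward_fn):
    # One Exp3 instance per equivalence class of the common history.
    # Heuristic tuning: treat each class as a bandit of horizon ~ T / C.
    gamma = min(1.0, math.sqrt(n_arms * math.log(n_arms)
                               / ((math.e - 1.0) * max(T / n_classes, 1.0))))
    bandits = [Exp3(n_arms, gamma) for _ in range(n_classes)]
    history, total = [], 0.0
    for _ in range(T):
        c = class_of_history(history)         # class index in {0, ..., C-1}
        arm, p = bandits[c].select()
        reward = reward_fn(history, arm)      # opponent's reward in [0, 1]
        bandits[c].update(arm, p, reward)
        history.append((arm, reward))
        total += reward
    return total

For instance, class_of_history = lambda h: len(h) % n_classes models a periodic opponent with period equal to n_classes, while a map depending only on the last played action models an order-1 finite-memory opponent.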

Cite this Paper


BibTeX
@InProceedings{pmlr-v15-odalric11a,
  title     = {Adaptive Bandits: Towards the best history-dependent strategy},
  author    = {Maillard, Odalric-Ambrym and Munos, Rémi},
  booktitle = {Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics},
  pages     = {570--578},
  year      = {2011},
  editor    = {Gordon, Geoffrey and Dunson, David and Dudík, Miroslav},
  volume    = {15},
  series    = {Proceedings of Machine Learning Research},
  address   = {Fort Lauderdale, FL, USA},
  month     = {11--13 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v15/odalric11a/odalric11a.pdf},
  url       = {https://proceedings.mlr.press/v15/odalric11a.html},
  abstract  = {We consider multi-armed bandit games with possibly adaptive opponents. We introduce models $\Theta$ of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes defined by some model $\theta^* \in \Theta$. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in $\Theta$. This allows to model opponents (case 1) or strategies (case 2) which handles finite memory, periodicity, standard stochastic bandits and other situations. When $\Theta=\{ \theta \}$, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time $T$) bounded by $\tilde{O}(\sqrt{TAC})$, where $C$ is the number of classes of $\theta$. Now, when many models are available, all known algorithms achieving a nice regret $O(\sqrt{T})$ are unfortunately not tractable and scale poorly with the number of models $|\Theta|$. Our contribution here is to provide tractable algorithms with regret bounded by $T^{2/3}C^{1/3}\log(|\Theta|)^{1/2}$.}
}
APA
Maillard, O.-A. & Munos, R. (2011). Adaptive Bandits: Towards the best history-dependent strategy. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 15:570-578. Available from https://proceedings.mlr.press/v15/odalric11a.html.
