Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards

Varun Kanade, H. Brendan McMahan, Brent Bryan
Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, PMLR 5:272-279, 2009.

Abstract

We consider algorithms for selecting actions in order to maximize rewards chosen by an adversary, where the set of actions available on any given round is selected stochastically. We present the first polynomial-time no-regret algorithms for this setting. In the full-observation (experts) version of the problem, we present an exponential-weights algorithm that achieves regret $\mathcal{O}(\sqrt{T \log n})$, which is the best possible. For the bandit setting (where the algorithm only observes the reward of the action selected), we present a no-regret algorithm based on follow-the-perturbed-leader. This algorithm runs in polynomial time, unlike the EXP4 algorithm, which can also be applied to this setting. Our algorithm has the interesting interpretation of solving a geometric experts problem where the embedding in which rewards are linear is never explicitly constructed. We argue that this adversarial-reward, stochastic availability formulation is important in practice, as assuming stationary stochastic rewards is unrealistic in many domains.
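For intuition only, below is a minimal Python sketch of an exponential-weights update restricted to the stochastically available ("awake") actions in the full-observation setting. The environment callbacks draw_available and get_rewards are hypothetical stand-ins, and the learning rate and normalization follow the generic Hedge template; this is an illustration of the sleeping-experts mechanics, not the paper's exact algorithm or analysis.

import math
import random

def sleeping_exp_weights(n, T, draw_available, get_rewards, eta=None):
    """Illustrative exponential-weights sketch for sleeping experts.

    n              -- number of experts/actions
    T              -- number of rounds
    draw_available -- callable returning the (stochastic) set of awake actions;
                      assumed non-empty each round (hypothetical interface)
    get_rewards    -- callable(t, available) -> dict of rewards in [0, 1] for the
                      awake actions, chosen adversarially (hypothetical interface)
    eta            -- learning rate; sqrt(log(n) / T) is a standard default
    """
    if eta is None:
        eta = math.sqrt(math.log(n) / T)
    log_w = [0.0] * n                      # log-weights, for numerical stability
    total_reward = 0.0
    for t in range(T):
        available = list(draw_available())
        # Restrict the weight distribution to the awake actions and renormalize.
        m = max(log_w[i] for i in available)
        probs = [math.exp(log_w[i] - m) for i in available]
        z = sum(probs)
        probs = [p / z for p in probs]
        choice = random.choices(available, weights=probs, k=1)[0]
        # Full-observation setting: rewards of all awake actions are revealed.
        rewards = get_rewards(t, available)
        total_reward += rewards[choice]
        # Multiplicative update on awake actions only; sleeping actions keep their weights.
        for i in available:
            log_w[i] += eta * rewards[i]
    return total_reward

The key departure from standard Hedge is that the distribution is renormalized over the awake actions each round and only those actions' weights are updated; the paper's algorithm and regret analysis (the $\mathcal{O}(\sqrt{T \log n})$ bound) concern this stochastic-availability, adversarial-reward setting, which the sketch above only gestures at.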

Cite this Paper


BibTeX
@InProceedings{pmlr-v5-kanade09a,
  title     = {Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards},
  author    = {Kanade, Varun and McMahan, H. Brendan and Bryan, Brent},
  booktitle = {Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics},
  pages     = {272--279},
  year      = {2009},
  editor    = {van Dyk, David and Welling, Max},
  volume    = {5},
  series    = {Proceedings of Machine Learning Research},
  address   = {Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA},
  month     = {16--18 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v5/kanade09a/kanade09a.pdf},
  url       = {https://proceedings.mlr.press/v5/kanade09a.html}
}
APA
Kanade, V., McMahan, H. B., & Bryan, B. (2009). Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 5:272-279. Available from https://proceedings.mlr.press/v5/kanade09a.html.
