Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards

Varun Kanade, H. Brendan McMahan, Brent Bryan
; Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, PMLR 5:272-279, 2009.

Abstract

We consider algorithms for selecting actions in order to maximize rewards chosen by an adversary, where the set of actions available on any given round is selected stochastically. We present the first polynomial-time no-regret algorithms for this setting. In the full-observation (experts) version of the problem, we present an exponential-weights algorithm that achieves regret $O(\sqrt{T \log n})$, which is the best possible. For the bandit setting (where the algorithm only observes the reward of the action selected), we present a no-regret algorithm based on follow-the-perturbed-leader. This algorithm runs in polynomial time, unlike the EXP4 algorithm which can also be applied to this setting. Our algorithm has the interesting interpretation of solving a geometric experts problem where the embedding in which rewards are linear is never explicitly constructed. We argue that this adversarial-reward, stochastic availability formulation is important in practice, as assuming stationary stochastic rewards is unrealistic in many domains.
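For intuition, the full-information algorithm is an exponential-weights method applied only to the experts that happen to be awake on each round. The following NumPy sketch illustrates that general flavor of update (weights kept for all experts, sampling and updates restricted to the awake set); it is a rough illustration under assumed conventions, not the paper's exact algorithm or benchmark, and the function name, learning rate eta, and reward/availability encoding are illustrative choices.

```python
import numpy as np

def sleeping_exp_weights(rewards, available, eta=0.1, rng=None):
    """Generic exponential-weights sketch for a sleeping-experts setting.

    rewards:   (T, n) array of adversarial rewards in [0, 1].
    available: (T, n) boolean array; available[t, i] is True if expert i
               is awake (available) on round t.
    Returns the total reward collected by the learner.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, n = rewards.shape
    log_w = np.zeros(n)          # log-weights, one per expert
    total = 0.0
    for t in range(T):
        awake = np.flatnonzero(available[t])
        if awake.size == 0:      # no action available this round
            continue
        # Play an awake expert with probability proportional to its weight.
        logits = log_w[awake]
        p = np.exp(logits - logits.max())
        p /= p.sum()
        choice = rng.choice(awake, p=p)
        total += rewards[t, choice]
        # Full-information update: rewards of all awake experts are observed.
        log_w[awake] += eta * rewards[t, awake]
    return total
```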

Cite this Paper


BibTeX
@InProceedings{pmlr-v5-kanade09a,
  title     = {Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards},
  author    = {Varun Kanade and H. Brendan McMahan and Brent Bryan},
  booktitle = {Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics},
  pages     = {272--279},
  year      = {2009},
  editor    = {David van Dyk and Max Welling},
  volume    = {5},
  series    = {Proceedings of Machine Learning Research},
  address   = {Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA},
  month     = {16--18 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v5/kanade09a/kanade09a.pdf},
  url       = {http://proceedings.mlr.press/v5/kanade09a.html}
}
APA
Kanade, V., McMahan, H.B. & Bryan, B. (2009). Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, in PMLR 5:272-279