Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards

Varun Kanade; H. Brendan McMahan; Brent Bryan

Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, PMLR 5:272-279, 2009.

Abstract

We consider algorithms for selecting actions in order to maximize rewards chosen by an adversary, where the set of actions available on any given round is selected stochastically. We present the first polynomial-time no-regret algorithms for this setting. In the full-observation (experts) version of the problem, we present an exponential-weights algorithm that achieves regret $\mathcal{O}(\sqrt{T \log n})$, which is the best possible. For the bandit setting (where the algorithm only observes the reward of the action selected), we present a no-regret algorithm based on follow-the-perturbed-leader. This algorithm runs in polynomial time, unlike the EXP4 algorithm which can also be applied to this setting. Our algorithm has the interesting interpretation of solving a geometric experts problem where the embedding in which rewards are linear is never explicitly constructed. We argue that this adversarial-reward, stochastic availability formulation is important in practice, as assuming stationary stochastic rewards is unrealistic in many domains.

Cite this Paper

BibTeX


@InProceedings{pmlr-v5-kanade09a,
  title = 	 {Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards},
  author = 	 {Kanade, Varun and McMahan, H. Brendan and Bryan, Brent},
  booktitle = 	 {Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics},
  pages = 	 {272--279},
  year = 	 {2009},
  editor = 	 {van Dyk, David and Welling, Max},
  volume = 	 {5},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA},
  month = 	 {16--18 Apr},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v5/kanade09a/kanade09a.pdf},
  url = 	 {https://proceedings.mlr.press/v5/kanade09a.html},
  abstract = 	 {We consider algorithms for selecting actions in order to maximize rewards chosen by an adversary, where the set of actions available on any given round is selected stochastically. We present the first polynomial-time no-regret algorithms for this setting. In the full-observation (experts) version of the problem, we present an exponential-weights algorithm that achieves regret $\mathcal{O}(\sqrt{T \log n})$, which is the best possible. For the bandit setting (where the algorithm only observes the reward of the action selected), we present a no-regret algorithm based on follow-the-perturbed-leader. This algorithm runs in polynomial time, unlike the EXP4 algorithm which can also be applied to this setting. Our algorithm has the interesting interpretation of solving a geometric experts problem where the embedding in which rewards are linear is never explicitly constructed. We argue that this adversarial-reward, stochastic availability formulation is important in practice, as assuming stationary stochastic rewards is unrealistic in many domains.}
}

Endnote

%0 Conference Paper
%T Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards
%A Varun Kanade
%A H. Brendan McMahan
%A Brent Bryan
%B Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2009
%E David van Dyk
%E Max Welling
%F pmlr-v5-kanade09a
%I PMLR
%P 272--279
%U https://proceedings.mlr.press/v5/kanade09a.html
%V 5
%X We consider algorithms for selecting actions in order to maximize rewards chosen by an adversary, where the set of actions available on any given round is selected stochastically. We present the first polynomial-time no-regret algorithms for this setting. In the full-observation (experts) version of the problem, we present an exponential-weights algorithm that achieves regret $\mathcal{O}(\sqrt{T \log n})$, which is the best possible. For the bandit setting (where the algorithm only observes the reward of the action selected), we present a no-regret algorithm based on follow-the-perturbed-leader. This algorithm runs in polynomial time, unlike the EXP4 algorithm which can also be applied to this setting. Our algorithm has the interesting interpretation of solving a geometric experts problem where the embedding in which rewards are linear is never explicitly constructed. We argue that this adversarial-reward, stochastic availability formulation is important in practice, as assuming stationary stochastic rewards is unrealistic in many domains.

RIS


TY  - CPAPER
TI  - Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards
AU  - Varun Kanade
AU  - H. Brendan McMahan
AU  - Brent Bryan
BT  - Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics
DA  - 2009/04/15
ED  - David van Dyk
ED  - Max Welling
ID  - pmlr-v5-kanade09a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 5
SP  - 272
EP  - 279
L1  - http://proceedings.mlr.press/v5/kanade09a/kanade09a.pdf
UR  - https://proceedings.mlr.press/v5/kanade09a.html
AB  - We consider algorithms for selecting actions in order to maximize rewards chosen by an adversary, where the set of actions available on any given round is selected stochastically. We present the first polynomial-time no-regret algorithms for this setting. In the full-observation (experts) version of the problem, we present an exponential-weights algorithm that achieves regret $\mathcal{O}(\sqrt{T \log n})$, which is the best possible. For the bandit setting (where the algorithm only observes the reward of the action selected), we present a no-regret algorithm based on follow-the-perturbed-leader. This algorithm runs in polynomial time, unlike the EXP4 algorithm which can also be applied to this setting. Our algorithm has the interesting interpretation of solving a geometric experts problem where the embedding in which rewards are linear is never explicitly constructed. We argue that this adversarial-reward, stochastic availability formulation is important in practice, as assuming stationary stochastic rewards is unrealistic in many domains.
ER  -

APA


Kanade, V., McMahan, H. B., & Bryan, B. (2009). Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 5:272-279. Available from https://proceedings.mlr.press/v5/kanade09a.html.
