An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives

Shipra Agrawal; Nikhil R. Devanur; Lihong Li

An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives

Shipra Agrawal, Nikhil R. Devanur, Lihong Li

29th Annual Conference on Learning Theory, PMLR 49:4-18, 2016.

Abstract

We consider a contextual version of multi-armed bandit problem with global knapsack constraints. In each round, the outcome of pulling an arm is a scalar reward and a resource consumption vector, both dependent on the context, and the global knapsack constraints require the total consumption for each resource to be below some pre-fixed budget. The learning agent competes with an arbitrary set of context-dependent policies. This problem was introduced by Badanidiyuru et al., who gave a computationally inefficient algorithm with near-optimal regret bounds for it. We give a \emphcomputationally efficient algorithm for this problem with slightly better regret bounds, by generalizing the approach of Dudik et al. for the non-constrained version of the problem. The computational time of our algorithm scales \emphlogarithmically in the size of the policy space. This answers the main open question of Badanidiyuru et al. We also extend our results to a variant where there are no knapsack constraints but the objective is an arbitrary Lipschitz concave function of the sum of outcome vectors.

Cite this Paper

BibTeX


@InProceedings{pmlr-v49-agrawal16,
  title = 	 {An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives},
  author = 	 {Agrawal, Shipra and Devanur, Nikhil R. and Li, Lihong},
  booktitle = 	 {29th Annual Conference on Learning Theory},
  pages = 	 {4--18},
  year = 	 {2016},
  editor = 	 {Feldman, Vitaly and Rakhlin, Alexander and Shamir, Ohad},
  volume = 	 {49},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Columbia University, New York, New York, USA},
  month = 	 {23--26 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v49/agrawal16.pdf},
  url = 	 {https://proceedings.mlr.press/v49/agrawal16.html},
  abstract = 	 {We consider a contextual version of multi-armed bandit problem with global knapsack constraints. In each round, the outcome of pulling an arm is a scalar reward and a resource consumption vector, both dependent on the context, and the global knapsack constraints require the total consumption for each resource to be below some pre-fixed budget. The learning agent competes with an arbitrary set of context-dependent policies. This problem was introduced by Badanidiyuru et al., who gave a computationally inefficient algorithm with near-optimal regret bounds for it.  We give a \emphcomputationally efficient algorithm for this problem with slightly better regret bounds, by generalizing the approach of Dudik et al. for the non-constrained version of the problem. The computational time of our algorithm scales \emphlogarithmically in the size of the policy space. This answers the main open question of Badanidiyuru et al. We also extend our results to a variant where there are no knapsack constraints but the objective is an arbitrary Lipschitz concave function of the sum of outcome vectors. }
}

Endnote

%0 Conference Paper
%T An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives
%A Shipra Agrawal
%A Nikhil R. Devanur
%A Lihong Li
%B 29th Annual Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2016
%E Vitaly Feldman
%E Alexander Rakhlin
%E Ohad Shamir	
%F pmlr-v49-agrawal16
%I PMLR
%P 4--18
%U https://proceedings.mlr.press/v49/agrawal16.html
%V 49
%X We consider a contextual version of multi-armed bandit problem with global knapsack constraints. In each round, the outcome of pulling an arm is a scalar reward and a resource consumption vector, both dependent on the context, and the global knapsack constraints require the total consumption for each resource to be below some pre-fixed budget. The learning agent competes with an arbitrary set of context-dependent policies. This problem was introduced by Badanidiyuru et al., who gave a computationally inefficient algorithm with near-optimal regret bounds for it.  We give a \emphcomputationally efficient algorithm for this problem with slightly better regret bounds, by generalizing the approach of Dudik et al. for the non-constrained version of the problem. The computational time of our algorithm scales \emphlogarithmically in the size of the policy space. This answers the main open question of Badanidiyuru et al. We also extend our results to a variant where there are no knapsack constraints but the objective is an arbitrary Lipschitz concave function of the sum of outcome vectors.

RIS


TY  - CPAPER
TI  - An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives
AU  - Shipra Agrawal
AU  - Nikhil R. Devanur
AU  - Lihong Li
BT  - 29th Annual Conference on Learning Theory
DA  - 2016/06/06
ED  - Vitaly Feldman
ED  - Alexander Rakhlin
ED  - Ohad Shamir	
ID  - pmlr-v49-agrawal16
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 49
SP  - 4
EP  - 18
L1  - http://proceedings.mlr.press/v49/agrawal16.pdf
UR  - https://proceedings.mlr.press/v49/agrawal16.html
AB  - We consider a contextual version of multi-armed bandit problem with global knapsack constraints. In each round, the outcome of pulling an arm is a scalar reward and a resource consumption vector, both dependent on the context, and the global knapsack constraints require the total consumption for each resource to be below some pre-fixed budget. The learning agent competes with an arbitrary set of context-dependent policies. This problem was introduced by Badanidiyuru et al., who gave a computationally inefficient algorithm with near-optimal regret bounds for it.  We give a \emphcomputationally efficient algorithm for this problem with slightly better regret bounds, by generalizing the approach of Dudik et al. for the non-constrained version of the problem. The computational time of our algorithm scales \emphlogarithmically in the size of the policy space. This answers the main open question of Badanidiyuru et al. We also extend our results to a variant where there are no knapsack constraints but the objective is an arbitrary Lipschitz concave function of the sum of outcome vectors. 
ER  -

APA


Agrawal, S., Devanur, N.R. & Li, L.. (2016). An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. 29th Annual Conference on Learning Theory, in Proceedings of Machine Learning Research 49:4-18 Available from https://proceedings.mlr.press/v49/agrawal16.html.

Related Material

Download PDF