Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions

Ku-Chun Chou; Hsuan-Tien Lin; Chao-Kai Chiang; Chi-Jen Lu

Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions

Ku-Chun Chou, Hsuan-Tien Lin, Chao-Kai Chiang, Chi-Jen Lu

Proceedings of the Sixth Asian Conference on Machine Learning, PMLR 39:344-359, 2015.

Abstract

We study the contextual bandit problem with linear payoff functions, which is a generalization of the traditional multi-armed bandit problem. In the contextual bandit problem, the learner needs to iteratively select an action based on an observed context, and receives a linear score on only the selected action as the reward feedback. Motivated by the observation that better performance is achievable if the other rewards on the non-selected actions can also be revealed to the learner, we propose a new framework that feeds the learner with pseudo-rewards, which are estimates of the rewards on the non-selected actions. We argue that the pseudo-rewards should better contain over-estimates of the true rewards, and propose a forgetting mechanism to decrease the negative influence of the over-estimation in the long run. Then, we couple the two key ideas above with the linear upper confidence bound (LinUCB) algorithm to design a novel algorithm called linear pseudo-reward upper confidence bound (LinPRUCB). We prove that LinPRUCB shares the same order of regret bound to LinUCB, while enjoying the practical observation of faster reward-gathering in the earlier iterations. Experiments on artificial and real-world data sets justify that LinPRUCB is competitive to and sometimes even better than LinUCB. Furthermore, we couple LinPRUCB with a special parameter to formalize a new algorithm that yields faster computation in updating the internal models while keeping the promising practical performance. The two properties match the real-world needs of the contextual bandit problem and make the new algorithm a favorable choice in practice.

Cite this Paper

BibTeX


@InProceedings{pmlr-v39-chou14,
  title = 	 {Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions},
  author = 	 {Chou, Ku-Chun and Lin, Hsuan-Tien and Chiang, Chao-Kai and Lu, Chi-Jen},
  booktitle = 	 {Proceedings of the Sixth Asian Conference on Machine Learning},
  pages = 	 {344--359},
  year = 	 {2015},
  editor = 	 {Phung, Dinh and Li, Hang},
  volume = 	 {39},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Nha Trang City, Vietnam},
  month = 	 {26--28 Nov},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v39/chou14.pdf},
  url = 	 {https://proceedings.mlr.press/v39/chou14.html},
  abstract = 	 {We study the contextual bandit problem with linear payoff functions, which is a generalization of the traditional multi-armed bandit problem. In the contextual bandit problem, the learner needs to iteratively select an action based on an observed context, and receives a linear score on only the selected action as the reward feedback. Motivated by the observation that better performance is achievable if the other rewards on the non-selected actions can also be revealed to the learner, we propose a new framework that feeds the learner with pseudo-rewards, which are estimates of the rewards on the non-selected actions. We argue that the pseudo-rewards should better contain over-estimates of the true rewards, and propose a forgetting mechanism to decrease the negative influence of the over-estimation in the long run. Then, we couple the two key ideas above with the linear upper confidence bound (LinUCB) algorithm to design a novel algorithm called linear pseudo-reward upper confidence bound (LinPRUCB). We prove that LinPRUCB shares the same order of regret bound to LinUCB, while enjoying the practical observation of faster reward-gathering in the earlier iterations. Experiments on artificial and real-world data sets justify that LinPRUCB is competitive to and sometimes even better than LinUCB. Furthermore, we couple LinPRUCB with a special parameter to formalize a new algorithm that yields faster computation in updating the internal models while keeping the promising practical performance. The two properties match the real-world needs of the contextual bandit problem and make the new algorithm a favorable choice in practice.}
}

Endnote

%0 Conference Paper
%T Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions
%A Ku-Chun Chou
%A Hsuan-Tien Lin
%A Chao-Kai Chiang
%A Chi-Jen Lu
%B Proceedings of the Sixth Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2015
%E Dinh Phung
%E Hang Li	
%F pmlr-v39-chou14
%I PMLR
%P 344--359
%U https://proceedings.mlr.press/v39/chou14.html
%V 39
%X We study the contextual bandit problem with linear payoff functions, which is a generalization of the traditional multi-armed bandit problem. In the contextual bandit problem, the learner needs to iteratively select an action based on an observed context, and receives a linear score on only the selected action as the reward feedback. Motivated by the observation that better performance is achievable if the other rewards on the non-selected actions can also be revealed to the learner, we propose a new framework that feeds the learner with pseudo-rewards, which are estimates of the rewards on the non-selected actions. We argue that the pseudo-rewards should better contain over-estimates of the true rewards, and propose a forgetting mechanism to decrease the negative influence of the over-estimation in the long run. Then, we couple the two key ideas above with the linear upper confidence bound (LinUCB) algorithm to design a novel algorithm called linear pseudo-reward upper confidence bound (LinPRUCB). We prove that LinPRUCB shares the same order of regret bound to LinUCB, while enjoying the practical observation of faster reward-gathering in the earlier iterations. Experiments on artificial and real-world data sets justify that LinPRUCB is competitive to and sometimes even better than LinUCB. Furthermore, we couple LinPRUCB with a special parameter to formalize a new algorithm that yields faster computation in updating the internal models while keeping the promising practical performance. The two properties match the real-world needs of the contextual bandit problem and make the new algorithm a favorable choice in practice.

RIS


TY  - CPAPER
TI  - Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions
AU  - Ku-Chun Chou
AU  - Hsuan-Tien Lin
AU  - Chao-Kai Chiang
AU  - Chi-Jen Lu
BT  - Proceedings of the Sixth Asian Conference on Machine Learning
DA  - 2015/02/16
ED  - Dinh Phung
ED  - Hang Li	
ID  - pmlr-v39-chou14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 39
SP  - 344
EP  - 359
L1  - http://proceedings.mlr.press/v39/chou14.pdf
UR  - https://proceedings.mlr.press/v39/chou14.html
AB  - We study the contextual bandit problem with linear payoff functions, which is a generalization of the traditional multi-armed bandit problem. In the contextual bandit problem, the learner needs to iteratively select an action based on an observed context, and receives a linear score on only the selected action as the reward feedback. Motivated by the observation that better performance is achievable if the other rewards on the non-selected actions can also be revealed to the learner, we propose a new framework that feeds the learner with pseudo-rewards, which are estimates of the rewards on the non-selected actions. We argue that the pseudo-rewards should better contain over-estimates of the true rewards, and propose a forgetting mechanism to decrease the negative influence of the over-estimation in the long run. Then, we couple the two key ideas above with the linear upper confidence bound (LinUCB) algorithm to design a novel algorithm called linear pseudo-reward upper confidence bound (LinPRUCB). We prove that LinPRUCB shares the same order of regret bound to LinUCB, while enjoying the practical observation of faster reward-gathering in the earlier iterations. Experiments on artificial and real-world data sets justify that LinPRUCB is competitive to and sometimes even better than LinUCB. Furthermore, we couple LinPRUCB with a special parameter to formalize a new algorithm that yields faster computation in updating the internal models while keeping the promising practical performance. The two properties match the real-world needs of the contextual bandit problem and make the new algorithm a favorable choice in practice.
ER  -

APA


Chou, K., Lin, H., Chiang, C. & Lu, C.. (2015). Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions. Proceedings of the Sixth Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 39:344-359 Available from https://proceedings.mlr.press/v39/chou14.html.

Related Material

Download PDF