Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions

Ku-Chun Chou, Hsuan-Tien Lin, Chao-Kai Chiang, Chi-Jen Lu
Proceedings of the Sixth Asian Conference on Machine Learning, PMLR 39:344-359, 2015.

Abstract

We study the contextual bandit problem with linear payoff functions, a generalization of the traditional multi-armed bandit problem. In the contextual bandit problem, the learner iteratively selects an action based on an observed context and receives a linear score on only the selected action as reward feedback. Motivated by the observation that better performance is achievable if the rewards on the non-selected actions can also be revealed to the learner, we propose a new framework that feeds the learner with pseudo-rewards, which are estimates of the rewards on the non-selected actions. We argue that the pseudo-rewards should preferably be over-estimates of the true rewards, and we propose a forgetting mechanism to decrease the negative influence of the over-estimation in the long run. We then couple these two key ideas with the linear upper confidence bound (LinUCB) algorithm to design a novel algorithm called linear pseudo-reward upper confidence bound (LinPRUCB). We prove that LinPRUCB enjoys a regret bound of the same order as LinUCB, while empirically gathering rewards faster in earlier iterations. Experiments on artificial and real-world data sets show that LinPRUCB is competitive with, and sometimes better than, LinUCB. Furthermore, we instantiate LinPRUCB with a special parameter setting to obtain a new algorithm that updates its internal models faster while keeping the promising practical performance. These two properties match the real-world needs of the contextual bandit problem and make the new algorithm a favorable choice in practice.
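For concreteness, the sketch below illustrates the pseudo-reward idea on top of a standard per-arm LinUCB learner. It is a minimal illustration under stated assumptions, not the paper's exact pseudocode: it assumes the pseudo-reward fed to a non-selected arm is that arm's own upper-confidence estimate (an over-estimate by construction), and it implements forgetting as exponential discounting (a hypothetical factor gamma) of the statistics accumulated from pseudo-rewards; the actual LinPRUCB updates in the paper may differ.

import numpy as np

class LinPRUCBSketch:
    """Illustrative pseudo-reward variant of LinUCB (not the paper's exact algorithm)."""

    def __init__(self, n_arms, dim, alpha=1.0, gamma=0.95, lam=1.0):
        self.alpha = alpha  # exploration width, as in LinUCB
        self.gamma = gamma  # assumed forgetting factor for pseudo-reward statistics
        # Per-arm ridge-regression statistics from real (observed) rewards.
        self.A = [lam * np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]
        # Separate statistics fed by pseudo-rewards; these are discounted
        # over time so that early over-estimates fade in the long run.
        self.Ap = [np.zeros((dim, dim)) for _ in range(n_arms)]
        self.bp = [np.zeros(dim) for _ in range(n_arms)]

    def _ucb(self, a, x):
        # Upper-confidence estimate combining real and pseudo statistics.
        A = self.A[a] + self.Ap[a]
        theta = np.linalg.solve(A, self.b[a] + self.bp[a])
        width = np.sqrt(x @ np.linalg.solve(A, x))
        return float(x @ theta + self.alpha * width)

    def select(self, x):
        return int(np.argmax([self._ucb(a, x) for a in range(len(self.A))]))

    def update(self, x, chosen, reward):
        for a in range(len(self.A)):
            if a == chosen:
                # Real feedback: the usual LinUCB ridge-regression update.
                self.A[a] += np.outer(x, x)
                self.b[a] += reward * x
            else:
                # Non-selected arm: feed an optimistic pseudo-reward, and
                # discount the old pseudo statistics (forgetting).
                pseudo = self._ucb(a, x)
                self.Ap[a] = self.gamma * self.Ap[a] + np.outer(x, x)
                self.bp[a] = self.gamma * self.bp[a] + pseudo * x

In this sketch, a smaller gamma forgets pseudo information faster, while a larger gamma lets the optimistic estimates shape the per-arm models for longer, which is how the sketch mimics the faster early reward-gathering described in the abstract.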

Cite this Paper


BibTeX
@InProceedings{pmlr-v39-chou14,
  title     = {Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions},
  author    = {Chou, Ku-Chun and Lin, Hsuan-Tien and Chiang, Chao-Kai and Lu, Chi-Jen},
  booktitle = {Proceedings of the Sixth Asian Conference on Machine Learning},
  pages     = {344--359},
  year      = {2015},
  editor    = {Phung, Dinh and Li, Hang},
  volume    = {39},
  series    = {Proceedings of Machine Learning Research},
  address   = {Nha Trang City, Vietnam},
  month     = {26--28 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v39/chou14.pdf},
  url       = {https://proceedings.mlr.press/v39/chou14.html}
}
Endnote
%0 Conference Paper
%T Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions
%A Ku-Chun Chou
%A Hsuan-Tien Lin
%A Chao-Kai Chiang
%A Chi-Jen Lu
%B Proceedings of the Sixth Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2015
%E Dinh Phung
%E Hang Li
%F pmlr-v39-chou14
%I PMLR
%P 344--359
%U https://proceedings.mlr.press/v39/chou14.html
%V 39
RIS
TY - CPAPER
TI - Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions
AU - Ku-Chun Chou
AU - Hsuan-Tien Lin
AU - Chao-Kai Chiang
AU - Chi-Jen Lu
BT - Proceedings of the Sixth Asian Conference on Machine Learning
DA - 2015/02/16
ED - Dinh Phung
ED - Hang Li
ID - pmlr-v39-chou14
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 39
SP - 344
EP - 359
L1 - http://proceedings.mlr.press/v39/chou14.pdf
UR - https://proceedings.mlr.press/v39/chou14.html
ER -
APA
Chou, K., Lin, H., Chiang, C. & Lu, C. (2015). Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions. Proceedings of the Sixth Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 39:344-359. Available from https://proceedings.mlr.press/v39/chou14.html.