Trading off Rewards and Errors in Multi-Armed Bandits

Akram Erraqabi, Alessandro Lazaric, Michal Valko, Emma Brunskill, Yun-En Liu
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:709-717, 2017.

Abstract

In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation errors) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large reward). In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm whose performance is provably close to the best possible tradeoff strategy. Finally, we demonstrate on real-world educational data that ForcingBalance returns useful information about the arms without compromising the overall reward.
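The tradeoff described above can be made concrete with a toy simulation. Below is a minimal sketch of a forced-exploration heuristic in the spirit of the abstract: a mixing weight `w` (a hypothetical parameter, not the paper's notation) balances pulling the empirically best arm (cumulative reward) against pulling under-sampled arms (estimation accuracy). This is an illustrative assumption only, not the ForcingBalance algorithm or its analysis from the paper.

```python
import numpy as np

# Illustrative sketch only: a heuristic that trades off cumulative reward
# against per-arm estimation error. The mixing weight `w` and the scoring
# rule are assumptions for illustration; they are NOT the paper's
# ForcingBalance algorithm or its notation.

def run_bandit(means, horizon, w=0.5, seed=0):
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k, dtype=int)   # pulls per arm
    sums = np.zeros(k)                # cumulative reward per arm

    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once to initialize estimates
        else:
            est = sums / counts
            # Score each arm by a weighted mix of its estimated mean reward
            # (exploitation) and how under-sampled it is (estimation accuracy).
            under_sampled = 1.0 / np.sqrt(counts)
            arm = int(np.argmax(w * est + (1 - w) * under_sampled))
        reward = rng.normal(means[arm], 1.0)  # Gaussian rewards (assumed)
        counts[arm] += 1
        sums[arm] += reward

    est_error = np.abs(sums / counts - np.asarray(means))
    return sums.sum(), est_error

total_reward, errors = run_bandit([0.2, 0.5, 0.8], horizon=5000, w=0.7)
print(f"cumulative reward: {total_reward:.1f}")
print("per-arm estimation errors:", np.round(errors, 3))
```

Pushing `w` toward 1 recovers a greedy reward-maximizing policy, while pushing it toward 0 approaches uniform sampling, which keeps every arm's estimate accurate at the cost of reward; the paper's contribution is an algorithm with provable guarantees for this kind of tradeoff.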

Cite this Paper


BibTeX
@InProceedings{pmlr-v54-erraqabi17a,
  title     = {{Trading off Rewards and Errors in Multi-Armed Bandits}},
  author    = {Erraqabi, Akram and Lazaric, Alessandro and Valko, Michal and Brunskill, Emma and Liu, Yun-En},
  booktitle = {Proceedings of the 20th International Conference on Artificial Intelligence and Statistics},
  pages     = {709--717},
  year      = {2017},
  editor    = {Singh, Aarti and Zhu, Jerry},
  volume    = {54},
  series    = {Proceedings of Machine Learning Research},
  month     = {20--22 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v54/erraqabi17a/erraqabi17a.pdf},
  url       = {https://proceedings.mlr.press/v54/erraqabi17a.html},
  abstract  = {In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation errors) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large reward). In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm whose performance is provably close to the best possible tradeoff strategy. Finally, we demonstrate on real-world educational data that ForcingBalance returns useful information about the arms without compromising the overall reward.}
}
Endnote
%0 Conference Paper
%T Trading off Rewards and Errors in Multi-Armed Bandits
%A Akram Erraqabi
%A Alessandro Lazaric
%A Michal Valko
%A Emma Brunskill
%A Yun-En Liu
%B Proceedings of the 20th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2017
%E Aarti Singh
%E Jerry Zhu
%F pmlr-v54-erraqabi17a
%I PMLR
%P 709--717
%U https://proceedings.mlr.press/v54/erraqabi17a.html
%V 54
%X In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation errors) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large reward). In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm whose performance is provably close to the best possible tradeoff strategy. Finally, we demonstrate on real-world educational data that ForcingBalance returns useful information about the arms without compromising the overall reward.
APA
Erraqabi, A., Lazaric, A., Valko, M., Brunskill, E., & Liu, Y. (2017). Trading off Rewards and Errors in Multi-Armed Bandits. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 54:709-717. Available from https://proceedings.mlr.press/v54/erraqabi17a.html.

Related Material

Paper PDF: http://proceedings.mlr.press/v54/erraqabi17a/erraqabi17a.pdf