Incentivizing Exploration by Heterogeneous Users

Bangrui Chen, Peter Frazier, David Kempe
Proceedings of the 31st Conference On Learning Theory, PMLR 75:798-818, 2018.

Abstract

We consider the problem of incentivizing exploration with heterogeneous agents. In this problem, $N$ bandit arms provide vector-valued outcomes equal to an unknown arm-specific attribute vector, perturbed by independent noise. Agents arrive sequentially and choose arms to pull based on their own private and heterogeneous linear utility functions over attributes and the estimates of the arms' attribute vectors derived from observations of other agents' past pulls. Agents are myopic and selfish, and thus choose the arm with maximum estimated utility. A principal, knowing only the distribution from which agents' preferences are drawn, but not the specific draws, can offer arm-specific incentive payments to encourage agents to explore underplayed arms. The principal seeks to minimize the total expected cumulative regret incurred by agents relative to their best arms, while also making a small expected cumulative payment. We propose an algorithm that incentivizes arms played infrequently in the past whose probability of being played in the next round would be small without incentives. Under the assumption that each arm is preferred by at least a fraction $p > 0$ of agents, we show that this algorithm achieves expected cumulative regret of $O(N e^{2/p} + N \log^3(T))$, using expected cumulative payments of $O(N^2 e^{2/p})$. If $p$ is known or the distribution over agent preferences is discrete, the exponential term $e^{2/p}$ can be replaced with suitable polynomials in $N$ and $1/p$. For discrete preferences, the regret's dependence on $T$ can be eliminated entirely, giving constant (depending only polynomially on $N$ and $1/p$) expected regret and payments. This constant regret stands in contrast to the $\Theta(\log(T))$ dependence of regret in standard multi-armed bandit problems. It arises because even unobserved heterogeneity in agent preferences causes exploitation of arms to also explore arms fully; succinctly, heterogeneity provides free exploration.
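The incentive rule described in the abstract, paying for arms that have been pulled infrequently and that would otherwise be unlikely to be chosen next round, can be illustrated with a minimal sketch. The function name choose_incentives, the thresholds min_prob and min_pulls, the flat bonus payment, and the use of preference samples to estimate choice probabilities are illustrative assumptions, not details taken from the paper.

import numpy as np

def choose_incentives(pull_counts, attr_estimates, pref_samples,
                      min_prob=0.1, min_pulls=None, bonus=1.0):
    """Hypothetical sketch of the abstract's incentive rule.

    pull_counts    : array (N,), number of past pulls per arm
    attr_estimates : array (N, d), current estimates of arm attribute vectors
    pref_samples   : array (S, d), preference vectors sampled from the known
                     distribution over agent utilities
    Returns a dict {arm_index: payment} of arm-specific incentive offers.
    """
    N = len(pull_counts)
    if min_pulls is None:
        min_pulls = max(1, int(np.median(pull_counts)))

    # Estimated utility of each arm under each sampled preference vector.
    utilities = pref_samples @ attr_estimates.T        # shape (S, N)
    best_arm = np.argmax(utilities, axis=1)            # myopic choice per sample
    play_prob = np.bincount(best_arm, minlength=N) / len(pref_samples)

    # Incentivize under-played arms that are unlikely to be chosen unaided.
    return {a: bonus for a in range(N)
            if pull_counts[a] < min_pulls and play_prob[a] < min_prob}

In this sketch the principal re-estimates the unaided play probabilities each round from the preference distribution it knows, and only pays on arms that fail both the play-count and play-probability checks; the paper's actual payment sizes and thresholds are chosen to meet the stated regret and payment bounds.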

Cite this Paper


BibTeX
@InProceedings{pmlr-v75-chen18a,
  title     = {Incentivizing Exploration by Heterogeneous Users},
  author    = {Chen, Bangrui and Frazier, Peter and Kempe, David},
  booktitle = {Proceedings of the 31st Conference On Learning Theory},
  pages     = {798--818},
  year      = {2018},
  editor    = {Bubeck, Sébastien and Perchet, Vianney and Rigollet, Philippe},
  volume    = {75},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v75/chen18a/chen18a.pdf},
  url       = {https://proceedings.mlr.press/v75/chen18a.html}
}
Endnote
%0 Conference Paper
%T Incentivizing Exploration by Heterogeneous Users
%A Bangrui Chen
%A Peter Frazier
%A David Kempe
%B Proceedings of the 31st Conference On Learning Theory
%C Proceedings of Machine Learning Research
%D 2018
%E Sébastien Bubeck
%E Vianney Perchet
%E Philippe Rigollet
%F pmlr-v75-chen18a
%I PMLR
%P 798--818
%U https://proceedings.mlr.press/v75/chen18a.html
%V 75
APA
Chen, B., Frazier, P. & Kempe, D. (2018). Incentivizing Exploration by Heterogeneous Users. Proceedings of the 31st Conference On Learning Theory, in Proceedings of Machine Learning Research 75:798-818. Available from https://proceedings.mlr.press/v75/chen18a.html.