Online Recommendations for Agents with Discounted Adaptive Preferences

William Brown, Arpit Agarwal
Proceedings of The 35th International Conference on Algorithmic Learning Theory, PMLR 237:244-281, 2024.

Abstract

We consider a bandit recommendations problem in which an agent’s preferences (representing selection probabilities over recommended items) evolve as a function of past selections, according to an unknown “preference model”. In each round, we show a menu of $k$ items (out of $n$ total) to the agent, who then chooses a single item, and we aim to minimize regret with respect to some “target set” (a subset of the item simplex) for adversarial losses over the agent’s choices. Extending the setting of Agarwal and Brown (2022), where uniform-memory agents were considered, here we allow for non-uniform memory in which a discount factor is applied to the agent’s memory vector at each subsequent round. In the “long-term memory” regime (when the effective memory horizon scales sublinearly with $T$), we show that efficient sublinear regret is obtainable with respect to the set of everywhere instantaneously realizable distributions (the “EIRD set”, as formulated in prior work) for any smooth preference model. Further, for preferences which are bounded above and below by linear functions of memory weight (we call these “scale-bounded” preferences), we give an algorithm which obtains efficient sublinear regret with respect to nearly the entire item simplex. We show an NP-hardness result for expanding to targets beyond EIRD in general. In the “short-term memory” regime (when the memory horizon is constant), we show that scale-bounded preferences again enable efficient sublinear regret for nearly the entire simplex, even without smoothness, provided losses do not change too frequently; yet we show an information-theoretic barrier for competing against the EIRD set under arbitrary smooth preference models, even when losses are constant.
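To make the discounted-memory dynamics concrete, here is a minimal sketch of the update the abstract describes; the notation ($v_t$ for the memory vector, $\gamma$ for the discount factor, $e_{i_t}$ for the indicator vector of the item selected in round $t$) is ours and may differ from the paper’s. Under this reading, after the agent selects item $i_t$ the memory evolves as

$$v_{t+1} = \gamma \, v_t + e_{i_t},$$

so that the weight of a past selection decays geometrically and the effective memory horizon is on the order of $1/(1-\gamma)$. Taking $\gamma = 1$ recovers the uniform-memory setting of Agarwal and Brown (2022); the “long-term memory” regime corresponds to $1/(1-\gamma)$ growing sublinearly in $T$, and the “short-term memory” regime to $1/(1-\gamma)$ remaining constant. The preference model then maps the (normalized) memory vector and the displayed menu of $k$ items to a selection distribution over that menu.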

Cite this Paper


BibTeX
@InProceedings{pmlr-v237-brown24a,
  title     = {Online Recommendations for Agents with Discounted Adaptive Preferences},
  author    = {Brown, William and Agarwal, Arpit},
  booktitle = {Proceedings of The 35th International Conference on Algorithmic Learning Theory},
  pages     = {244--281},
  year      = {2024},
  editor    = {Vernade, Claire and Hsu, Daniel},
  volume    = {237},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--28 Feb},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v237/brown24a/brown24a.pdf},
  url       = {https://proceedings.mlr.press/v237/brown24a.html}
}
Endnote
%0 Conference Paper
%T Online Recommendations for Agents with Discounted Adaptive Preferences
%A William Brown
%A Arpit Agarwal
%B Proceedings of The 35th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2024
%E Claire Vernade
%E Daniel Hsu
%F pmlr-v237-brown24a
%I PMLR
%P 244--281
%U https://proceedings.mlr.press/v237/brown24a.html
%V 237
APA
Brown, W. & Agarwal, A. (2024). Online Recommendations for Agents with Discounted Adaptive Preferences. Proceedings of The 35th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 237:244-281. Available from https://proceedings.mlr.press/v237/brown24a.html.
