Online Recommendations for Agents with Discounted Adaptive Preferences
Proceedings of The 35th International Conference on Algorithmic Learning Theory, PMLR 237:244-281, 2024.
Abstract
We consider a bandit recommendations problem in which an agent’s preferences (representing selection probabilities over recommended items) evolve as a function of past selections, according to an unknown \textit{preference model}. In each round, we show a menu of $k$ items (out of $n$ total) to the agent, who then chooses a single item, and we aim to minimize regret with respect to some \textit{target set} (a subset of the item simplex) for adversarial losses over the agent’s choices. Extending the setting from \cite{AgarwalB22}, where uniform-memory agents were considered, here we allow for non-uniform memory, in which a discount factor is applied to the agent’s memory vector at each subsequent round. In the “long-term memory” regime (when the effective memory horizon scales sublinearly with $T$), we show that efficient sublinear regret is obtainable with respect to the set of \textit{everywhere instantaneously realizable distributions} (the “EIRD set”, as formulated in prior work) for any \textit{smooth} preference model. Further, for preferences that are bounded above and below by linear functions of memory weight (we call these “scale-bounded” preferences), we give an algorithm that obtains efficient sublinear regret with respect to nearly the \textit{entire} item simplex. We show an NP-hardness result for expanding to targets beyond EIRD in general. In the “short-term memory” regime (when the memory horizon is constant), we show that scale-bounded preferences again enable efficient sublinear regret for nearly the entire simplex, even without smoothness, provided losses do not change too frequently; yet we show an information-theoretic barrier to competing against the EIRD set under arbitrary smooth preference models, even when losses are constant.
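To make the interaction protocol concrete, the following is a minimal simulation sketch of the dynamics described in the abstract: in each round the recommender shows a menu of $k$ of the $n$ items, the agent selects one item with probability given by its preference model applied to a discounted memory vector, and the memory is then discounted and updated with the chosen item. The softmax preference model, the discount factor $\gamma = 1 - 1/\sqrt{T}$ (giving an effective memory horizon of roughly $1/(1-\gamma) = \sqrt{T}$, sublinear in $T$ as in the “long-term memory” regime), and the random losses are illustrative assumptions for this sketch only; they are not the paper’s constructions or algorithm.

import numpy as np

# Illustrative simulation of the recommendation protocol with a discounted-memory agent.
# All modeling choices below (softmax preferences, gamma, random losses) are assumptions
# made for this sketch, not the paper's preference model or algorithm.

rng = np.random.default_rng(0)

n, k, T = 5, 3, 1000          # n items, menus of size k, horizon T
gamma = 1 - 1 / np.sqrt(T)    # discount factor; effective memory horizon ~ 1/(1-gamma) = sqrt(T)

memory = np.zeros(n)                 # agent's discounted memory vector over items
losses = rng.uniform(size=(T, n))    # placeholder losses (adversarial in the paper)
total_loss = 0.0

for t in range(T):
    menu = rng.choice(n, size=k, replace=False)   # recommender shows a menu of k items
    # Agent's selection probabilities over the menu, as an (assumed) smooth
    # function of its memory weights for the offered items.
    scores = np.exp(memory[menu])
    probs = scores / scores.sum()
    choice = rng.choice(menu, p=probs)            # agent selects a single item
    total_loss += losses[t, choice]
    # Discounted memory update: existing memory shrinks by gamma, chosen item gains weight.
    memory = gamma * memory
    memory[choice] += 1.0

print(f"average loss over {T} rounds: {total_loss / T:.3f}")

The recommender's problem studied in the paper is to choose the menus so that the induced distribution over the agent's choices competes, in hindsight, with the best point in the target set (the EIRD set, or nearly the whole simplex under scale-bounded preferences).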