Online Recommendations for Agents with Discounted Adaptive Preferences
Proceedings of The 35th International Conference on Algorithmic Learning Theory, PMLR 237:244-281, 2024.
Abstract
We consider a bandit recommendations problem in which an agent’s preferences (representing selection probabilities over recommended items) evolve as a function of past selections, according to an unknown \textit{preference model}. In each round, we show a menu of $k$ items (out of $n$ total) to the agent, who then chooses a single item, and we aim to minimize regret with respect to some \textit{target set} (a subset of the item simplex) for adversarial losses over the agent’s choices. Extending the setting from \cite{AgarwalB22}, where uniform-memory agents were considered, here we allow for non-uniform memory, in which a discount factor is applied to the agent’s memory vector at each subsequent round. In the “long-term memory” regime (when the effective memory horizon scales sublinearly with $T$), we show that efficient sublinear regret is obtainable with respect to the set of \textit{everywhere instantaneously realizable distributions} (the “EIRD set”, as formulated in prior work) for any \textit{smooth} preference model. Further, for preferences that are bounded above and below by linear functions of memory weight (we call these “scale-bounded” preferences), we give an algorithm that obtains efficient sublinear regret with respect to nearly the \textit{entire} item simplex. We show an NP-hardness result for expanding to targets beyond EIRD in general. In the “short-term memory” regime (when the memory horizon is constant), we show that scale-bounded preferences again enable efficient sublinear regret for nearly the entire simplex, even without smoothness, provided losses do not change too frequently; yet we show an information-theoretic barrier to competing against the EIRD set under arbitrary smooth preference models, even when losses are constant.
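To make the interaction protocol concrete, the following is a minimal simulation sketch of the dynamics described in the abstract: in each round the recommender shows a menu of $k$ of the $n$ items, the agent selects one item with probability given by its preference model applied to a discounted memory vector, and the memory is then discounted and updated with the chosen item. The softmax preference model, the discount factor $\gamma = 1 - 1/\sqrt{T}$ (giving an effective memory horizon of roughly $1/(1-\gamma) = \sqrt{T}$, sublinear in $T$ as in the “long-term memory” regime), and the random losses are illustrative assumptions for this sketch only; they are not the paper’s constructions or algorithm.

import numpy as np

# Illustrative simulation of the recommendation protocol with a discounted-memory agent.
# All modeling choices below (softmax preferences, gamma, random losses) are assumptions
# made for this sketch, not the paper's preference model or algorithm.

rng = np.random.default_rng(0)

n, k, T = 5, 3, 1000          # n items, menus of size k, horizon T
gamma = 1 - 1 / np.sqrt(T)    # discount factor; effective memory horizon ~ 1/(1-gamma) = sqrt(T)

memory = np.zeros(n)                 # agent's discounted memory vector over items
losses = rng.uniform(size=(T, n))    # placeholder losses (adversarial in the paper)
total_loss = 0.0

for t in range(T):
    menu = rng.choice(n, size=k, replace=False)   # recommender shows a menu of k items
    # Agent's selection probabilities over the menu, as an (assumed) smooth
    # function of its memory weights for the offered items.
    scores = np.exp(memory[menu])
    probs = scores / scores.sum()
    choice = rng.choice(menu, p=probs)            # agent selects a single item
    total_loss += losses[t, choice]
    # Discounted memory update: existing memory shrinks by gamma, chosen item gains weight.
    memory = gamma * memory
    memory[choice] += 1.0

print(f"average loss over {T} rounds: {total_loss / T:.3f}")

The recommender's problem studied in the paper is to choose the menus so that the induced distribution over the agent's choices competes, in hindsight, with the best point in the target set (the EIRD set, or nearly the whole simplex under scale-bounded preferences).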