Linearizing contextual bandits with latent state dynamics

Elliot Nelson, Debarun Bhattacharjya, Tian Gao, Miao Liu, Djallel Bouneffouf, Pascal Poupart
Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, PMLR 180:1477-1487, 2022.

Abstract

In many real-world applications of multi-armed bandit problems, both rewards and contexts are often influenced by confounding latent variables which evolve stochastically over time. While the observed contexts and rewards are nonlinearly related, we show that prior knowledge of latent causal structure can be used to reduce the problem to the linear bandit setting. We develop two algorithms, Latent Linear Thompson Sampling (L2TS) and Latent Linear UCB (L2UCB), which use online EM algorithms for hidden Markov models to learn the latent transition model and maintain a posterior belief over the latent state, and then use the resulting posteriors as context features in a linear bandit problem. We upper bound the error in reward estimation in the presence of a dynamical latent state, and derive a novel problem-dependent regret bound for linear Thompson sampling with non-stationarity and unconstrained reward distributions, which we apply to L2TS under certain conditions. Finally, we demonstrate the superiority of our algorithms over related bandit algorithms through experiments.
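Below is a minimal sketch (not the authors' implementation) of the core idea: filter a posterior belief over a discrete latent state and hand that belief vector to linear Thompson sampling as the context feature. For simplicity it assumes a known K-state transition matrix and Gaussian context emissions, whereas the paper learns the transition model with online EM for HMMs; all variable names and parameter values here are illustrative assumptions.

```python
# Minimal sketch of the L2TS idea under simplifying assumptions:
# - latent state z_t follows a known K-state Markov chain (the paper learns it via online EM)
# - contexts x_t are emitted from state-dependent Gaussians
# - each arm's expected reward is linear in the belief b_t = P(z_t | x_{1:t})
import numpy as np

rng = np.random.default_rng(0)

K, n_arms, T = 3, 2, 2000            # latent states, arms, horizon
A = np.array([[0.90, 0.05, 0.05],     # latent transition matrix (assumed known here)
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
mu = np.array([-2.0, 0.0, 2.0])       # state-dependent context emission means
theta = rng.normal(size=(n_arms, K))  # true per-arm reward weights over latent states

# Linear Thompson sampling statistics per arm (ridge regression on belief features).
prec = [np.eye(K) for _ in range(n_arms)]   # precision matrices
bvec = [np.zeros(K) for _ in range(n_arms)] # response-weighted feature sums

belief = np.full(K, 1.0 / K)          # posterior belief over the latent state
z = rng.integers(K)
for t in range(T):
    # Environment: latent state transitions, then a context is emitted.
    z = rng.choice(K, p=A[z])
    x = rng.normal(mu[z], 1.0)

    # Forward-filter belief update: predict with A, correct with the emission likelihood.
    like = np.exp(-0.5 * (x - mu) ** 2)
    belief = (A.T @ belief) * like
    belief /= belief.sum()

    # Linear Thompson sampling with the belief vector as the context feature.
    scores = []
    for a in range(n_arms):
        cov = np.linalg.inv(prec[a])
        theta_hat = cov @ bvec[a]
        theta_tilde = rng.multivariate_normal(theta_hat, cov)
        scores.append(belief @ theta_tilde)
    arm = int(np.argmax(scores))

    # Reward depends on the true latent state; the agent only observes the belief.
    r = theta[arm, z] + 0.1 * rng.normal()

    # Ridge-regression update for the chosen arm.
    prec[arm] += np.outer(belief, belief)
    bvec[arm] += r * belief
```

The same belief features can instead be scored with an upper confidence bound to obtain an L2UCB-style variant; the filtering step is unchanged.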

Cite this Paper

BibTeX
@InProceedings{pmlr-v180-nelson22a,
  title     = {Linearizing contextual bandits with latent state dynamics},
  author    = {Nelson, Elliot and Bhattacharjya, Debarun and Gao, Tian and Liu, Miao and Bouneffouf, Djallel and Poupart, Pascal},
  booktitle = {Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence},
  pages     = {1477--1487},
  year      = {2022},
  editor    = {Cussens, James and Zhang, Kun},
  volume    = {180},
  series    = {Proceedings of Machine Learning Research},
  month     = {01--05 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v180/nelson22a/nelson22a.pdf},
  url       = {https://proceedings.mlr.press/v180/nelson22a.html},
  abstract  = {In many real-world applications of multi-armed bandit problems, both rewards and contexts are often influenced by confounding latent variables which evolve stochastically over time. While the observed contexts and rewards are nonlinearly related, we show that prior knowledge of latent causal structure can be used to reduce the problem to the linear bandit setting. We develop two algorithms, Latent Linear Thompson Sampling (L2TS) and Latent Linear UCB (L2UCB), which use online EM algorithms for hidden Markov models to learn the latent transition model and maintain a posterior belief over the latent state, and then use the resulting posteriors as context features in a linear bandit problem. We upper bound the error in reward estimation in the presence of a dynamical latent state, and derive a novel problem-dependent regret bound for linear Thompson sampling with non-stationarity and unconstrained reward distributions, which we apply to L2TS under certain conditions. Finally, we demonstrate the superiority of our algorithms over related bandit algorithms through experiments.}
}
Endnote
%0 Conference Paper
%T Linearizing contextual bandits with latent state dynamics
%A Elliot Nelson
%A Debarun Bhattacharjya
%A Tian Gao
%A Miao Liu
%A Djallel Bouneffouf
%A Pascal Poupart
%B Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2022
%E James Cussens
%E Kun Zhang
%F pmlr-v180-nelson22a
%I PMLR
%P 1477--1487
%U https://proceedings.mlr.press/v180/nelson22a.html
%V 180
%X In many real-world applications of multi-armed bandit problems, both rewards and contexts are often influenced by confounding latent variables which evolve stochastically over time. While the observed contexts and rewards are nonlinearly related, we show that prior knowledge of latent causal structure can be used to reduce the problem to the linear bandit setting. We develop two algorithms, Latent Linear Thompson Sampling (L2TS) and Latent Linear UCB (L2UCB), which use online EM algorithms for hidden Markov models to learn the latent transition model and maintain a posterior belief over the latent state, and then use the resulting posteriors as context features in a linear bandit problem. We upper bound the error in reward estimation in the presence of a dynamical latent state, and derive a novel problem-dependent regret bound for linear Thompson sampling with non-stationarity and unconstrained reward distributions, which we apply to L2TS under certain conditions. Finally, we demonstrate the superiority of our algorithms over related bandit algorithms through experiments.
APA
Nelson, E., Bhattacharjya, D., Gao, T., Liu, M., Bouneffouf, D. & Poupart, P. (2022). Linearizing contextual bandits with latent state dynamics. Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 180:1477-1487. Available from https://proceedings.mlr.press/v180/nelson22a.html.
