Deep Q-Learning with Whittle Index for Contextual Restless Bandits: Application to Email Recommender Systems
Proceedings of the 6th Northern Lights Deep Learning Conference (NLDL), PMLR 265:176-183, 2025.
Abstract
In this paper, we introduce DQWIC, a novel algorithm that combines Deep Reinforcement Learning and Whittle index theory within the Contextual Restless Multi-Armed Bandit framework under the discounted criterion. DQWIC is designed to learn in the non-stationary environments typical of real-world applications such as recommender systems, where user preferences and environmental dynamics change over time. In particular, we apply DQWIC to the problem of optimizing email recommendations, where it tackles the dual challenges of improving content relevance and reducing spam, thereby addressing ethical concerns related to intrusive emailing. The algorithm uses two neural networks: a Q-network that approximates action-value functions and a Whittle-network that estimates Whittle indices. Both take contextual features as input to inform decision-making, and the inclusion of context lets a single model handle many heterogeneous users in a scalable way. Learning proceeds through a two-timescale stochastic approximation: the Q-network is updated frequently to minimize the loss between predicted and target Q-values, while the Whittle-network is updated on a slower time scale. To evaluate its effectiveness, we conducted experiments in partnership with a company specializing in digital marketing. Our results, obtained on both synthetic and real-world data, show that DQWIC outperforms existing email marketing baselines.
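The two-timescale idea in the abstract can be illustrated with a minimal tabular sketch. This is not DQWIC itself: it assumes a single arm with a known transition model, replaces both neural networks with lookup tables, omits context, and uses exact expectations instead of sampled minibatches. The function name `whittle_two_timescale`, the inputs `P` and `R`, and the constant step sizes `alpha` (fast) and `beta` (slow) are hypothetical choices made for illustration. The Whittle index of a state is taken here as the passive-action subsidy that makes the active and passive actions equally valuable in that state.

```python
def whittle_two_timescale(P, R, gamma=0.9, alpha=0.5, beta=0.05, steps=2000):
    """Tabular two-timescale estimate of Whittle indices for one arm.

    Assumed (hypothetical) inputs:
      P[a][s][s2]: transition probability under action a (0 = passive, 1 = active)
      R[a][s]:     reward for taking action a in state s
    Returns lam, where lam[x] is the index estimate for reference state x.
    """
    n = len(R[0])
    # Q[x][s][a]: Q-table of the arm when the passive action earns subsidy lam[x].
    Q = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    lam = [0.0] * n
    for _ in range(steps):
        for x in range(n):
            # Fast time scale: move Q[x] toward its discounted Bellman target.
            V = [max(Q[x][s]) for s in range(n)]
            new_q = []
            for s in range(n):
                row = []
                for a in (0, 1):
                    subsidy = lam[x] if a == 0 else 0.0
                    target = R[a][s] + subsidy + gamma * sum(
                        P[a][s][s2] * V[s2] for s2 in range(n)
                    )
                    row.append((1 - alpha) * Q[x][s][a] + alpha * target)
                new_q.append(row)
            # Slow time scale: nudge the subsidy until active and passive
            # have equal value in the reference state x.
            lam[x] += beta * (Q[x][x][1] - Q[x][x][0])
            Q[x] = new_q
    return lam


# Sanity check: with identical dynamics for both actions and a constant
# reward gap of 0.5, the indifference subsidy is 0.5 in every state.
same = [[0.7, 0.3], [0.4, 0.6]]
indices = whittle_two_timescale([same, same], [[0.0, 1.0], [0.5, 1.5]])
```

Choosing `beta` much smaller than `alpha` is what separates the time scales: the Q-table nearly settles for the current subsidy before the subsidy moves. In the deep setting described in the abstract, the tables would be replaced by the Q-network and the Whittle-network, and the expectations by sampled transitions.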