Imitation-Regularized Offline Learning
Proceedings of Machine Learning Research, PMLR 89:2956-2965, 2019.
Abstract
We study the problem of offline learning in automated decision systems under the contextual bandits model. We are given logged historical data consisting of contexts, (randomized) actions, and (non-negative) rewards. A common goal is to evaluate what would happen if different actions were taken in the same contexts, so as to optimize the action policies accordingly. The typical approach to this problem, inverse probability weighted estimation (IPWE), requires logged action probabilities, which may be missing in practice due to engineering complications. Even when available, small action probabilities cause large uncertainty in IPWE, rendering the corresponding results insignificant. To solve both problems, we show how one can use policy improvement (PIL) objectives, regularized by policy imitation (IML). We motivate and analyze PIL as an extension to Clipped-IPWE, by showing that both are lower-bound surrogates to the vanilla IPWE. We also formally connect IML to IPWE variance estimation and natural policy gradients. Without probability logging, our PIL-IML interpretations justify and improve, by reward-weighting, the state-of-the-art cross-entropy (CE) loss that predicts the action items among all action candidates available in the same contexts. With probability logging, our main theoretical contribution connects IML-underfitting to the existence of either confounding variables or model misspecification. We show the value and accuracy of our insights by simulations based on Simpson's paradox, standard UCI multiclass-to-bandit conversions and on the Criteo counterfactual analysis challenge dataset.
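To make the estimators discussed in the abstract concrete, the following is a minimal sketch of vanilla IPWE and its clipped variant on synthetic logged bandit data. The data-generating setup (three actions, a fixed behavior policy, a deterministic reward, and the clipping threshold M) is entirely hypothetical and chosen for illustration; it is not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged bandit data (contexts omitted for brevity):
# each record holds the logged action, its logging probability, and a reward.
n = 10000
n_actions = 3
logging_probs = np.array([0.5, 0.3, 0.2])      # behavior policy pi_0(a)
actions = rng.choice(n_actions, size=n, p=logging_probs)
rewards = (actions == 2).astype(float)         # action 2 always pays reward 1

# Target policy to evaluate: deterministically pick action 2 (true value = 1).
target_probs = np.array([0.0, 0.0, 1.0])

# Vanilla IPWE of the target policy's value:
#   V_hat = (1/n) * sum_i  pi(a_i) / pi_0(a_i) * r_i
weights = target_probs[actions] / logging_probs[actions]
ipwe = np.mean(weights * rewards)

# Clipped-IPWE caps the importance weights to reduce variance at the cost of
# a downward bias; M = 2 is an arbitrary threshold for this illustration.
M = 2.0
clipped_ipwe = np.mean(np.minimum(weights, M) * rewards)

print(ipwe, clipped_ipwe)
```

Because the target action has a small logging probability (0.2), the importance weight 1/0.2 = 5 is large; clipping it to M = 2 shrinks the estimate below the vanilla IPWE, illustrating the variance-bias trade-off that motivates lower-bound surrogates such as Clipped-IPWE and PIL.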