# Information-Directed Pessimism for Offline Reinforcement Learning

*Proceedings of the 41st International Conference on Machine Learning*, PMLR 235:25226-25264, 2024.

#### Abstract

Policy optimization from batch data, i.e., offline reinforcement learning (RL), is important when collecting data from a current policy is not possible. This setting incurs distribution mismatch between batch training data and trajectories from the current policy. Pessimistic offsets estimate mismatch using concentration bounds, which possess strong theoretical guarantees and simplicity of implementation. These mismatch estimates may be conservative in sparse data regions and less so otherwise, which can cause the penalized methods to underperform their no-penalty variants in practice. We derive a new pessimistic penalty as the distance between the data and the true distribution using an evaluable one-sample test known as Stein discrepancy, which requires minimal smoothness conditions and, notably, allows a mixture family representation of the distribution over next states. This quantity serves as a quantifier of information in offline data, which justifies calling this approach *information-directed pessimism* (IDP) for offline RL. We further establish that this new penalty based on discrete Stein discrepancy yields practical gains in performance while generalizing the regret of prior art to multimodal distributions.
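To give a rough feel for what an "evaluable one-sample test" against a mixture model looks like, the following is a minimal sketch of a kernelized Stein discrepancy (a V-statistic with an RBF kernel) in one continuous dimension. It is an illustration under stated assumptions, not the penalty derived in the paper: the abstract refers to a discrete Stein discrepancy, and the function names (`mixture_score`, `ksd_v_statistic`), the kernel choice, the bandwidth, and the toy Gaussian mixture are all hypothetical. The point it shows is that the statistic needs only samples and the score (gradient of the log density) of the model, which is straightforward to compute for a mixture family.

```python
import numpy as np


def mixture_score(x, weights, means, stds):
    """Score d/dx log p(x) of a 1-D Gaussian mixture (illustrative helper)."""
    x = x[:, None]  # shape (n, 1) so it broadcasts against the components
    comp = weights * np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    dcomp = comp * (-(x - means) / stds**2)  # derivative of each weighted component
    return dcomp.sum(axis=1) / comp.sum(axis=1)


def ksd_v_statistic(samples, score_fn, bandwidth=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy (RBF kernel).

    Assumption: this is the generic continuous KSD, used here only as a
    stand-in for the idea of a Stein-discrepancy-based penalty.
    """
    s = score_fn(samples)
    diff = samples[:, None] - samples[None, :]  # x_i - x_j
    h2 = bandwidth**2
    k = np.exp(-diff**2 / (2 * h2))
    dk_dx = -diff / h2 * k                      # derivative in the first argument
    dk_dy = diff / h2 * k                       # derivative in the second argument
    d2k = (1.0 / h2 - diff**2 / h2**2) * k      # mixed second derivative
    u = s[:, None] * s[None, :] * k + s[:, None] * dk_dy + s[None, :] * dk_dx + d2k
    return u.mean()


# Toy bimodal model (hypothetical values chosen for illustration).
weights = np.array([0.5, 0.5])
means = np.array([-2.0, 2.0])
stds = np.array([0.5, 0.5])


def score(x):
    return mixture_score(x, weights, means, stds)


rng = np.random.default_rng(0)
n = 500
idx = rng.choice(2, size=n, p=weights)
matched = rng.normal(means[idx], stds[idx])  # samples drawn from the model itself
mismatched = rng.normal(0.0, 1.0, size=n)    # unimodal samples that ignore both modes

ksd_match = ksd_v_statistic(matched, score)
ksd_mismatch = ksd_v_statistic(mismatched, score)
print(ksd_match, ksd_mismatch)
```

In this toy setting the discrepancy is small when the samples come from the bimodal model and noticeably larger when they come from a unimodal distribution, which is the kind of signal a data-dependent pessimism penalty could be built on.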