Theoretical Analysis of Efficiency and Robustness of Softmax and Gap-Increasing Operators in Reinforcement Learning
Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:2995-3003, 2019.
In this paper, we propose and analyze conservative value iteration, which unifies value iteration, soft value iteration, advantage learning, and dynamic policy programming. Our analysis shows that algorithms using a combination of gap-increasing and max operators are resilient to stochastic errors, but not to non-stochastic errors. In contrast, algorithms using a softmax operator without a gap-increasing operator are less susceptible to all types of errors, but may display poor asymptotic performance. Algorithms using a combination of gap-increasing and softmax operators are much more effective and may asymptotically outperform algorithms with the max operator. Not only do these theoretical results provide a deep understanding of various reinforcement learning algorithms, but they also highlight the effectiveness of gap-increasing operators, as well as the limitations of traditional greedy value updates by the max operator.
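To make the operators discussed above concrete, the following is a minimal tabular sketch of a conservative-value-iteration-style update on a randomly generated toy MDP. The MDP, the hyperparameter values, and the exact log-average-exp form of the softmax are illustrative assumptions, not taken from the paper; the update combines a softmax backup with a gap-increasing term `alpha * (Psi - v)`, and the special cases noted in the comments mirror the unification described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Random toy MDP (hypothetical example, not from the paper):
# P[s, a, s'] are transition probabilities, R[s, a] are rewards.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.uniform(0.0, 1.0, size=(nS, nA))

def softmax_op(Q, beta):
    """Log-average-exp softmax over actions (one common choice of
    softmax operator); it tends to the max operator as beta -> infinity."""
    m = Q.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(beta * (Q - m)).mean(axis=1, keepdims=True)) / beta).squeeze(1)

def cvi_step(Psi, alpha, beta):
    """One update of action preferences Psi:
    Psi'(s,a) = r(s,a) + gamma * E[softmax Psi(s')]   (softmax backup)
                + alpha * (Psi(s,a) - softmax Psi(s)) (gap-increasing term)."""
    v = softmax_op(Psi, beta)              # soft state values, shape (nS,)
    return R + gamma * (P @ v) + alpha * (Psi - v[:, None])

Psi = np.zeros((nS, nA))
for _ in range(500):
    Psi = cvi_step(Psi, alpha=0.5, beta=10.0)

# Special cases of (alpha, beta):
#   alpha = 0, beta -> inf : value iteration
#   alpha = 0, beta finite : soft value iteration
#   alpha > 0, beta -> inf : advantage learning
#   alpha > 0, beta finite : a dynamic-policy-programming-like update
```

Sweeping `alpha` and `beta` in this sketch is one way to see the trade-offs the analysis formalizes: the gap-increasing term changes how errors accumulate across iterations, while the softmax temperature controls how closely the backup tracks the greedy max.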