Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:6126-6148, 2025.
Abstract
Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) introduce additional challenges. For instance, diverse preferences complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. These RL challenges create confusion about whether the probability of an action for a given state should be increased or decreased, analogous to label noise in classification tasks. In this work, we focus on RL algorithms whose losses share the learning difficulties of cross-entropy, especially for low-probability predictions. To enhance stability, we adapt reverse cross-entropy (RCE), originally proposed for supervised learning with noisy labels, to define a symmetric RL loss. We demonstrate performance improvements across various tasks and model scales. We conduct experiments on discrete-action tasks (Atari games) and continuous-action tasks (the MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO). Notably, SPPO shows strong performance across a range of hyperparameters. Furthermore, we validate the symmetric RL loss in the RLHF framework using PPO for natural language processing tasks such as IMDB positive-sentiment generation and TL;DR summarization.
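To make the idea concrete, below is a minimal PyTorch sketch of what a symmetric policy-gradient loss could look like, assuming the standard CE + RCE combination from the symmetric cross-entropy literature is carried over to the policy-gradient objective. The function name `symmetric_pg_loss` and the weights `ALPHA`, `BETA`, and the log-zero clipping constant `A` are illustrative assumptions; the paper's exact formulation for SA2C/SPPO may differ.

```python
import torch
import torch.nn.functional as F

# Illustrative constants (assumptions, not the paper's values):
# A replaces log(0) with a finite value, as in reverse cross-entropy
# for noisy labels; ALPHA/BETA weight the two terms.
A = -4.0
ALPHA, BETA = 1.0, 1.0

def symmetric_pg_loss(logits, actions, advantages):
    """Sketch of a symmetric policy-gradient loss for discrete actions.

    logits:     (batch, num_actions) policy logits
    actions:    (batch,) sampled action indices (long tensor)
    advantages: (batch,) advantage estimates (e.g. from GAE)
    """
    log_probs = F.log_softmax(logits, dim=-1)   # log pi(.|s)
    probs = log_probs.exp()

    # Standard cross-entropy-like policy-gradient term: -log pi(a_t|s_t)
    ce_term = -log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Reverse term: swap the roles of the one-hot "label" (the taken action)
    # and the prediction; log 0 in the one-hot target is clipped to A, so
    # RCE = -sum_a pi(a|s) * log one_hot(a_t)(a) = -(1 - pi(a_t|s)) * A
    p_a = probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    rce_term = -(1.0 - p_a) * A

    # Weight the combined term by the advantage, mirroring the vanilla
    # policy-gradient loss.
    return (advantages * (ALPHA * ce_term + BETA * rce_term)).mean()
```

Because the reverse term is bounded (it saturates as pi(a_t|s) approaches 0), it tempers the unbounded gradients that the plain log-probability term produces for low-probability actions, which is the intuition behind using it as a stabilizer.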