Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales

Ju-Seung Byun, Andrew Perrault
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:6126-6148, 2025.

Abstract

Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) introduce additional challenges. For instance, diverse preferences complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. These RL challenges create confusion about whether the probability of an action for a given state should be increased or decreased, similar to the noise in labels for classification tasks. In this work, we focus on RL algorithms that share learning difficulties with cross-entropy loss, especially for low-probability predictions. To enhance stability, we adapt reverse cross-entropy (RCE) from supervised learning for noisy data, defining a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO). Notably, SPPO shows strong performance across different hyperparameters. Furthermore, we validate the symmetric RL loss in the RLHF framework using PPO for natural language processing tasks such as IMDB positive sentiment and TL;DR summarization.
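The abstract does not state the loss explicitly, but the construction it describes can be sketched as follows: the usual cross-entropy-style policy-gradient term is paired with a reverse cross-entropy (RCE) term, in the spirit of symmetric cross-entropy for noisy labels. The PyTorch snippet below is a minimal illustration under that assumption only; the function name, the weights alpha and beta, the log-zero clip constant A, and the choice to weight the reverse term by the advantage are illustrative, not taken from the paper.

```python
# Minimal sketch (PyTorch) of the symmetric-loss idea described in the abstract.
# Assumptions (not from the paper): the reverse term follows the symmetric
# cross-entropy recipe for noisy labels with log(0) clipped to a constant A,
# both terms are weighted by the advantage, and alpha/beta are free weights.
import torch
import torch.nn.functional as F


def symmetric_pg_loss(logits, actions, advantages, alpha=1.0, beta=0.1, A=-4.0):
    """Policy-gradient loss plus a reverse cross-entropy (RCE) term.

    logits:     (batch, num_actions) unnormalized action scores
    actions:    (batch,) sampled discrete actions (long)
    advantages: (batch,) advantage estimates, e.g. from GAE
    """
    log_probs = F.log_softmax(logits, dim=-1)        # log pi(.|s)
    probs = log_probs.exp()

    # Forward term: the usual -A(s,a) * log pi(a|s), i.e. cross-entropy against
    # the sampled action, weighted by the advantage.
    ce = -advantages * log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Reverse term: treat the sampled action as a one-hot "label" q and compute
    # -sum_k pi(k|s) log q(k|s), with log 0 replaced by the constant A. This
    # reduces to -A * (1 - pi(a|s)), which stays bounded even when pi(a|s) is tiny.
    one_hot = F.one_hot(actions, num_classes=logits.shape[-1]).float()
    log_label = torch.where(one_hot > 0,
                            torch.zeros_like(one_hot),
                            torch.full_like(one_hot, A))
    rce = -advantages * (probs * log_label).sum(dim=-1)

    return (alpha * ce + beta * rce).mean()


# Example usage with random data (purely illustrative).
logits = torch.randn(4, 6)
actions = torch.randint(0, 6, (4,))
advantages = torch.randn(4)
loss = symmetric_pg_loss(logits, actions, advantages)
```

Per the abstract, SA2C and SPPO apply the symmetric loss within A2C and PPO respectively; how the reverse term interacts with PPO's clipped surrogate is specified in the paper, not in this sketch.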

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-byun25a,
  title     = {Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales},
  author    = {Byun, Ju-Seung and Perrault, Andrew},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {6126--6148},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/byun25a/byun25a.pdf},
  url       = {https://proceedings.mlr.press/v267/byun25a.html},
  abstract  = {Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) introduce additional challenges. For instance, diverse preferences complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. These RL challenges create confusion about whether the probability of an action for a given state should be increased or decreased, similar to the noise in labels for classification tasks. In this work, we focus on RL algorithms that share learning difficulties with cross-entropy loss, especially for low-probability predictions. To enhance stability, we adapt reverse cross-entropy (RCE) from supervised learning for noisy data, defining a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO). Notably, SPPO shows strong performance across different hyperparameters. Furthermore, we validate the symmetric RL loss in the RLHF framework using PPO for natural language processing tasks such as IMDB positive sentiment and TL;DR summarization.}
}
Endnote
%0 Conference Paper
%T Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales
%A Ju-Seung Byun
%A Andrew Perrault
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-byun25a
%I PMLR
%P 6126--6148
%U https://proceedings.mlr.press/v267/byun25a.html
%V 267
%X Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) introduce additional challenges. For instance, diverse preferences complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. These RL challenges create confusion about whether the probability of an action for a given state should be increased or decreased, similar to the noise in labels for classification tasks. In this work, we focus on RL algorithms that share learning difficulties with cross-entropy loss, especially for low-probability predictions. To enhance stability, we adapt reverse cross-entropy (RCE) from supervised learning for noisy data, defining a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO). Notably, SPPO shows strong performance across different hyperparameters. Furthermore, we validate the symmetric RL loss in the RLHF framework using PPO for natural language processing tasks such as IMDB positive sentiment and TL;DR summarization.
APA
Byun, J. & Perrault, A. (2025). Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:6126-6148. Available from https://proceedings.mlr.press/v267/byun25a.html.
