Towards Optimizing Proximal Policy Optimization PPO through Supervised Model-Support
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:260-271, 2026.
Abstract
Reinforcement learning enables agents to learn behaviors by interacting with the environment and maximizing cumulative rewards. Model-free methods are widely used for their simplicity and flexibility, but often converge slowly because they rely solely on trial-and-error learning without knowledge of environment dynamics. Proximal Policy Optimization (PPO) is a popular on-policy algorithm that collects data using its current policy. However, because PPO relies on freshly sampled trajectories, it has limited ability to reuse past experiences, which can lead to repeatedly exploring suboptimal behaviors and slow policy improvement. To address this, we present Model-Support (MS), a supervised assistant that maintains model-free learning principles while improving efficiency. MS learns state-action pairs from high-return trajectories and serves as a supplementary policy that clones high-performing behaviors. While the agent explores broadly, the MS policy samples meaningful actions based on those behaviors. This combination increases action diversity by mixing broad sampling from the agent’s actor with focused sampling from the MS policy. MS acts as a form of local memory, capturing high-reward trajectories and guiding exploration toward promising regions that the agent policy might overwrite or miss. Although PPO uses advantage estimates to emphasize better actions within sampled data, it does not explicitly prioritize high-return trajectories. Consequently, suboptimal experiences still influence learning, weakening valuable signals and slowing convergence. This highlights the role of MS in preserving and cloning high-return behaviors to guide exploration and accelerate convergence.
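The mechanism described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the MS model, how high-return trajectories are selected, or how the two policies are mixed. The sketch assumes a tabular action-count "clone" in place of a learned supervised model, a fixed-size buffer of the top-return trajectories, and a hypothetical mixing probability `ms_prob`; all class, function, and parameter names are invented for illustration.

```python
import heapq
import random
from collections import Counter, defaultdict


class ModelSupportSketch:
    """Toy tabular stand-in for an MS policy (illustrative assumption only)."""

    def __init__(self, capacity=4, seed=0):
        self.capacity = capacity            # how many top-return trajectories to keep
        self._heap = []                     # min-heap of (return, insertion id, trajectory)
        self._count = 0                     # unique id so trajectories are never compared
        self.table = defaultdict(Counter)   # state -> action counts (the "clone")
        self.rng = random.Random(seed)

    def add_trajectory(self, trajectory, episode_return):
        """Store a trajectory, evicting the lowest-return one when full."""
        self._count += 1
        item = (episode_return, self._count, trajectory)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            heapq.heappushpop(self._heap, item)

    def fit(self):
        """Supervised step: count (state, action) pairs in the kept trajectories."""
        self.table.clear()
        for _, _, trajectory in self._heap:
            for state, action in trajectory:
                self.table[state][action] += 1

    def sample(self, state):
        """Sample an action in proportion to its count; None if the state is unseen."""
        counts = self.table.get(state)
        if not counts:
            return None
        actions = list(counts)
        weights = [counts[a] for a in actions]
        return self.rng.choices(actions, weights=weights, k=1)[0]


def mixed_action(state, actor_sample, ms, ms_prob, rng):
    """Use the MS policy with probability ms_prob, otherwise the agent's actor.

    Falls back to the actor when MS has no data for the state.
    """
    if rng.random() < ms_prob:
        action = ms.sample(state)
        if action is not None:
            return action
    return actor_sample(state)
```

In this sketch, `actor_sample` stands in for sampling from the PPO actor, so rollouts would contain a mixture of broad actor samples and focused MS samples. How such mixed data would enter the PPO update is not covered here, since the abstract does not describe it.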