A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning

Khimya Khetarpal, Zhaohan Daniel Guo, Bernardo Avila Pires, Yunhao Tang, Clare Lyle, Mark Rowland, Nicolas Heess, Diana L Borsa, Arthur Guez, Will Dabney
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:181-189, 2025.

Abstract

Learning a good representation is a crucial challenge for reinforcement learning (RL) agents. Self-predictive algorithms jointly learn a latent representation and a dynamics model by bootstrapping from future latent representations (BYOL). Recent work has developed theoretical insights into these algorithms by studying a continuous-time ODE model under the assumption of a fixed policy (BYOL-$\Pi$); this assumption is at odds with practical implementations, which explicitly condition their predictions on future actions. In this work, we take a step towards bridging the gap between theory and practice by analyzing an action-conditional self-predictive objective (BYOL-AC) using the ODE framework. Interestingly, we uncover that BYOL-$\Pi$ and BYOL-AC are related through the lens of variance. We unify the study of these objectives through two complementary lenses: a model-based perspective, in which each objective is related to a low-rank approximation of certain dynamics, and a model-free perspective, which relates the objectives to modified value, Q-value, and advantage functions. This mismatch with the true value functions is reflected in our empirical observation (in both linear and deep RL experiments) that BYOL-$\Pi$ and BYOL-AC either perform very similarly across many tasks or differ in a task-dependent way.
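
To make the contrast concrete, here is a minimal sketch of the two kinds of objectives, written in illustrative notation rather than the paper's own: $\Phi$ denotes the learned representation, $P$ and $P_a$ latent prediction matrices, $\pi$ the behavior policy, and $\mathrm{sg}(\cdot)$ a stop-gradient on the bootstrap target. The sketch assumes a linear latent predictor and squared-error bootstrapping; the paper's exact parameterization and normalization may differ.

$$\mathcal{L}_{\text{BYOL-}\Pi}(\Phi, P) \;=\; \mathbb{E}_{s,\,s' \sim \pi}\Big[\big\| P^{\top}\Phi(s) - \mathrm{sg}\big(\Phi(s')\big)\big\|_2^2\Big],$$
$$\mathcal{L}_{\text{BYOL-AC}}(\Phi, \{P_a\}) \;=\; \mathbb{E}_{s,\;a \sim \pi,\;s'}\Big[\big\| P_a^{\top}\Phi(s) - \mathrm{sg}\big(\Phi(s')\big)\big\|_2^2\Big].$$

The only difference is whether the latent prediction is conditioned on the sampled action; the ODE analysis in the paper studies the continuous-time dynamics of this kind of joint update on the representation and its predictor.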

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-khetarpal25a,
  title = {A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning},
  author = {Khetarpal, Khimya and Guo, Zhaohan Daniel and Pires, Bernardo Avila and Tang, Yunhao and Lyle, Clare and Rowland, Mark and Heess, Nicolas and Borsa, Diana L and Guez, Arthur and Dabney, Will},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages = {181--189},
  year = {2025},
  editor = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume = {258},
  series = {Proceedings of Machine Learning Research},
  month = {03--05 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/khetarpal25a/khetarpal25a.pdf},
  url = {https://proceedings.mlr.press/v258/khetarpal25a.html},
  abstract = {Learning a good representation is a crucial challenge for reinforcement learning (RL) agents. Self-predictive algorithms jointly learn a latent representation and dynamics model by bootstrapping from future latent representations (BYOL). Recent work has developed theoretical insights into these algorithms by studying a continuous-time ODE model in the case of a fixed policy (BYOL-$\Pi$); this assumption is at odds with practical implementations, which explicitly condition their predictions on future actions. In this work, we take a step towards bridging the gap between theory and practice by analyzing an action-conditional self-predictive objective (BYOL-AC) using the ODE framework. Interestingly, we uncover that BYOL-$\Pi$ and BYOL-AC are related through the lens of variance. We unify the study of these objectives through two complementary lenses; a model-based perspective, where each objective is related to low-rank approximation of certain dynamics, and a model-free perspective, which relates the objectives to modified value, Q-value, and Advantage functions. This mismatch with the true value functions leads to the empirical observation (in both linear and deep RL experiments) that BYOL-$\Pi$ and BYOL-AC are either very similar in performance across many tasks or task-dependent.}
}
Endnote
%0 Conference Paper
%T A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning
%A Khimya Khetarpal
%A Zhaohan Daniel Guo
%A Bernardo Avila Pires
%A Yunhao Tang
%A Clare Lyle
%A Mark Rowland
%A Nicolas Heess
%A Diana L Borsa
%A Arthur Guez
%A Will Dabney
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-khetarpal25a
%I PMLR
%P 181--189
%U https://proceedings.mlr.press/v258/khetarpal25a.html
%V 258
%X Learning a good representation is a crucial challenge for reinforcement learning (RL) agents. Self-predictive algorithms jointly learn a latent representation and dynamics model by bootstrapping from future latent representations (BYOL). Recent work has developed theoretical insights into these algorithms by studying a continuous-time ODE model in the case of a fixed policy (BYOL-$\Pi$); this assumption is at odds with practical implementations, which explicitly condition their predictions on future actions. In this work, we take a step towards bridging the gap between theory and practice by analyzing an action-conditional self-predictive objective (BYOL-AC) using the ODE framework. Interestingly, we uncover that BYOL-$\Pi$ and BYOL-AC are related through the lens of variance. We unify the study of these objectives through two complementary lenses; a model-based perspective, where each objective is related to low-rank approximation of certain dynamics, and a model-free perspective, which relates the objectives to modified value, Q-value, and Advantage functions. This mismatch with the true value functions leads to the empirical observation (in both linear and deep RL experiments) that BYOL-$\Pi$ and BYOL-AC are either very similar in performance across many tasks or task-dependent.
APA
Khetarpal, K., Guo, Z.D., Pires, B.A., Tang, Y., Lyle, C., Rowland, M., Heess, N., Borsa, D.L., Guez, A. & Dabney, W. (2025). A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:181-189. Available from https://proceedings.mlr.press/v258/khetarpal25a.html.