Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation

Fengdi Che, Chenjun Xiao, Jincheng Mei, Bo Dai, Ramki Gummadi, Oscar A Ramirez, Christopher K Harris, A. Rupam Mahmood, Dale Schuurmans
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:6372-6396, 2024.

Abstract

We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird’s counterexample and a Four-room task. Furthermore, we explore the control setting, demonstrating that similar convergence conditions apply to Q-learning.
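To make the setting concrete, below is a minimal illustrative sketch (Python/NumPy, not the authors' code) of the estimator the abstract describes: TD(0) prediction with an over-parameterized linear feature map and a periodically synchronized target network, trained on off-policy samples from a small synthetic Markov reward process. All sizes, step sizes, and the sync period are assumptions chosen for illustration, not values from the paper.

import numpy as np

# Minimal sketch of TD(0) prediction with linear function approximation,
# an over-parameterized feature map (feature dimension > number of states),
# and a periodically synced target network. All hyperparameters below are
# illustrative assumptions, not taken from the paper.
rng = np.random.default_rng(0)

n_states = 5          # small Markov reward process (assumed size)
feat_dim = 16         # over-parameterized: feat_dim > n_states
gamma = 0.9           # discount factor (assumed)
alpha = 0.05          # step size (assumed)
sync_every = 100      # target-network update period (assumed)

# Random MRP: transition matrix P, reward vector r, and an off-policy
# state-sampling distribution mu (not the stationary distribution of P).
P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.normal(size=n_states)
mu = rng.dirichlet(np.ones(n_states))

# Over-parameterized linear features: one feat_dim vector per state.
Phi = rng.normal(size=(n_states, feat_dim))

# True value function for comparison: v = (I - gamma P)^{-1} r.
v_true = np.linalg.solve(np.eye(n_states) - gamma * P, r)

w = np.zeros(feat_dim)   # online weights
w_target = w.copy()      # frozen target-network weights

for t in range(20_000):
    s = rng.choice(n_states, p=mu)          # state drawn off-policy from mu
    s_next = rng.choice(n_states, p=P[s])   # transition under the MRP dynamics
    # Bootstrapped target uses the *frozen* target-network weights.
    td_target = r[s] + gamma * Phi[s_next] @ w_target
    td_error = td_target - Phi[s] @ w
    w += alpha * td_error * Phi[s]
    if (t + 1) % sync_every == 0:
        w_target = w.copy()                 # periodic hard sync of the target network

print("max value estimation error:", np.max(np.abs(Phi @ w - v_true)))

The sketch only illustrates the mechanism the abstract refers to (freezing the bootstrap target and using more features than states); it is not a reproduction of the paper's experiments on Baird's counterexample or the Four-room task.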

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-che24a,
  title     = {Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation},
  author    = {Che, Fengdi and Xiao, Chenjun and Mei, Jincheng and Dai, Bo and Gummadi, Ramki and Ramirez, Oscar A and Harris, Christopher K and Mahmood, A. Rupam and Schuurmans, Dale},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {6372--6396},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/che24a/che24a.pdf},
  url       = {https://proceedings.mlr.press/v235/che24a.html}
}
Endnote
%0 Conference Paper
%T Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation
%A Fengdi Che
%A Chenjun Xiao
%A Jincheng Mei
%A Bo Dai
%A Ramki Gummadi
%A Oscar A Ramirez
%A Christopher K Harris
%A A. Rupam Mahmood
%A Dale Schuurmans
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-che24a
%I PMLR
%P 6372--6396
%U https://proceedings.mlr.press/v235/che24a.html
%V 235
APA
Che, F., Xiao, C., Mei, J., Dai, B., Gummadi, R., Ramirez, O.A., Harris, C.K., Mahmood, A.R. & Schuurmans, D. (2024). Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:6372-6396. Available from https://proceedings.mlr.press/v235/che24a.html.
