TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint

Haotian Lin, Pengcheng Wang, Jeff Schneider, Guanya Shi
Proceedings of The 8th Annual Learning for Dynamics and Control Conference, PMLR 331:705-736, 2026.

Abstract

Model-based reinforcement learning (MBRL) algorithms that integrate sampling-based MPC with learned value or policy priors have shown great potential for solving complex continuous control problems. However, existing practice relies on online planning to collect high-quality data, resulting in value learning that is entirely dependent on off-policy experiences. Contrary to the belief that value learned from model-free policy iteration is sufficiently accurate and expressive, we found that severe value overestimation bias occurs, especially in high-dimensional tasks. Through both theoretical analysis and empirical evaluations, we identify that this overestimation stems from a structural policy mismatch: the exploration policy induced by the model-based planner diverges far from the exploitation policy evaluated by the value function. To improve value learning, we emphasize conservatism that mitigates out-of-distribution queries. The proposed method, TD-M(PC)$^2$, addresses this by applying a soft-constrained policy update, a minimalist yet effective solution that can be seamlessly integrated into the existing plan-based MBRL pipeline without incurring additional computational overhead. Extensive experiments demonstrate that the proposed approach improves performance over baselines by large margins, particularly in 61-DoF humanoid control tasks.

Cite this Paper


BibTeX
@InProceedings{pmlr-v331-lin26a,
  title     = {TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint},
  author    = {Lin, Haotian and Wang, Pengcheng and Schneider, Jeff and Shi, Guanya},
  booktitle = {Proceedings of The 8th Annual Learning for Dynamics and Control Conference},
  pages     = {705--736},
  year      = {2026},
  editor    = {Sukhatme, Gaurav and Lindemann, Lars and Tu, Stephen and Wierman, Adam and Atanasov, Nikolay},
  volume    = {331},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--19 Jun},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v331/main/assets/lin26a/lin26a.pdf},
  url       = {https://proceedings.mlr.press/v331/lin26a.html},
  abstract  = {Model-based reinforcement learning (MBRL) algorithms that integrate sampling-based MPC with learned value or policy priors have shown great potential for solving complex continuous control problems. However, existing practice relies on online planning to collect high-quality data, resulting in value learning that is entirely dependent on off-policy experiences. Contrary to the belief that value learned from model-free policy iteration is sufficiently accurate and expressive, we found that severe value overestimation bias occurs, especially in high-dimensional tasks. Through both theoretical analysis and empirical evaluations, we identify that this overestimation stems from a structural policy mismatch: the exploration policy induced by the model-based planner diverges far from the exploitation policy evaluated by the value function. To improve value learning, we emphasize conservatism that mitigates \textit{out-of-distribution} queries. The proposed method, TD-M(PC)$^2$, addresses this by applying a soft-constrained policy update, a minimalist yet effective solution that can be seamlessly integrated into the existing plan-based MBRL pipeline without incurring additional computational overhead. Extensive experiments demonstrate that the proposed approach improves performance over baselines by large margins, particularly in 61-DoF humanoid control tasks.}
}
Endnote
%0 Conference Paper
%T TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint
%A Haotian Lin
%A Pengcheng Wang
%A Jeff Schneider
%A Guanya Shi
%B Proceedings of The 8th Annual Learning for Dynamics and Control Conference
%C Proceedings of Machine Learning Research
%D 2026
%E Gaurav Sukhatme
%E Lars Lindemann
%E Stephen Tu
%E Adam Wierman
%E Nikolay Atanasov
%F pmlr-v331-lin26a
%I PMLR
%P 705--736
%U https://proceedings.mlr.press/v331/lin26a.html
%V 331
%X Model-based reinforcement learning (MBRL) algorithms that integrate sampling-based MPC with learned value or policy priors have shown great potential for solving complex continuous control problems. However, existing practice relies on online planning to collect high-quality data, resulting in value learning that is entirely dependent on off-policy experiences. Contrary to the belief that value learned from model-free policy iteration is sufficiently accurate and expressive, we found that severe value overestimation bias occurs, especially in high-dimensional tasks. Through both theoretical analysis and empirical evaluations, we identify that this overestimation stems from a structural policy mismatch: the exploration policy induced by the model-based planner diverges far from the exploitation policy evaluated by the value function. To improve value learning, we emphasize conservatism that mitigates out-of-distribution queries. The proposed method, TD-M(PC)$^2$, addresses this by applying a soft-constrained policy update, a minimalist yet effective solution that can be seamlessly integrated into the existing plan-based MBRL pipeline without incurring additional computational overhead. Extensive experiments demonstrate that the proposed approach improves performance over baselines by large margins, particularly in 61-DoF humanoid control tasks.
APA
Lin, H., Wang, P., Schneider, J. & Shi, G. (2026). TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint. Proceedings of The 8th Annual Learning for Dynamics and Control Conference, in Proceedings of Machine Learning Research 331:705-736. Available from https://proceedings.mlr.press/v331/lin26a.html.
