TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint

Haotian Lin; Pengcheng Wang; Jeff Schneider; Guanya Shi

Proceedings of The 8th Annual Learning for Dynamics and Control Conference, PMLR 331:705-736, 2026.

Abstract

Model-based reinforcement learning (MBRL) algorithms that integrate sampling-based MPC with learned value or policy priors have shown great potential for solving complex continuous control problems. However, existing practice relies on online planning to collect high-quality data, resulting in value learning that is entirely dependent on off-policy experiences. Contrary to the belief that value learned from model-free policy iteration is sufficiently accurate and expressive, we found that severe value overestimation bias occurs, especially in high-dimensional tasks. Through both theoretical analysis and empirical evaluations, we identify that this overestimation stems from a structural policy mismatch: the exploration policy induced by the model-based planner diverges far from the exploitation policy evaluated by the value function. To improve value learning, we emphasize conservatism that mitigates out-of-distribution queries. The proposed method, TD-M(PC)$^2$, addresses this by applying a soft-constrained policy update, a minimalist yet effective solution that can be seamlessly integrated into the existing plan-based MBRL pipeline without incurring additional computational overhead. Extensive experiments demonstrate that the proposed approach improves performance over baselines by large margins, particularly in 61-DoF humanoid control tasks.

Cite this Paper

BibTeX

@InProceedings{pmlr-v331-lin26a,
  title = 	 {TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint},
  author =       {Lin, Haotian and Wang, Pengcheng and Schneider, Jeff and Shi, Guanya},
  booktitle = 	 {Proceedings of The 8th Annual Learning for Dynamics and Control Conference},
  pages = 	 {705--736},
  year = 	 {2026},
  editor = 	 {Sukhatme, Gaurav and Lindemann, Lars and Tu, Stephen and Wierman, Adam and Atanasov, Nikolay},
  volume = 	 {331},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--19 Jun},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v331/main/assets/lin26a/lin26a.pdf},
  url = 	 {https://proceedings.mlr.press/v331/lin26a.html},
  abstract = 	 {Model-based reinforcement learning (MBRL) algorithms that integrate sampling-based MPC with learned value or policy priors have shown great potential for solving complex continuous control problems. However, existing practice relies on online planning to collect high-quality data, resulting in value learning that is entirely dependent on off-policy experiences. Contrary to the belief that value learned from model-free policy iteration is sufficiently accurate and expressive, we found that severe value overestimation bias occurs, especially in high-dimensional tasks. Through both theoretical analysis and empirical evaluations, we identify that this overestimation stems from a structural policy mismatch: the exploration policy induced by the model-based planner diverges far from the exploitation policy evaluated by the value function. To improve value learning, we emphasize conservatism that mitigates \textit{out-of-distribution} queries. The proposed method, TD-M(PC)$^2$, addresses this by applying a soft-constrained policy update, a minimalist yet effective solution that can be seamlessly integrated into the existing plan-based MBRL pipeline without incurring additional computational overhead. Extensive experiments demonstrate that the proposed approach improves performance over baselines by large margins, particularly in 61-DoF humanoid control tasks.}
}

Endnote

%0 Conference Paper
%T TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint
%A Haotian Lin
%A Pengcheng Wang
%A Jeff Schneider
%A Guanya Shi
%B Proceedings of The 8th Annual Learning for Dynamics and Control Conference
%C Proceedings of Machine Learning Research
%D 2026
%E Gaurav Sukhatme
%E Lars Lindemann
%E Stephen Tu
%E Adam Wierman
%E Nikolay Atanasov
%F pmlr-v331-lin26a
%I PMLR
%P 705--736
%U https://proceedings.mlr.press/v331/lin26a.html
%V 331
%X Model-based reinforcement learning (MBRL) algorithms that integrate sampling-based MPC with learned value or policy priors have shown great potential for solving complex continuous control problems. However, existing practice relies on online planning to collect high-quality data, resulting in value learning that is entirely dependent on off-policy experiences. Contrary to the belief that value learned from model-free policy iteration is sufficiently accurate and expressive, we found that severe value overestimation bias occurs, especially in high-dimensional tasks. Through both theoretical analysis and empirical evaluations, we identify that this overestimation stems from a structural policy mismatch: the exploration policy induced by the model-based planner diverges far from the exploitation policy evaluated by the value function. To improve value learning, we emphasize conservatism that mitigates \textit{out-of-distribution} queries. The proposed method, TD-M(PC)$^2$, addresses this by applying a soft-constrained policy update, a minimalist yet effective solution that can be seamlessly integrated into the existing plan-based MBRL pipeline without incurring additional computational overhead. Extensive experiments demonstrate that the proposed approach improves performance over baselines by large margins, particularly in 61-DoF humanoid control tasks.

APA

Lin, H., Wang, P., Schneider, J., & Shi, G. (2026). TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint. Proceedings of The 8th Annual Learning for Dynamics and Control Conference, in Proceedings of Machine Learning Research 331:705-736. Available from https://proceedings.mlr.press/v331/lin26a.html.
