Offline Opponent Modeling with Truncated Q-driven Instant Policy Refinement

Yuheng Jing, Kai Li, Bingyun Liu, Ziwen Zhang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:28256-28287, 2025.

Abstract

Offline Opponent Modeling (OOM) aims to learn an adaptive autonomous agent policy that dynamically adapts to opponents using an offline dataset from multi-agent games. Previous work assumes that the dataset is optimal. However, this assumption is difficult to satisfy in the real world, and when the dataset is suboptimal, existing approaches struggle to perform well. To tackle this issue, we propose a simple and general algorithmic improvement framework, Truncated Q-driven Instant Policy Refinement (TIPR), to mitigate the dataset-induced suboptimality of OOM algorithms. The TIPR framework is plug-and-play in nature. Compared to the original OOM algorithms, it requires only two extra steps: (1) Learn a horizon-truncated in-context action-value function, namely Truncated Q, from the offline dataset. Truncated Q estimates the expected return within a fixed, truncated horizon and is conditioned on opponent information. (2) At test time, use the learned Truncated Q to instantly decide whether to perform policy refinement and to generate the refined policy. Theoretically, we analyze the rationale of Truncated Q from the perspective of No Maximization Bias probability. Empirically, we conduct extensive comparison and ablation experiments in four representative competitive environments. TIPR effectively improves various OOM algorithms pretrained on suboptimal datasets.
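The two extra steps described in the abstract can be made concrete with a short sketch. Assuming a pretrained OOM policy `base_policy`, a learned in-context estimator `truncated_q` conditioned on opponent information, a discrete candidate action set, and a refinement margin (all names and interfaces here are illustrative assumptions, not the paper's actual implementation), training targets and test-time refinement might look like the following:

import numpy as np

def truncated_return(rewards, t, H, gamma=0.99):
    """Regression target for Truncated Q at step t: the discounted return
    accumulated over the next H steps only (horizon-truncated objective)."""
    window = rewards[t : t + H]
    return sum(gamma ** k * r for k, r in enumerate(window))

def tipr_act(state, opp_info, base_policy, truncated_q, candidate_actions, margin=0.0):
    """Test-time action selection sketch in the spirit of TIPR.

    base_policy(state, opp_info)     -> action proposed by the pretrained OOM policy
    truncated_q(state, opp_info, a)  -> estimated return over a fixed truncated horizon
    candidate_actions                -> discrete action set to search over
    """
    base_action = base_policy(state, opp_info)
    q_base = truncated_q(state, opp_info, base_action)

    # Score every candidate action with the opponent-conditioned Truncated Q.
    q_values = np.array([truncated_q(state, opp_info, a) for a in candidate_actions])
    best_idx = int(np.argmax(q_values))

    # Refine only when Truncated Q signals a sufficiently better action;
    # otherwise keep the original OOM policy's decision.
    if q_values[best_idx] > q_base + margin:
        return candidate_actions[best_idx]
    return base_action

Because the refinement check runs per decision at test time, the pretrained OOM policy itself is left untouched, which is what makes this kind of wrapper plug-and-play across different base algorithms.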

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-jing25a,
  title     = {Offline Opponent Modeling with Truncated Q-driven Instant Policy Refinement},
  author    = {Jing, Yuheng and Li, Kai and Liu, Bingyun and Zhang, Ziwen and Fu, Haobo and Fu, Qiang and Xing, Junliang and Cheng, Jian},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {28256--28287},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/jing25a/jing25a.pdf},
  url       = {https://proceedings.mlr.press/v267/jing25a.html},
  abstract  = {Offline Opponent Modeling (OOM) aims to learn an adaptive autonomous agent policy that dynamically adapts to opponents using an offline dataset from multi-agent games. Previous work assumes that the dataset is optimal. However, this assumption is difficult to satisfy in the real world. When the dataset is suboptimal, existing approaches struggle to work. To tackle this issue, we propose a simple and general algorithmic improvement framework, Truncated Q-driven Instant Policy Refinement (TIPR), to handle the suboptimality of OOM algorithms induced by datasets. The TIPR framework is plug-and-play in nature. Compared to original OOM algorithms, it requires only two extra steps: (1) Learn a horizon-truncated in-context action-value function, namely Truncated Q, using the offline dataset. The Truncated Q estimates the expected return within a fixed, truncated horizon and is conditioned on opponent information. (2) Use the learned Truncated Q to instantly decide whether to perform policy refinement and to generate policy after refinement during testing. Theoretically, we analyze the rationale of Truncated Q from the perspective of No Maximization Bias probability. Empirically, we conduct extensive comparison and ablation experiments in four representative competitive environments. TIPR effectively improves various OOM algorithms pretrained with suboptimal datasets.}
}
Endnote
%0 Conference Paper
%T Offline Opponent Modeling with Truncated Q-driven Instant Policy Refinement
%A Yuheng Jing
%A Kai Li
%A Bingyun Liu
%A Ziwen Zhang
%A Haobo Fu
%A Qiang Fu
%A Junliang Xing
%A Jian Cheng
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-jing25a
%I PMLR
%P 28256--28287
%U https://proceedings.mlr.press/v267/jing25a.html
%V 267
%X Offline Opponent Modeling (OOM) aims to learn an adaptive autonomous agent policy that dynamically adapts to opponents using an offline dataset from multi-agent games. Previous work assumes that the dataset is optimal. However, this assumption is difficult to satisfy in the real world. When the dataset is suboptimal, existing approaches struggle to work. To tackle this issue, we propose a simple and general algorithmic improvement framework, Truncated Q-driven Instant Policy Refinement (TIPR), to handle the suboptimality of OOM algorithms induced by datasets. The TIPR framework is plug-and-play in nature. Compared to original OOM algorithms, it requires only two extra steps: (1) Learn a horizon-truncated in-context action-value function, namely Truncated Q, using the offline dataset. The Truncated Q estimates the expected return within a fixed, truncated horizon and is conditioned on opponent information. (2) Use the learned Truncated Q to instantly decide whether to perform policy refinement and to generate policy after refinement during testing. Theoretically, we analyze the rationale of Truncated Q from the perspective of No Maximization Bias probability. Empirically, we conduct extensive comparison and ablation experiments in four representative competitive environments. TIPR effectively improves various OOM algorithms pretrained with suboptimal datasets.
APA
Jing, Y., Li, K., Liu, B., Zhang, Z., Fu, H., Fu, Q., Xing, J., & Cheng, J. (2025). Offline Opponent Modeling with Truncated Q-driven Instant Policy Refinement. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:28256-28287. Available from https://proceedings.mlr.press/v267/jing25a.html.
