Unveiling Markov heads in Pretrained Language Models for Offline Reinforcement Learning

Wenhao Zhao, Qiushui Xu, Linjie Xu, Lei Song, Jinyu Wang, Chunlai Zhou, Jiang Bian
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:77807-77821, 2025.

Abstract

Recently, incorporating knowledge from pretrained language models (PLMs) into decision transformers (DTs) has attracted significant attention in offline reinforcement learning (RL). These PLMs perform well on RL tasks, raising an intriguing question: what kind of knowledge from PLMs is transferred to RL to achieve such good results? This work investigates this question by quantitatively analyzing each attention head and identifies the Markov head, a crucial component present in the attention heads of PLMs. Markov heads place extreme attention on the last input token and perform well only in short-term environments. Furthermore, we prove that this extreme attention cannot be changed by re-training the embedding layer or by fine-tuning. Inspired by this analysis, we propose a general method, GPT-DTMA, which equips a pretrained DT with a Mixture of Attention (MoA) to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it outperforms baselines in short-term environments, significantly reduces the performance gap of PLMs in long-term scenarios, and its results also validate our theorems.
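
The abstract describes a per-head analysis that flags heads placing extreme attention on the last input token. The paper's own procedure is not reproduced here; the following is a minimal sketch, assuming a Hugging Face GPT-2 model and using the attention weight that each head's final query position assigns to the most recent (last input) token as a simplified stand-in for the paper's quantitative criterion. The 0.5 cutoff is purely illustrative.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

# Load a pretrained GPT-2 and request per-head attention maps.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

inputs = tokenizer("The agent observes the state and selects an action", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
# Score each head by how much attention its final query position places on the
# most recent (last input) token; a score near 1.0 suggests Markov-like behavior
# under this simplified diagnostic.
threshold = 0.5  # illustrative cutoff, not taken from the paper
for layer_idx, attn in enumerate(outputs.attentions):
    last_token_mass = attn[0, :, -1, -1]  # (heads,) weight on the last input token
    for head_idx, score in enumerate(last_token_mass.tolist()):
        if score > threshold:
            print(f"layer {layer_idx}, head {head_idx}: "
                  f"{score:.2f} attention on the last input token")
```

The abstract also summarizes GPT-DTMA, which equips a pretrained DT with a Mixture of Attention. As a rough illustration of the mixing idea only (the class, gate design, and use of a frozen stand-in for the PLM pathway are assumptions, not the paper's architecture), one could blend a frozen attention pathway with a trainable one through a learned, token-wise gate:

```python
import torch
import torch.nn as nn

class MixtureOfAttention(nn.Module):
    """Hypothetical sketch of a mixture-of-attention block: a frozen pathway
    (standing in for the PLM's attention) blended with a trainable one."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.frozen_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.learned_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, 1)
        for p in self.frozen_attn.parameters():  # keep the "pretrained" pathway fixed
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out_frozen, _ = self.frozen_attn(x, x, x)
        out_learned, _ = self.learned_attn(x, x, x)
        g = torch.sigmoid(self.gate(x))  # (batch, seq, 1) token-wise mixing weight
        return g * out_frozen + (1 - g) * out_learned

# Quick shape check on random data.
x = torch.randn(2, 10, 64)
moa = MixtureOfAttention(d_model=64, n_heads=4)
print(moa(x).shape)  # torch.Size([2, 10, 64])
```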

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhao25z,
  title     = {Unveiling {M}arkov heads in Pretrained Language Models for Offline Reinforcement Learning},
  author    = {Zhao, Wenhao and Xu, Qiushui and Xu, Linjie and Song, Lei and Wang, Jinyu and Zhou, Chunlai and Bian, Jiang},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {77807--77821},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhao25z/zhao25z.pdf},
  url       = {https://proceedings.mlr.press/v267/zhao25z.html},
  abstract  = {Recently, incorporating knowledge from pretrained language models (PLMs) into decision transformers (DTs) has generated significant attention in offline reinforcement learning (RL). These PLMs perform well in RL tasks, raising an intriguing question: what kind of knowledge from PLMs has been transferred to RL to achieve such good results? This work first dives into this problem by analyzing each head quantitatively and points out Markov head, a crucial component that exists in the attention heads of PLMs. It leads to extreme attention on the last-input token and performs well only in short-term environments. Furthermore, we prove that this extreme attention cannot be changed by re-training embedding layer or fine-tuning. Inspired by our analysis, we propose a general method GPT-DTMA, which equips a pretrained DT with Mixture of Attention (MoA), to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines, significantly reduces the performance gap of PLMs in long-term scenarios, and the experimental results also validate our theorems.}
}
Endnote
%0 Conference Paper
%T Unveiling Markov heads in Pretrained Language Models for Offline Reinforcement Learning
%A Wenhao Zhao
%A Qiushui Xu
%A Linjie Xu
%A Lei Song
%A Jinyu Wang
%A Chunlai Zhou
%A Jiang Bian
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhao25z
%I PMLR
%P 77807--77821
%U https://proceedings.mlr.press/v267/zhao25z.html
%V 267
%X Recently, incorporating knowledge from pretrained language models (PLMs) into decision transformers (DTs) has generated significant attention in offline reinforcement learning (RL). These PLMs perform well in RL tasks, raising an intriguing question: what kind of knowledge from PLMs has been transferred to RL to achieve such good results? This work first dives into this problem by analyzing each head quantitatively and points out Markov head, a crucial component that exists in the attention heads of PLMs. It leads to extreme attention on the last-input token and performs well only in short-term environments. Furthermore, we prove that this extreme attention cannot be changed by re-training embedding layer or fine-tuning. Inspired by our analysis, we propose a general method GPT-DTMA, which equips a pretrained DT with Mixture of Attention (MoA), to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines, significantly reduces the performance gap of PLMs in long-term scenarios, and the experimental results also validate our theorems.
APA
Zhao, W., Xu, Q., Xu, L., Song, L., Wang, J., Zhou, C. & Bian, J. (2025). Unveiling Markov heads in Pretrained Language Models for Offline Reinforcement Learning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:77807-77821. Available from https://proceedings.mlr.press/v267/zhao25z.html.