Improving Transformer World Models for Data-Efficient RL

Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, J Swaroop Guntupalli, Wolfgang Lehrach, Miguel Lazaro-Gredilla, Kevin Patrick Murphy
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:12944-12968, 2025.

Abstract

We present an approach to model-based RL that achieves new state-of-the-art performance on the challenging Craftax-classic benchmark, an open-world 2D survival game that requires agents to exhibit a wide range of general abilities, such as strong generalization, deep exploration, and long-term reasoning. With a series of careful design choices aimed at improving sample efficiency, our MBRL algorithm achieves a reward of 69.66% after only 1M environment steps, significantly outperforming DreamerV3, which achieves 53.2%, and, for the first time, exceeds human performance of 65.0%. Our method starts by constructing a SOTA model-free baseline, using a novel policy architecture that combines CNNs and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna with warmup", which trains the policy on both real and imagined data; (b) a "nearest-neighbor tokenizer" on image patches, which improves the scheme used to create the transformer world model (TWM) inputs; and (c) "block teacher forcing", which allows the TWM to reason jointly about the future tokens of the next timestep.
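
To make ingredients (b) and (c) concrete, the sketch below illustrates them in plain numpy, based only on the abstract's description. It is not the paper's implementation: the patch size, distance threshold, and all function names here are hypothetical.

```python
# Sketch of (b), a nearest-neighbor tokenizer on image patches: each patch
# reuses the index of its nearest codebook entry, and the codebook grows
# whenever no entry is close enough. Threshold and sizes are illustrative.
import numpy as np

def tokenize_patches(patches, codebook, threshold=0.75):
    """Map each flattened patch to a code index, growing the codebook
    whenever no existing entry lies within `threshold` (Euclidean)."""
    tokens = []
    for patch in patches:
        if codebook:
            dists = np.linalg.norm(np.stack(codebook) - patch, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] < threshold:
                tokens.append(nearest)
                continue
        codebook.append(patch)              # unseen patch becomes a new code
        tokens.append(len(codebook) - 1)
    return tokens

# Sketch of (c), block teacher forcing, read as a block-causal attention
# mask: all tokens of a timestep are predicted jointly, attending to tokens
# of the current and earlier timesteps instead of one token at a time.
def block_causal_mask(num_steps, tokens_per_step):
    block = np.arange(num_steps * tokens_per_step) // tokens_per_step
    return block[None, :] <= block[:, None]   # True = attention allowed

# Usage with an illustrative 64x64 RGB frame split into 8x8 patches.
frame = np.random.rand(64, 64, 3)
patches = [frame[i:i+8, j:j+8].ravel()
           for i in range(0, 64, 8) for j in range(0, 64, 8)]
codebook = []
tokens = tokenize_patches(patches, codebook)
mask = block_causal_mask(num_steps=2, tokens_per_step=len(tokens))
```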

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-dedieu25a,
  title     = {Improving Transformer World Models for Data-Efficient {RL}},
  author    = {Dedieu, Antoine and Ortiz, Joseph and Lou, Xinghua and Wendelken, Carter and Guntupalli, J Swaroop and Lehrach, Wolfgang and Lazaro-Gredilla, Miguel and Murphy, Kevin Patrick},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {12944--12968},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/dedieu25a/dedieu25a.pdf},
  url       = {https://proceedings.mlr.press/v267/dedieu25a.html},
  abstract  = {We present an approach to model-based RL that achieves new state-of-the-art performance on the challenging Craftax-classic benchmark, an open-world 2D survival game that requires agents to exhibit a wide range of general abilities, such as strong generalization, deep exploration, and long-term reasoning. With a series of careful design choices aimed at improving sample efficiency, our MBRL algorithm achieves a reward of 69.66% after only 1M environment steps, significantly outperforming DreamerV3, which achieves 53.2%, and, for the first time, exceeds human performance of 65.0%. Our method starts by constructing a SOTA model-free baseline, using a novel policy architecture that combines CNNs and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna with warmup", which trains the policy on both real and imagined data; (b) a "nearest-neighbor tokenizer" on image patches, which improves the scheme used to create the transformer world model (TWM) inputs; and (c) "block teacher forcing", which allows the TWM to reason jointly about the future tokens of the next timestep.}
}
Endnote
%0 Conference Paper
%T Improving Transformer World Models for Data-Efficient RL
%A Antoine Dedieu
%A Joseph Ortiz
%A Xinghua Lou
%A Carter Wendelken
%A J Swaroop Guntupalli
%A Wolfgang Lehrach
%A Miguel Lazaro-Gredilla
%A Kevin Patrick Murphy
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-dedieu25a
%I PMLR
%P 12944--12968
%U https://proceedings.mlr.press/v267/dedieu25a.html
%V 267
%X We present an approach to model-based RL that achieves new state-of-the-art performance on the challenging Craftax-classic benchmark, an open-world 2D survival game that requires agents to exhibit a wide range of general abilities, such as strong generalization, deep exploration, and long-term reasoning. With a series of careful design choices aimed at improving sample efficiency, our MBRL algorithm achieves a reward of 69.66% after only 1M environment steps, significantly outperforming DreamerV3, which achieves 53.2%, and, for the first time, exceeds human performance of 65.0%. Our method starts by constructing a SOTA model-free baseline, using a novel policy architecture that combines CNNs and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna with warmup", which trains the policy on both real and imagined data; (b) a "nearest-neighbor tokenizer" on image patches, which improves the scheme used to create the transformer world model (TWM) inputs; and (c) "block teacher forcing", which allows the TWM to reason jointly about the future tokens of the next timestep.
APA
Dedieu, A., Ortiz, J., Lou, X., Wendelken, C., Guntupalli, J.S., Lehrach, W., Lazaro-Gredilla, M. & Murphy, K.P. (2025). Improving Transformer World Models for Data-Efficient RL. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:12944-12968. Available from https://proceedings.mlr.press/v267/dedieu25a.html.
