LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations

Anian Ruoss, Fabio Pardo, Harris Chan, Bonnie Li, Volodymyr Mnih, Tim Genewein
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:52303-52344, 2025.

Abstract

In this paper, we present a benchmark to pressure-test today’s frontier models’ multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context — from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect. Some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. To help quantify the impact of other approaches and future innovations, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.
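
The evaluation protocol described above amounts to serializing expert episodes into the model's context and then querying the model for an action at every step of a fresh episode, with the model acting as the policy. The Python sketch below illustrates this in-context imitation learning loop under stated assumptions: the names (Step, build_prompt, run_episode) and the environment/model callables are hypothetical stand-ins for illustration, not the benchmark's actual API.

from dataclasses import dataclass
from typing import Callable, Sequence


# Hypothetical containers for a single expert step and a full demonstration
# episode. Observations are rendered as text here; the benchmark also studies
# image encodings.
@dataclass
class Step:
    observation: str
    action: str


Episode = Sequence[Step]


def build_prompt(demonstrations: Sequence[Episode], current_obs: str, instructions: str) -> str:
    """Concatenate N expert episodes and the current observation into one prompt.

    With hundreds of full episodes, this prompt can reach the very long-context
    (up to a million tokens) regime described in the abstract.
    """
    parts = [instructions]
    for i, episode in enumerate(demonstrations):
        parts.append(f"--- Demonstration {i + 1} ---")
        for step in episode:
            parts.append(f"Observation:\n{step.observation}")
            parts.append(f"Action: {step.action}")
    parts.append("--- Current episode ---")
    parts.append(f"Observation:\n{current_obs}")
    parts.append("Action:")
    return "\n".join(parts)


def run_episode(
    reset_env: Callable[[], str],
    step_env: Callable[[str], tuple[str, float, bool]],  # returns (obs, reward, done)
    query_model: Callable[[str], str],
    demonstrations: Sequence[Episode],
    instructions: str,
) -> float:
    """Roll out one evaluation episode with the model acting as the policy."""
    obs = reset_env()
    total_return, done = 0.0, False
    while not done:
        prompt = build_prompt(demonstrations, obs, instructions)
        action = query_model(prompt).strip()
        obs, reward, done = step_env(action)
        total_return += reward
    return total_return

A driver would then call run_episode for each task, model, and demonstration count (from 0 up to 512 episodes) and compare the resulting returns against the expert's score.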

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-ruoss25a,
  title     = {{LMA}ct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations},
  author    = {Ruoss, Anian and Pardo, Fabio and Chan, Harris and Li, Bonnie and Mnih, Volodymyr and Genewein, Tim},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {52303--52344},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/ruoss25a/ruoss25a.pdf},
  url       = {https://proceedings.mlr.press/v267/ruoss25a.html},
  abstract  = {In this paper, we present a benchmark to pressure-test today’s frontier models’ multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context — from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect. Some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. To help quantify the impact of other approaches and future innovations, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.}
}
Endnote
%0 Conference Paper
%T LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations
%A Anian Ruoss
%A Fabio Pardo
%A Harris Chan
%A Bonnie Li
%A Volodymyr Mnih
%A Tim Genewein
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-ruoss25a
%I PMLR
%P 52303--52344
%U https://proceedings.mlr.press/v267/ruoss25a.html
%V 267
%X In this paper, we present a benchmark to pressure-test today’s frontier models’ multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context — from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect. Some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. To help quantify the impact of other approaches and future innovations, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.
APA
Ruoss, A., Pardo, F., Chan, H., Li, B., Mnih, V. & Genewein, T. (2025). LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:52303-52344. Available from https://proceedings.mlr.press/v267/ruoss25a.html.