DynaMind: Reasoning over Abstract Video Dynamics for Embodied Decision-Making

Ziru Wang, Mengmeng Wang, Jade Dai, Teli Ma, Guo-Jun Qi, Yong Liu, Guang Dai, Jingdong Wang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:64586-64603, 2025.

Abstract

Integrating natural language instructions and visual perception with decision-making is a critical challenge for embodied agents. Existing methods often struggle to balance the conciseness of language commands with the richness of video content. To bridge the gap between modalities, we propose extracting key spatiotemporal patterns from video that capture visual saliency and temporal evolution, referred to as dynamic representation. Building on this, we introduce DynaMind, a framework that enhances decision-making through dynamic reasoning. Specifically, we design an adaptive FrameScorer to evaluate video frames based on semantic consistency and visual saliency, assigning each frame an importance score. These scores are used to filter redundant video content and synthesize compact dynamic representations. Leveraging these representations, we predict critical future dynamics and apply a dynamic-guided policy to generate coherent and context-aware actions. Extensive results demonstrate that DynaMind significantly outperforms the baselines across several simulation benchmarks and real-world scenarios.
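The frame-scoring and filtering step described above can be illustrated with a small sketch. This is a hypothetical reconstruction based only on the abstract: the module name FrameScorer is the paper's, but the scoring formula (cosine similarity to the instruction embedding plus a learned saliency head, fused additively), the top-k selection, and all dimensions are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of the frame-scoring idea from the abstract (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameScorer(nn.Module):
    """Scores each frame by semantic consistency with the instruction plus
    visual saliency, keeps the top-k frames, and pools them into a compact
    "dynamic representation"."""

    def __init__(self, dim: int = 512, keep_k: int = 8):
        super().__init__()
        self.keep_k = keep_k
        self.saliency_head = nn.Linear(dim, 1)  # assumed per-frame saliency score

    def forward(self, frame_feats: torch.Tensor, text_feat: torch.Tensor):
        # frame_feats: (T, dim) pre-extracted frame embeddings
        # text_feat:   (dim,)   instruction embedding
        semantic = F.cosine_similarity(frame_feats, text_feat.unsqueeze(0), dim=-1)  # (T,)
        saliency = self.saliency_head(frame_feats).squeeze(-1)                       # (T,)
        scores = semantic + saliency  # simple additive fusion (assumption)

        k = min(self.keep_k, frame_feats.shape[0])
        kept_idx = torch.topk(scores, k=k).indices.sort().values  # preserve temporal order
        kept = frame_feats[kept_idx]                               # (k, dim) filtered frames

        weights = torch.softmax(scores[kept_idx], dim=0)
        dynamic_repr = (weights.unsqueeze(-1) * kept).sum(0)       # (dim,) compact summary
        return dynamic_repr, kept_idx, scores


if __name__ == "__main__":
    scorer = FrameScorer(dim=512, keep_k=8)
    frames = torch.randn(64, 512)  # stand-in for 64 encoded video frames
    instr = torch.randn(512)       # stand-in for an encoded language instruction
    rep, kept_idx, _ = scorer(frames, instr)
    print(rep.shape, kept_idx.tolist())
```

In the full framework, a representation like `dynamic_repr` would then condition the prediction of future dynamics and the dynamic-guided policy; those components are not sketched here.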

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-wang25cz,
  title     = {{D}yna{M}ind: Reasoning over Abstract Video Dynamics for Embodied Decision-Making},
  author    = {Wang, Ziru and Wang, Mengmeng and Dai, Jade and Ma, Teli and Qi, Guo-Jun and Liu, Yong and Dai, Guang and Wang, Jingdong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {64586--64603},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/wang25cz/wang25cz.pdf},
  url       = {https://proceedings.mlr.press/v267/wang25cz.html}
}
Endnote
%0 Conference Paper
%T DynaMind: Reasoning over Abstract Video Dynamics for Embodied Decision-Making
%A Ziru Wang
%A Mengmeng Wang
%A Jade Dai
%A Teli Ma
%A Guo-Jun Qi
%A Yong Liu
%A Guang Dai
%A Jingdong Wang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-wang25cz
%I PMLR
%P 64586--64603
%U https://proceedings.mlr.press/v267/wang25cz.html
%V 267
APA
Wang, Z., Wang, M., Dai, J., Ma, T., Qi, G., Liu, Y., Dai, G. & Wang, J. (2025). DynaMind: Reasoning over Abstract Video Dynamics for Embodied Decision-Making. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:64586-64603. Available from https://proceedings.mlr.press/v267/wang25cz.html.