Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning

Lang Feng, Weihao Tan, Zhiyi Lyu, Longtao Zheng, Haiyang Xu, Ming Yan, Fei Huang, Bo An
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:16884-16903, 2025.

Abstract

Online fine-tuning vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, their open-ended textual action space and non-end-to-end nature of action generation present significant challenges to effective online exploration in RL, e.g., explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide theoretical analysis proving CoSo’s convergence and policy improvement guarantees, and extensive empirical evaluations supporting CoSo’s effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains. The code is available at https://github.com/langfengQ/CoSo.
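The abstract describes CoSo's counterfactual idea only at a high level. As a rough, self-contained illustration (not the authors' implementation; see the linked repository for the actual method), the Python sketch below masks each generated token and checks whether the post-processed action changes: tokens whose removal alters the parsed action are treated as action-critical, the rest as low-impact. The names parse_action and counterfactual_token_weights and the regex-based action parser are hypothetical, and the binary weights stand in for the soft, per-token influence scores that CoSo presumably uses to focus exploration in its soft RL objective.

import re
from typing import Callable, List

def parse_action(text: str) -> str:
    # Hypothetical post-processing step: extract a structured action such as
    # CLICK(120, 340) from the free-form text a VLM agent generates.
    match = re.search(r"[A-Z_]+\([^)]*\)", text)
    return match.group(0) if match else "NO_OP"

def counterfactual_token_weights(tokens: List[str],
                                 post_process: Callable[[str], str],
                                 mask_token: str = " ") -> List[float]:
    # Counterfactual question for each token: if this token were replaced,
    # would the post-processed action change? Tokens that change the action
    # get weight 1.0 (action-critical); the rest get 0.0 (low impact).
    base_action = post_process("".join(tokens))
    weights = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [mask_token] + tokens[i + 1:]
        weights.append(1.0 if post_process("".join(perturbed)) != base_action else 0.0)
    return weights

if __name__ == "__main__":
    # Toy VLM output: chain-of-thought tokens followed by a device-control action.
    tokens = ["I", " should", " tap", " the", " search", " icon", ": ",
              "CLICK", "(", "120", ",", " 340", ")"]
    for tok, w in zip(tokens, counterfactual_token_weights(tokens, parse_action)):
        print(f"{tok!r:>10}  weight={w}")

In this toy run, the reasoning tokens ("I", " should", ...) receive weight 0.0 while the tokens forming CLICK(120, 340) receive weight 1.0, which is the kind of distinction CoSo exploits to concentrate online exploration on action-critical tokens rather than spreading uniform uncertainty over the whole output.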

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-feng25n,
  title     = {Towards Efficient Online Tuning of {VLM} Agents via Counterfactual Soft Reinforcement Learning},
  author    = {Feng, Lang and Tan, Weihao and Lyu, Zhiyi and Zheng, Longtao and Xu, Haiyang and Yan, Ming and Huang, Fei and An, Bo},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {16884--16903},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/feng25n/feng25n.pdf},
  url       = {https://proceedings.mlr.press/v267/feng25n.html},
  abstract  = {Online fine-tuning vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, their open-ended textual action space and non-end-to-end nature of action generation present significant challenges to effective online exploration in RL, e.g., explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide theoretical analysis proving CoSo’s convergence and policy improvement guarantees, and extensive empirical evaluations supporting CoSo’s effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains. The code is available at https://github.com/langfengQ/CoSo.}
}
Endnote
%0 Conference Paper
%T Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning
%A Lang Feng
%A Weihao Tan
%A Zhiyi Lyu
%A Longtao Zheng
%A Haiyang Xu
%A Ming Yan
%A Fei Huang
%A Bo An
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-feng25n
%I PMLR
%P 16884--16903
%U https://proceedings.mlr.press/v267/feng25n.html
%V 267
%X Online fine-tuning vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, their open-ended textual action space and non-end-to-end nature of action generation present significant challenges to effective online exploration in RL, e.g., explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide theoretical analysis proving CoSo’s convergence and policy improvement guarantees, and extensive empirical evaluations supporting CoSo’s effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains. The code is available at https://github.com/langfengQ/CoSo.
APA
Feng, L., Tan, W., Lyu, Z., Zheng, L., Xu, H., Yan, M., Huang, F. & An, B. (2025). Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:16884-16903. Available from https://proceedings.mlr.press/v267/feng25n.html.
