FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Flow Models

Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, Rudolf Lioutikov
Proceedings of The 9th Conference on Robot Learning, PMLR 305:3736-3761, 2025.

Abstract

Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers a 25.9% improvement over state-of-the-art baselines across 190 tasks spanning ten simulation and real-world benchmarks and demonstrates robustness across diverse robotic embodiments. All code, pretrained weights, and training recipes are publicly released to democratize efficient VLA development.

Cite this Paper

BibTeX
@InProceedings{pmlr-v305-reuss25a,
  title     = {FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Flow Models},
  author    = {Reuss, Moritz and Zhou, Hongyi and R\"{u}hle, Marcel and Ya\u{g}murlu, \"{O}mer Erdin\c{c} and Otto, Fabian and Lioutikov, Rudolf},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {3736--3761},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/reuss25a/reuss25a.pdf},
  url       = {https://proceedings.mlr.press/v305/reuss25a.html},
  abstract  = {Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers a 25.9% improvement over state-of-the-art baselines across 190 tasks spanning ten simulation and real-world benchmarks and demonstrates robustness across diverse robotic embodiments. All code, pretrained weights, and training recipes are publicly released to democratize efficient VLA development.}
}
Endnote
%0 Conference Paper
%T FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Flow Models
%A Moritz Reuss
%A Hongyi Zhou
%A Marcel Rühle
%A Ömer Erdinç Yağmurlu
%A Fabian Otto
%A Rudolf Lioutikov
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-reuss25a
%I PMLR
%P 3736--3761
%U https://proceedings.mlr.press/v305/reuss25a.html
%V 305
%X Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers a 25.9% improvement over state-of-the-art baselines across 190 tasks spanning ten simulation and real-world benchmarks and demonstrates robustness across diverse robotic embodiments. All code, pretrained weights, and training recipes are publicly released to democratize efficient VLA development.
APA
Reuss, M., Zhou, H., Rühle, M., Yağmurlu, Ö. E., Otto, F., & Lioutikov, R. (2025). FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Flow Models. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:3736-3761. Available from https://proceedings.mlr.press/v305/reuss25a.html.