MOSAIC: Modular Foundation Models for Assistive and Interactive Cooking

Huaxiaoyue Wang, Kushal Kedia, Juntao Ren, Rahma Abdullah, Atiksh Bhardwaj, Angela Chao, Kelly Y Chen, Nathaniel Chin, Prithwish Dan, Xinyi Fan, Gonzalo Gonzalez-Pumariega, Aditya Kompella, Maximus Adrian Pace, Yash Sharma, Xiangwan Sun, Neha Sunkara, Sanjiban Choudhury
Proceedings of The 8th Conference on Robot Learning, PMLR 270:2220-2294, 2025.

Abstract

We present MOSAIC, a modular architecture for coordinating multiple robots to (a) interact with users using natural language and (b) manipulate an open vocabulary of everyday objects. At several levels, MOSAIC employs modularity: it leverages multiple large-scale pre-trained models for high-level tasks like language and image recognition, while using streamlined modules designed for low-level task-specific control. This decomposition allows us to reap the complementary benefits of foundation models and precise, more specialized models, enabling our system to scale to complex tasks that involve coordinating multiple robots and humans. First, we unit-test individual modules with 180 episodes of visuomotor picking, 60 episodes of human motion forecasting, and 46 online user evaluations of the task planner. We then extensively evaluate MOSAIC with 60 end-to-end trials. We discuss crucial design decisions, limitations of the current system, and open challenges in this domain.

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-wang25h,
  title     = {MOSAIC: Modular Foundation Models for Assistive and Interactive Cooking},
  author    = {Wang, Huaxiaoyue and Kedia, Kushal and Ren, Juntao and Abdullah, Rahma and Bhardwaj, Atiksh and Chao, Angela and Chen, Kelly Y and Chin, Nathaniel and Dan, Prithwish and Fan, Xinyi and Gonzalez-Pumariega, Gonzalo and Kompella, Aditya and Pace, Maximus Adrian and Sharma, Yash and Sun, Xiangwan and Sunkara, Neha and Choudhury, Sanjiban},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {2220--2294},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/wang25h/wang25h.pdf},
  url       = {https://proceedings.mlr.press/v270/wang25h.html},
  abstract  = {We present MOSAIC, a modular architecture for coordinating multiple robots to (a) interact with users using natural language and (b) manipulate an open vocabulary of everyday objects. At several levels, MOSAIC employs modularity: it leverages multiple large-scale pre-trained models for high-level tasks like language and image recognition, while using streamlined modules designed for low-level task-specific control. This decomposition allows us to reap the complementary benefits of foundation models and precise, more specialized models, enabling our system to scale to complex tasks that involve coordinating multiple robots and humans. First, we unit-test individual modules with 180 episodes of visuomotor picking, 60 episodes of human motion forecasting, and 46 online user evaluations of the task planner. We then extensively evaluate MOSAIC with 60 end-to-end trials. We discuss crucial design decisions, limitations of the current system, and open challenges in this domain.}
}
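To use this entry in a LaTeX document, save it to a .bib file and cite it by its key; a minimal sketch (the file name refs.bib and the surrounding document are illustrative, not part of the official entry):

```latex
% refs.bib contains the @InProceedings{pmlr-v270-wang25h, ...} entry above
\documentclass{article}
\begin{document}
MOSAIC~\cite{pmlr-v270-wang25h} coordinates multiple robots
for assistive and interactive cooking.
\bibliographystyle{plain}  % or any style the venue requires
\bibliography{refs}        % refs.bib, without the extension
\end{document}
```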
Endnote
%0 Conference Paper
%T MOSAIC: Modular Foundation Models for Assistive and Interactive Cooking
%A Huaxiaoyue Wang
%A Kushal Kedia
%A Juntao Ren
%A Rahma Abdullah
%A Atiksh Bhardwaj
%A Angela Chao
%A Kelly Y Chen
%A Nathaniel Chin
%A Prithwish Dan
%A Xinyi Fan
%A Gonzalo Gonzalez-Pumariega
%A Aditya Kompella
%A Maximus Adrian Pace
%A Yash Sharma
%A Xiangwan Sun
%A Neha Sunkara
%A Sanjiban Choudhury
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-wang25h
%I PMLR
%P 2220--2294
%U https://proceedings.mlr.press/v270/wang25h.html
%V 270
%X We present MOSAIC, a modular architecture for coordinating multiple robots to (a) interact with users using natural language and (b) manipulate an open vocabulary of everyday objects. At several levels, MOSAIC employs modularity: it leverages multiple large-scale pre-trained models for high-level tasks like language and image recognition, while using streamlined modules designed for low-level task-specific control. This decomposition allows us to reap the complementary benefits of foundation models and precise, more specialized models, enabling our system to scale to complex tasks that involve coordinating multiple robots and humans. First, we unit-test individual modules with 180 episodes of visuomotor picking, 60 episodes of human motion forecasting, and 46 online user evaluations of the task planner. We then extensively evaluate MOSAIC with 60 end-to-end trials. We discuss crucial design decisions, limitations of the current system, and open challenges in this domain.
APA
Wang, H., Kedia, K., Ren, J., Abdullah, R., Bhardwaj, A., Chao, A., Chen, K.Y., Chin, N., Dan, P., Fan, X., Gonzalez-Pumariega, G., Kompella, A., Pace, M.A., Sharma, Y., Sun, X., Sunkara, N. & Choudhury, S. (2025). MOSAIC: Modular Foundation Models for Assistive and Interactive Cooking. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:2220-2294. Available from https://proceedings.mlr.press/v270/wang25h.html.