LLMs can see and hear without any training

Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:1762-1776, 2025.

Abstract

We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging LLMs' innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
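A minimal sketch of the generate-score-feedback loop the abstract describes, not the authors' implementation: `generate_candidates` (an LLM prompted with the best-scored texts so far) and `score` (e.g., a CLIP-style image-text similarity for captioning) are hypothetical stand-ins, and the buffer size and step count are illustrative, not the paper's settings.

```python
from typing import Callable

def mils_loop(
    generate_candidates: Callable[[list[tuple[str, float]]], list[str]],
    score: Callable[[str], float],
    n_steps: int = 10,
    buffer_size: int = 20,
) -> str:
    """Iteratively refine text candidates using scorer feedback."""
    history: list[tuple[str, float]] = []
    for _ in range(n_steps):
        # 1. Prompt the LLM, conditioning on the best-scored candidates so far.
        candidates = generate_candidates(history)
        # 2. Score each new candidate (e.g., similarity to the target image).
        scored = [(c, score(c)) for c in candidates]
        # 3. Keep the top-scoring texts and feed them back to the next step.
        history = sorted(history + scored, key=lambda cs: cs[1], reverse=True)
        history = history[:buffer_size]
    return history[0][0]  # highest-scoring text after the final step
```

Because only scalar scores flow back to the generator, the loop is gradient-free, which is what lets the same procedure invert multimodal embeddings into text.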

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-ashutosh25a,
  title     = {{LLM}s can see and hear without any training},
  author    = {Ashutosh, Kumar and Gandelsman, Yossi and Chen, Xinlei and Misra, Ishan and Girdhar, Rohit},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {1762--1776},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/ashutosh25a/ashutosh25a.pdf},
  url       = {https://proceedings.mlr.press/v267/ashutosh25a.html},
  abstract  = {We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging LLMs' innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.}
}
Endnote
%0 Conference Paper
%T LLMs can see and hear without any training
%A Kumar Ashutosh
%A Yossi Gandelsman
%A Xinlei Chen
%A Ishan Misra
%A Rohit Girdhar
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-ashutosh25a
%I PMLR
%P 1762--1776
%U https://proceedings.mlr.press/v267/ashutosh25a.html
%V 267
%X We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging LLMs' innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
APA
Ashutosh, K., Gandelsman, Y., Chen, X., Misra, I. & Girdhar, R. (2025). LLMs can see and hear without any training. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:1762-1776. Available from https://proceedings.mlr.press/v267/ashutosh25a.html.
