Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, He-Yang Xu, Yazhou Yao, Errui Ding
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:54515-54558, 2025.

Abstract

Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we rethink the reality that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), and conjecture it is a key barrier that hampers zero-shot task generalization. Our hypothesis is that without truly understanding previously-seen tasks—due to these terminological definitions—deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input $\to$ explanatory instruction $\to$ output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously-seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be open-sourced.
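
As a rough illustration of the "image input $\to$ explanatory instruction $\to$ output" triplet format described in the abstract, a minimal Python sketch is given below. The field names, the example instruction, and the file paths are illustrative assumptions for exposition only; they are not taken from the paper's released dataset or code.

from dataclasses import dataclass

@dataclass
class ExplanatoryTriplet:
    """One "image input -> explanatory instruction -> output" training example.

    Field names are hypothetical; the open-sourced dataset may use different
    keys and may store outputs as images or text depending on the task.
    """
    input_image: str   # path (or URL) of the source image
    instruction: str   # detailed linguistic description of the transformation
    output: str        # path of the target image (or text) the model should produce

# Hypothetical example: the objective is phrased as an explanatory transformation
# rather than the terminological task name "image segmentation".
example = ExplanatoryTriplet(
    input_image="images/street_scene.jpg",
    instruction=(
        "Paint every pixel that belongs to a car in red and every pixel that "
        "belongs to the road surface in gray, keeping all other regions of the "
        "photograph unchanged."
    ),
    output="targets/street_scene_painted.png",
)

Under this reading, the AR-based VLM would take the input image together with the explanatory instruction as its conditioning context and be trained to autoregressively generate the corresponding output.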

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-shen25h,
  title     = {Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization},
  author    = {Shen, Yang and Wei, Xiu-Shen and Sun, Yifan and Song, Yuxin and Yuan, Tao and Jin, Jian and Xu, He-Yang and Yao, Yazhou and Ding, Errui},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {54515--54558},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/shen25h/shen25h.pdf},
  url       = {https://proceedings.mlr.press/v267/shen25h.html},
  abstract  = {Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we rethink the reality that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), and conjecture it is a key barrier that hampers zero-shot task generalization. Our hypothesis is that without truly understanding previously-seen tasks—due to these terminological definitions—deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input $\to$ explanatory instruction $\to$ output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously-seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be open-sourced.}
}
Endnote
%0 Conference Paper
%T Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
%A Yang Shen
%A Xiu-Shen Wei
%A Yifan Sun
%A Yuxin Song
%A Tao Yuan
%A Jian Jin
%A He-Yang Xu
%A Yazhou Yao
%A Errui Ding
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-shen25h
%I PMLR
%P 54515--54558
%U https://proceedings.mlr.press/v267/shen25h.html
%V 267
%X Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we rethink the reality that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), and conjecture it is a key barrier that hampers zero-shot task generalization. Our hypothesis is that without truly understanding previously-seen tasks—due to these terminological definitions—deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input $\to$ explanatory instruction $\to$ output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously-seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be open-sourced.
APA
Shen, Y., Wei, X., Sun, Y., Song, Y., Yuan, T., Jin, J., Xu, H., Yao, Y. & Ding, E. (2025). Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:54515-54558. Available from https://proceedings.mlr.press/v267/shen25h.html.