VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk; Lijun Yu; Xiuye Gu; Jose Lezama; Jonathan Huang; Grant Schindler; Rachel Hornung; Vighnesh Birodkar; Jimmy Yan; Ming-Chang Chiu; Krishna Somandepalli; Hassan Akbari; Yair Alon; Yong Cheng; Joshua V. Dillon; Agrim Gupta; Meera Hahn; Anja Hauth; David Hendon; Alonso Martinez; David Minnen; Mikhail Sirotenko; Kihyuk Sohn; Xuan Yang; Hartwig Adam; Ming-Hsuan Yang; Irfan Essa; Huisheng Wang; David A Ross; Bryan Seybold; Lu Jiang

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:25105-25124, 2024.

Abstract

We present VideoPoet, a language model capable of synthesizing high-quality video from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs – including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model’s state-of-the-art capabilities in zero-shot video generation, specifically highlighting the ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-kondratyuk24a,
  title     = {{V}ideo{P}oet: A Large Language Model for Zero-Shot Video Generation},
  author    = {Kondratyuk, Dan and Yu, Lijun and Gu, Xiuye and Lezama, Jose and Huang, Jonathan and Schindler, Grant and Hornung, Rachel and Birodkar, Vighnesh and Yan, Jimmy and Chiu, Ming-Chang and Somandepalli, Krishna and Akbari, Hassan and Alon, Yair and Cheng, Yong and Dillon, Joshua V. and Gupta, Agrim and Hahn, Meera and Hauth, Anja and Hendon, David and Martinez, Alonso and Minnen, David and Sirotenko, Mikhail and Sohn, Kihyuk and Yang, Xuan and Adam, Hartwig and Yang, Ming-Hsuan and Essa, Irfan and Wang, Huisheng and Ross, David A and Seybold, Bryan and Jiang, Lu},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {25105--25124},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/kondratyuk24a/kondratyuk24a.pdf},
  url       = {https://proceedings.mlr.press/v235/kondratyuk24a.html},
  abstract  = {We present VideoPoet, a language model capable of synthesizing high-quality video from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs – including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model’s state-of-the-art capabilities in zero-shot video generation, specifically highlighting the ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/}
}

Endnote

%0 Conference Paper
%T VideoPoet: A Large Language Model for Zero-Shot Video Generation
%A Dan Kondratyuk
%A Lijun Yu
%A Xiuye Gu
%A Jose Lezama
%A Jonathan Huang
%A Grant Schindler
%A Rachel Hornung
%A Vighnesh Birodkar
%A Jimmy Yan
%A Ming-Chang Chiu
%A Krishna Somandepalli
%A Hassan Akbari
%A Yair Alon
%A Yong Cheng
%A Joshua V. Dillon
%A Agrim Gupta
%A Meera Hahn
%A Anja Hauth
%A David Hendon
%A Alonso Martinez
%A David Minnen
%A Mikhail Sirotenko
%A Kihyuk Sohn
%A Xuan Yang
%A Hartwig Adam
%A Ming-Hsuan Yang
%A Irfan Essa
%A Huisheng Wang
%A David A Ross
%A Bryan Seybold
%A Lu Jiang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-kondratyuk24a
%I PMLR
%P 25105-25124
%U https://proceedings.mlr.press/v235/kondratyuk24a.html
%V 235
%X We present VideoPoet, a language model capable of synthesizing high-quality video from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs – including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model’s state-of-the-art capabilities in zero-shot video generation, specifically highlighting the ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/

APA


Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.-C., Somandepalli, K., Akbari, H., Alon, Y., Cheng, Y., Dillon, J. V., Gupta, A., Hahn, M., Hauth, A., Hendon, D., Martinez, A., Minnen, D., Sirotenko, M., Sohn, K., Yang, X., Adam, H., Yang, M.-H., Essa, I., Wang, H., Ross, D. A., Seybold, B. & Jiang, L. (2024). VideoPoet: A Large Language Model for Zero-Shot Video Generation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:25105-25124. Available from https://proceedings.mlr.press/v235/kondratyuk24a.html
