Paying Attention to Video Generation

Rishika Bhagwatkar; Khurshed Fitter; Saketh Bachu; Akshay Kulkarni; Shital Chiddarwar

NeurIPS 2020 Workshop on Pre-registration in Machine Learning, PMLR 148:139-154, 2021.

Abstract

Video generation is a challenging research topic that has been tackled by a variety of methods including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), optical flow and autoregressive models. However, most existing works model the task as image manipulation and learn pixel-level transforms. In contrast, we propose a latent vector manipulation approach using sequential models, particularly the Generative Pre-trained Transformer (GPT). Further, we propose a novel Attention-based Discretized Autoencoder (ADAE) which learns a finite-sized codebook that serves as a basis for latent space representations of frames, to be modelled by the sequential model. To tackle the reduced resolution or the diversity bottleneck caused by the finite codebook, we propose attention-based soft-alignment instead of a hard distance-based choice for sampling the latent vectors. We extensively evaluate the proposed approach on the BAIR Robot Pushing, Sky Time-lapse and Dinosaur Game datasets and compare with state-of-the-art (SOTA) approaches. Upon experimentation, we find that our model suffers from mode collapse owing to a single-vector latent space learned by the ADAE. The cause of this mode collapse is traced back to the peaky attention scores resulting from the codebook (Keys and Values) and the encoder’s output (Query). Through our findings, we highlight the importance of reliable latent space frame representations for successful sequential modelling.
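The contrast the abstract draws between a hard distance-based choice and attention-based soft-alignment can be illustrated with a small sketch. This is not the authors' ADAE implementation: the function names, the toy codebook, the use of scaled dot-product scores and the `temperature` parameter are all assumptions made for illustration. The abstract states only that the codebook supplies the Keys and Values and the encoder's output supplies the Query.

```python
import math

def hard_lookup(query, codebook):
    """Hard choice: return the codebook vector nearest to the query
    (squared Euclidean distance). Hypothetical helper."""
    return min(codebook, key=lambda c: sum((q - x) ** 2 for q, x in zip(query, c)))

def soft_lookup(query, codebook, temperature=1.0):
    """Soft alignment: softmax over scaled dot products between the query
    and each codebook vector (Keys), then a weighted sum of the same
    vectors (Values). The temperature is an assumed knob, not from the paper."""
    d = len(query)
    scores = [
        sum(q * x for q, x in zip(query, c)) / (math.sqrt(d) * temperature)
        for c in codebook
    ]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    latent = [sum(w * c[i] for w, c in zip(weights, codebook)) for i in range(d)]
    return latent, weights

# Toy codebook of three 2-D vectors and one encoder output (both made up).
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.6, 0.4]

nearest = hard_lookup(query, codebook)          # a single codebook entry
latent, weights = soft_lookup(query, codebook)  # a blend of all entries

# With a very low temperature the scores become peaky and the soft lookup
# degenerates to (almost) one codebook vector.
_, peaky = soft_lookup(query, codebook, temperature=0.01)
```

In this toy setting, peaky weights make the soft lookup behave like a hard choice of one vector, which loosely mirrors the failure the abstract attributes to peaky attention scores.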

Cite this Paper

BibTeX


@InProceedings{pmlr-v148-bhagwatkar21a,
  title = 	 {Paying Attention to Video Generation},
  author =       {Bhagwatkar, Rishika and Fitter, Khurshed and Bachu, Saketh and Kulkarni, Akshay and Chiddarwar, Shital},
  booktitle = 	 {NeurIPS 2020 Workshop on Pre-registration in Machine Learning},
  pages = 	 {139--154},
  year = 	 {2021},
  editor = 	 {Bertinetto, Luca and Henriques, João F. and Albanie, Samuel and Paganini, Michela and Varol, Gül},
  volume = 	 {148},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {11 Dec},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v148/bhagwatkar21a/bhagwatkar21a.pdf},
  url = 	 {https://proceedings.mlr.press/v148/bhagwatkar21a.html},
  abstract = 	 {Video generation is a challenging research topic which has been tackled by a variety of methods including Generative Adversarial Networks (GANs), Variational Autoencoders (VAE), optical flow and autoregressive models. However, most of the existing works model the task as image manipulation and learn pixel-level transforms. In contrast, we propose a latent vector manipulation approach using sequential models, particularly the Generative Pre-trained Transformer (GPT). Further, we propose a novel Attention-based Discretized Autoencoder (ADAE) which learns a finite-sized codebook that serves as a basis for latent space representations of frames, to be modelled by the sequential model. To tackle the reduced resolution or the diversity bottleneck caused by the finite codebook, we propose attention-based soft-alignment instead of a hard distance-based choice for sampling the latent vectors. We extensively evaluate the proposed approach on the BAIR Robot Pushing, Sky Time-lapse and Dinosaur Game datasets and compare with state-of-the-art (SOTA) approaches. Upon experimentation, we find that our model suffers mode collapse owing to a single vector latent space learned by the ADAE. The cause for this mode collapse is traced back to the peaky attention scores resulting from the codebook (Keys and Values) and the encoder’s output (Query). Through our findings, we highlight the importance of reliable latent space frame representations for successful sequential modelling.}
}

Endnote

%0 Conference Paper
%T Paying Attention to Video Generation
%A Rishika Bhagwatkar
%A Khurshed Fitter
%A Saketh Bachu
%A Akshay Kulkarni
%A Shital Chiddarwar
%B NeurIPS 2020 Workshop on Pre-registration in Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Luca Bertinetto
%E João F. Henriques
%E Samuel Albanie
%E Michela Paganini
%E Gül Varol
%F pmlr-v148-bhagwatkar21a
%I PMLR
%P 139-154
%U https://proceedings.mlr.press/v148/bhagwatkar21a.html
%V 148
%X Video generation is a challenging research topic which has been tackled by a variety of methods including Generative Adversarial Networks (GANs), Variational Autoencoders (VAE), optical flow and autoregressive models. However, most of the existing works model the task as image manipulation and learn pixel-level transforms. In contrast, we propose a latent vector manipulation approach using sequential models, particularly the Generative Pre-trained Transformer (GPT). Further, we propose a novel Attention-based Discretized Autoencoder (ADAE) which learns a finite-sized codebook that serves as a basis for latent space representations of frames, to be modelled by the sequential model. To tackle the reduced resolution or the diversity bottleneck caused by the finite codebook, we propose attention-based soft-alignment instead of a hard distance-based choice for sampling the latent vectors. We extensively evaluate the proposed approach on the BAIR Robot Pushing, Sky Time-lapse and Dinosaur Game datasets and compare with state-of-the-art (SOTA) approaches. Upon experimentation, we find that our model suffers mode collapse owing to a single vector latent space learned by the ADAE. The cause for this mode collapse is traced back to the peaky attention scores resulting from the codebook (Keys and Values) and the encoder’s output (Query). Through our findings, we highlight the importance of reliable latent space frame representations for successful sequential modelling.

APA


Bhagwatkar, R., Fitter, K., Bachu, S., Kulkarni, A., & Chiddarwar, S. (2021). Paying Attention to Video Generation. NeurIPS 2020 Workshop on Pre-registration in Machine Learning, in Proceedings of Machine Learning Research 148:139-154. Available from https://proceedings.mlr.press/v148/bhagwatkar21a.html.
