Unsupervised Learning of Video Representations using LSTMs

Nitish Srivastava; Elman Mansimov; Ruslan Salakhutdinov

Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:843-852, 2015.

Abstract

We use Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence. We experiment with two kinds of input sequences – patches of image pixels and high-level representations (“percepts”) of video frames extracted using a pretrained convolutional net. We explore different design choices such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We further evaluate the representations by finetuning them for a supervised learning problem – human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.
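The encoder/decoder arrangement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the weights are random and untrained, the sizes are arbitrary, and every name here (`LSTMCell`, `encode`, `decode`, `make_readout`, the zero input fed to the unconditioned decoder) is a hypothetical choice made for illustration. It only shows how a variable-length sequence is reduced to a fixed-length state, and how two decoders can start from that state, one reconstructing the input and one predicting future frames, with or without conditioning on the previously generated output.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """A plain LSTM cell with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # Four gates (input, forget, output, candidate) stacked row-wise;
        # the last column of each row acts as the bias.
        self.w = [[rng.uniform(-0.1, 0.1)
                   for _ in range(input_size + hidden_size + 1)]
                  for _ in range(4 * hidden_size)]

    def step(self, x, h, c):
        z = x + h + [1.0]  # concatenate input, previous hidden state, bias term
        pre = [sum(wi * zi for wi, zi in zip(row, z)) for row in self.w]
        n = self.hidden_size
        i = [sigmoid(v) for v in pre[0:n]]
        f = [sigmoid(v) for v in pre[n:2 * n]]
        o = [sigmoid(v) for v in pre[2 * n:3 * n]]
        g = [math.tanh(v) for v in pre[3 * n:4 * n]]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def make_readout(frame_size, hidden_size, rng):
    """Linear map from a decoder hidden state to one output frame."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
            for _ in range(frame_size)]


def encode(cell, frames):
    """Run the encoder over all frames; the final (h, c) is the representation."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for x in frames:
        h, c = cell.step(x, h, c)
    return h, c


def decode(cell, readout, state, steps, frame_size, conditioned):
    """Unroll a decoder from the encoder state and emit `steps` frames."""
    h, c = state
    prev = [0.0] * frame_size
    outputs = []
    for _ in range(steps):
        # A conditioned decoder sees its own previous output; an
        # unconditioned one receives a zero input (an assumption of this sketch).
        x = prev if conditioned else [0.0] * frame_size
        h, c = cell.step(x, h, c)
        y = [sum(w * hj for w, hj in zip(row, h)) for row in readout]
        outputs.append(y)
        prev = y
    return outputs


rng = random.Random(0)
frame_size, hidden_size, seq_len = 6, 8, 5
frames = [[rng.random() for _ in range(frame_size)] for _ in range(seq_len)]

encoder = LSTMCell(frame_size, hidden_size, rng)
recon_decoder = LSTMCell(frame_size, hidden_size, rng)
future_decoder = LSTMCell(frame_size, hidden_size, rng)
recon_readout = make_readout(frame_size, hidden_size, rng)
future_readout = make_readout(frame_size, hidden_size, rng)

state = encode(encoder, frames)
reconstruction = decode(recon_decoder, recon_readout, state,
                        seq_len, frame_size, conditioned=False)
future = decode(future_decoder, future_readout, state,
                3, frame_size, conditioned=True)
```

The state returned by `encode` has the same size regardless of how many frames are fed in, which is the property the abstract relies on when the representation is later reused for classification.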

Cite this Paper

BibTeX


@InProceedings{pmlr-v37-srivastava15,
  title = 	 {Unsupervised Learning of Video Representations using LSTMs},
  author = 	 {Srivastava, Nitish and Mansimov, Elman and Salakhutdinov, Ruslan},
  booktitle = 	 {Proceedings of the 32nd International Conference on Machine Learning},
  pages = 	 {843--852},
  year = 	 {2015},
  editor = 	 {Bach, Francis and Blei, David},
  volume = 	 {37},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Lille, France},
  month = 	 {07--09 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v37/srivastava15.pdf},
  url = 	 {https://proceedings.mlr.press/v37/srivastava15.html},
  abstract = 	 {We use Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence. We experiment with two kinds of input sequences – patches of image pixels and high-level representations (“percepts”) of video frames extracted using a pretrained convolutional net. We explore different design choices such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We further evaluate the representations by finetuning them for a supervised learning problem – human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.}
}

Endnote

%0 Conference Paper
%T Unsupervised Learning of Video Representations using LSTMs
%A Nitish Srivastava
%A Elman Mansimov
%A Ruslan Salakhutdinov
%B Proceedings of the 32nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2015
%E Francis Bach
%E David Blei
%F pmlr-v37-srivastava15
%I PMLR
%P 843--852
%U https://proceedings.mlr.press/v37/srivastava15.html
%V 37
%X We use Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence. We experiment with two kinds of input sequences – patches of image pixels and high-level representations (“percepts”) of video frames extracted using a pretrained convolutional net. We explore different design choices such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We further evaluate the representations by finetuning them for a supervised learning problem – human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.

RIS


TY  - CPAPER
TI  - Unsupervised Learning of Video Representations using LSTMs
AU  - Nitish Srivastava
AU  - Elman Mansimov
AU  - Ruslan Salakhutdinov
BT  - Proceedings of the 32nd International Conference on Machine Learning
DA  - 2015/06/01
ED  - Francis Bach
ED  - David Blei
ID  - pmlr-v37-srivastava15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 37
SP  - 843
EP  - 852
L1  - http://proceedings.mlr.press/v37/srivastava15.pdf
UR  - https://proceedings.mlr.press/v37/srivastava15.html
AB  - We use Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence. We experiment with two kinds of input sequences – patches of image pixels and high-level representations (“percepts”) of video frames extracted using a pretrained convolutional net. We explore different design choices such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We further evaluate the representations by finetuning them for a supervised learning problem – human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.
ER  -

APA


Srivastava, N., Mansimov, E. & Salakhutdinov, R. (2015). Unsupervised Learning of Video Representations using LSTMs. Proceedings of the 32nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 37:843-852. Available from https://proceedings.mlr.press/v37/srivastava15.html.
