Learning End-to-end Video Classification with Rank-Pooling

Basura Fernando; Stephen Gould

Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:1187-1196, 2016.

Abstract

We introduce a new model for representation learning and classification of video sequences. Our model is based on a convolutional neural network coupled with a novel temporal pooling layer. The temporal pooling layer relies on an inner-optimization problem to efficiently encode temporal semantics over arbitrarily long video clips into a fixed-length vector representation. Importantly, the representation and classification parameters of our model can be estimated jointly in an end-to-end manner by formulating learning as a bilevel optimization problem. Furthermore, the model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. We demonstrate our approach on action and activity recognition tasks.
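The abstract's idea of a temporal pooling layer that solves an inner optimization problem can be illustrated with a small sketch. The code below is only an assumption-laden illustration, not the paper's implementation: it uses a ridge-regression form of rank-pooling, fitting a vector `u` whose projection of each frame's features predicts that frame's time index. The function name `rank_pool`, the regularizer `lam`, and the random "CNN features" are all hypothetical. The bilevel training step (differentiating through this inner solve) is not shown.

```python
import numpy as np

def rank_pool(frames, lam=1.0):
    """Encode a (T, D) sequence of frame features as one D-dimensional vector.

    Hypothetical sketch: solves the inner problem
        u* = argmin_u ||frames @ u - t||^2 + lam * ||u||^2,  t = (1, ..., T),
    so that frame scores frames @ u* tend to increase with time.
    """
    frames = np.asarray(frames, dtype=float)
    T, D = frames.shape
    t = np.arange(1, T + 1, dtype=float)      # target: frame order
    A = frames.T @ frames + lam * np.eye(D)   # regularized normal equations
    b = frames.T @ t
    return np.linalg.solve(A, b)              # closed-form minimizer

# Clips of different lengths map to vectors of the same length.
rng = np.random.default_rng(0)
short_clip = rng.normal(size=(12, 8))    # stand-in for per-frame CNN features
long_clip = rng.normal(size=(300, 8))
u_short = rank_pool(short_clip)
u_long = rank_pool(long_clip)
print(u_short.shape, u_long.shape)
```

The output length depends only on the feature dimension `D`, not on the number of frames `T`, which is the sense in which arbitrarily long clips are encoded into a fixed-length representation.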

Cite this Paper

BibTeX


@InProceedings{pmlr-v48-fernando16,
  title = {Learning End-to-end Video Classification with Rank-Pooling},
  author = {Fernando, Basura and Gould, Stephen},
  booktitle = {Proceedings of The 33rd International Conference on Machine Learning},
  pages = {1187--1196},
  year = {2016},
  editor = {Balcan, Maria Florina and Weinberger, Kilian Q.},
  volume = {48},
  series = {Proceedings of Machine Learning Research},
  address = {New York, New York, USA},
  month = {20--22 Jun},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v48/fernando16.pdf},
  url = {https://proceedings.mlr.press/v48/fernando16.html},
  abstract = {We introduce a new model for representation learning and classification of video sequences. Our model is based on a convolutional neural network coupled with a novel temporal pooling layer. The temporal pooling layer relies on an inner-optimization problem to efficiently encode temporal semantics over arbitrarily long video clips into a fixed-length vector representation. Importantly, the representation and classification parameters of our model can be estimated jointly in an end-to-end manner by formulating learning as a bilevel optimization problem. Furthermore, the model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. We demonstrate our approach on action and activity recognition tasks.}
}

Endnote

%0 Conference Paper
%T Learning End-to-end Video Classification with Rank-Pooling
%A Basura Fernando
%A Stephen Gould
%B Proceedings of The 33rd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2016
%E Maria Florina Balcan
%E Kilian Q. Weinberger
%F pmlr-v48-fernando16
%I PMLR
%P 1187--1196
%U https://proceedings.mlr.press/v48/fernando16.html
%V 48
%X We introduce a new model for representation learning and classification of video sequences. Our model is based on a convolutional neural network coupled with a novel temporal pooling layer. The temporal pooling layer relies on an inner-optimization problem to efficiently encode temporal semantics over arbitrarily long video clips into a fixed-length vector representation. Importantly, the representation and classification parameters of our model can be estimated jointly in an end-to-end manner by formulating learning as a bilevel optimization problem. Furthermore, the model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. We demonstrate our approach on action and activity recognition tasks.

RIS


TY  - CPAPER
TI  - Learning End-to-end Video Classification with Rank-Pooling
AU  - Basura Fernando
AU  - Stephen Gould
BT  - Proceedings of The 33rd International Conference on Machine Learning
DA  - 2016/06/11
ED  - Maria Florina Balcan
ED  - Kilian Q. Weinberger
ID  - pmlr-v48-fernando16
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 48
SP  - 1187
EP  - 1196
L1  - http://proceedings.mlr.press/v48/fernando16.pdf
UR  - https://proceedings.mlr.press/v48/fernando16.html
AB  - We introduce a new model for representation learning and classification of video sequences. Our model is based on a convolutional neural network coupled with a novel temporal pooling layer. The temporal pooling layer relies on an inner-optimization problem to efficiently encode temporal semantics over arbitrarily long video clips into a fixed-length vector representation. Importantly, the representation and classification parameters of our model can be estimated jointly in an end-to-end manner by formulating learning as a bilevel optimization problem. Furthermore, the model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. We demonstrate our approach on action and activity recognition tasks.
ER  -

APA


Fernando, B. & Gould, S. (2016). Learning End-to-end Video Classification with Rank-Pooling. Proceedings of The 33rd International Conference on Machine Learning, in Proceedings of Machine Learning Research 48:1187-1196. Available from https://proceedings.mlr.press/v48/fernando16.html.
