Learning End-to-end Video Classification with Rank-Pooling

Basura Fernando, Stephen Gould
Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:1187-1196, 2016.

Abstract

We introduce a new model for representation learning and classification of video sequences. Our model is based on a convolutional neural network coupled with a novel temporal pooling layer. The temporal pooling layer relies on an inner-optimization problem to efficiently encode temporal semantics over arbitrarily long video clips into a fixed-length vector representation. Importantly, the representation and classification parameters of our model can be estimated jointly in an end-to-end manner by formulating learning as a bilevel optimization problem. Furthermore, the model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. We demonstrate our approach on action and activity recognition tasks.

Cite this Paper


BibTeX
@InProceedings{pmlr-v48-fernando16,
  title = {Learning End-to-end Video Classification with Rank-Pooling},
  author = {Fernando, Basura and Gould, Stephen},
  booktitle = {Proceedings of The 33rd International Conference on Machine Learning},
  pages = {1187--1196},
  year = {2016},
  editor = {Balcan, Maria Florina and Weinberger, Kilian Q.},
  volume = {48},
  series = {Proceedings of Machine Learning Research},
  address = {New York, New York, USA},
  month = {20--22 Jun},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v48/fernando16.pdf},
  url = {http://proceedings.mlr.press/v48/fernando16.html},
  abstract = {We introduce a new model for representation learning and classification of video sequences. Our model is based on a convolutional neural network coupled with a novel temporal pooling layer. The temporal pooling layer relies on an inner-optimization problem to efficiently encode temporal semantics over arbitrarily long video clips into a fixed-length vector representation. Importantly, the representation and classification parameters of our model can be estimated jointly in an end-to-end manner by formulating learning as a bilevel optimization problem. Furthermore, the model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. We demonstrate our approach on action and activity recognition tasks.}
}
Endnote
%0 Conference Paper
%T Learning End-to-end Video Classification with Rank-Pooling
%A Basura Fernando
%A Stephen Gould
%B Proceedings of The 33rd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2016
%E Maria Florina Balcan
%E Kilian Q. Weinberger
%F pmlr-v48-fernando16
%I PMLR
%P 1187--1196
%U http://proceedings.mlr.press/v48/fernando16.html
%V 48
%X We introduce a new model for representation learning and classification of video sequences. Our model is based on a convolutional neural network coupled with a novel temporal pooling layer. The temporal pooling layer relies on an inner-optimization problem to efficiently encode temporal semantics over arbitrarily long video clips into a fixed-length vector representation. Importantly, the representation and classification parameters of our model can be estimated jointly in an end-to-end manner by formulating learning as a bilevel optimization problem. Furthermore, the model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. We demonstrate our approach on action and activity recognition tasks.
RIS
TY  - CPAPER
TI  - Learning End-to-end Video Classification with Rank-Pooling
AU  - Basura Fernando
AU  - Stephen Gould
BT  - Proceedings of The 33rd International Conference on Machine Learning
DA  - 2016/06/11
ED  - Maria Florina Balcan
ED  - Kilian Q. Weinberger
ID  - pmlr-v48-fernando16
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 48
SP  - 1187
EP  - 1196
L1  - http://proceedings.mlr.press/v48/fernando16.pdf
UR  - http://proceedings.mlr.press/v48/fernando16.html
AB  - We introduce a new model for representation learning and classification of video sequences. Our model is based on a convolutional neural network coupled with a novel temporal pooling layer. The temporal pooling layer relies on an inner-optimization problem to efficiently encode temporal semantics over arbitrarily long video clips into a fixed-length vector representation. Importantly, the representation and classification parameters of our model can be estimated jointly in an end-to-end manner by formulating learning as a bilevel optimization problem. Furthermore, the model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. We demonstrate our approach on action and activity recognition tasks.
ER  -
APA
Fernando, B. & Gould, S. (2016). Learning End-to-end Video Classification with Rank-Pooling. Proceedings of The 33rd International Conference on Machine Learning, in Proceedings of Machine Learning Research 48:1187-1196. Available from http://proceedings.mlr.press/v48/fernando16.html.
