Learning End-to-end Video Classification with Rank-Pooling

Basura Fernando, Stephen Gould
Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:1187-1196, 2016.

Abstract

We introduce a new model for representation learning and classification of video sequences. Our model is based on a convolutional neural network coupled with a novel temporal pooling layer. The temporal pooling layer relies on an inner-optimization problem to efficiently encode temporal semantics over arbitrarily long video clips into a fixed-length vector representation. Importantly, the representation and classification parameters of our model can be estimated jointly in an end-to-end manner by formulating learning as a bilevel optimization problem. Furthermore, the model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. We demonstrate our approach on action and activity recognition tasks.
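To illustrate how a temporal pooling layer can map clips of any length to a fixed-length vector by solving an inner optimization problem, here is a minimal sketch. It is not the authors' implementation: the function name `rank_pool`, the time-varying-mean smoothing, and the ridge-regression objective (regressing the frame index onto smoothed frame features) are assumptions chosen as a simple stand-in for the inner problem, and the per-frame features are taken to be precomputed CNN activations.

```python
import numpy as np

def rank_pool(frames, reg=1.0):
    """Pool a (T, D) sequence of per-frame features into one (D,) vector.

    Hypothetical helper: solves a ridge regression that asks the score
    u . v_t to follow the frame index t, so u captures temporal order.
    """
    frames = np.asarray(frames, dtype=float)
    num_frames, dim = frames.shape
    t = np.arange(1, num_frames + 1, dtype=float)

    # Smooth the sequence with a running (time-varying) mean.
    smoothed = np.cumsum(frames, axis=0) / t[:, None]

    # Inner problem (assumed form): min_u ||smoothed @ u - t||^2 + reg * ||u||^2
    gram = smoothed.T @ smoothed + reg * np.eye(dim)
    return np.linalg.solve(gram, smoothed.T @ t)

# Clips of different lengths yield vectors of the same size.
rng = np.random.default_rng(0)
short_clip = rng.normal(size=(5, 8))
long_clip = rng.normal(size=(50, 8))
u_short = rank_pool(short_clip)
u_long = rank_pool(long_clip)
u_reversed = rank_pool(long_clip[::-1])
```

Because the inner problem here has a closed-form solution, the pooled vector is a differentiable function of the frame features, which is the kind of property that makes joint, end-to-end estimation of the representation and classifier parameters possible in a bilevel formulation.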

Cite this Paper


BibTeX
@InProceedings{pmlr-v48-fernando16,
  title = {Learning End-to-end Video Classification with Rank-Pooling},
  author = {Basura Fernando and Stephen Gould},
  booktitle = {Proceedings of The 33rd International Conference on Machine Learning},
  pages = {1187--1196},
  year = {2016},
  editor = {Maria Florina Balcan and Kilian Q. Weinberger},
  volume = {48},
  series = {Proceedings of Machine Learning Research},
  address = {New York, New York, USA},
  month = {20--22 Jun},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v48/fernando16.pdf},
  url = {http://proceedings.mlr.press/v48/fernando16.html},
  abstract = {We introduce a new model for representation learning and classification of video sequences. Our model is based on a convolutional neural network coupled with a novel temporal pooling layer. The temporal pooling layer relies on an inner-optimization problem to efficiently encode temporal semantics over arbitrarily long video clips into a fixed-length vector representation. Importantly, the representation and classification parameters of our model can be estimated jointly in an end-to-end manner by formulating learning as a bilevel optimization problem. Furthermore, the model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. We demonstrate our approach on action and activity recognition tasks.}
}
Endnote
%0 Conference Paper
%T Learning End-to-end Video Classification with Rank-Pooling
%A Basura Fernando
%A Stephen Gould
%B Proceedings of The 33rd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2016
%E Maria Florina Balcan
%E Kilian Q. Weinberger
%F pmlr-v48-fernando16
%I PMLR
%J Proceedings of Machine Learning Research
%P 1187--1196
%U http://proceedings.mlr.press
%V 48
%W PMLR
%X We introduce a new model for representation learning and classification of video sequences. Our model is based on a convolutional neural network coupled with a novel temporal pooling layer. The temporal pooling layer relies on an inner-optimization problem to efficiently encode temporal semantics over arbitrarily long video clips into a fixed-length vector representation. Importantly, the representation and classification parameters of our model can be estimated jointly in an end-to-end manner by formulating learning as a bilevel optimization problem. Furthermore, the model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. We demonstrate our approach on action and activity recognition tasks.
RIS
TY - CPAPER
TI - Learning End-to-end Video Classification with Rank-Pooling
AU - Basura Fernando
AU - Stephen Gould
BT - Proceedings of The 33rd International Conference on Machine Learning
PY - 2016/06/11
DA - 2016/06/11
ED - Maria Florina Balcan
ED - Kilian Q. Weinberger
ID - pmlr-v48-fernando16
PB - PMLR
SP - 1187
DP - PMLR
EP - 1196
L1 - http://proceedings.mlr.press/v48/fernando16.pdf
UR - http://proceedings.mlr.press/v48/fernando16.html
AB - We introduce a new model for representation learning and classification of video sequences. Our model is based on a convolutional neural network coupled with a novel temporal pooling layer. The temporal pooling layer relies on an inner-optimization problem to efficiently encode temporal semantics over arbitrarily long video clips into a fixed-length vector representation. Importantly, the representation and classification parameters of our model can be estimated jointly in an end-to-end manner by formulating learning as a bilevel optimization problem. Furthermore, the model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or introduction of additional parameters. We demonstrate our approach on action and activity recognition tasks.
ER -
APA
Fernando, B. & Gould, S. (2016). Learning End-to-end Video Classification with Rank-Pooling. Proceedings of The 33rd International Conference on Machine Learning, in PMLR 48:1187-1196.
