Decomposing camera and object motion for an improved video sequence prediction

Meenakshi Sarkar, Debasish Ghose, Aniruddha Bala
NeurIPS 2020 Workshop on Pre-registration in Machine Learning, PMLR 148:358-374, 2021.

Abstract

We propose a novel deep learning framework that decomposes the motion, or flow, of pixels from the background for improved, longer-horizon prediction of video sequences. The framework generates multi-timestep pixel-level predictions and is trained to learn the temporal and spatial dependencies encoded in the video data separately. The proposed framework, called the Velocity Acceleration Network (VANet), can predict long-term video frames both in static scenarios, where the camera is stationary, and in dynamic, partially observable cases, where the camera is mounted on a moving platform (a car or robot). It decomposes the flow of the image sequences into velocity and acceleration maps and learns the temporal transformations using a convolutional LSTM network. Our detailed empirical study on three datasets (BAIR, KTH and KITTI) shows that conditioning recurrent networks such as LSTMs on higher-order optical flow maps improves their inference capabilities for video.
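The velocity/acceleration decomposition the abstract refers to can be made concrete with a short sketch. The snippet below is illustrative only and is not the authors' implementation: it assumes the velocity maps can be approximated by Farneback dense optical flow between consecutive frames and the acceleration maps by first differences of those flows; the helper name flow_decomposition is hypothetical, and the ConvLSTM that VANet conditions on these maps is omitted.

import cv2
import numpy as np

def flow_decomposition(frames):
    # Hypothetical helper (not from the paper). Velocity maps: dense optical
    # flow between consecutive frames. Acceleration maps: first differences
    # of consecutive velocity maps.
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    velocities = [
        cv2.calcOpticalFlowFarneback(
            grays[t], grays[t + 1], None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        for t in range(len(grays) - 1)
    ]
    accelerations = [v1 - v0 for v0, v1 in zip(velocities, velocities[1:])]
    return velocities, accelerations

# Toy usage: four 64x64 BGR frames yield 3 velocity and 2 acceleration maps.
frames = [np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8) for _ in range(4)]
vel, acc = flow_decomposition(frames)

In the paper, maps of this kind (stacked over time) are the inputs that condition the recurrent ConvLSTM predictor; this sketch only produces those inputs.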

Cite this Paper

BibTeX
@InProceedings{pmlr-v148-sarkar21a,
  title     = {Decomposing camera and object motion for an improved video sequence prediction},
  author    = {Sarkar, Meenakshi and Ghose, Debasish and Bala, Aniruddha},
  booktitle = {NeurIPS 2020 Workshop on Pre-registration in Machine Learning},
  pages     = {358--374},
  year      = {2021},
  editor    = {Bertinetto, Luca and Henriques, João F. and Albanie, Samuel and Paganini, Michela and Varol, Gül},
  volume    = {148},
  series    = {Proceedings of Machine Learning Research},
  month     = {11 Dec},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v148/sarkar21a/sarkar21a.pdf},
  url       = {https://proceedings.mlr.press/v148/sarkar21a.html}
}
Endnote
%0 Conference Paper
%T Decomposing camera and object motion for an improved video sequence prediction
%A Meenakshi Sarkar
%A Debasish Ghose
%A Aniruddha Bala
%B NeurIPS 2020 Workshop on Pre-registration in Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Luca Bertinetto
%E João F. Henriques
%E Samuel Albanie
%E Michela Paganini
%E Gül Varol
%F pmlr-v148-sarkar21a
%I PMLR
%P 358--374
%U https://proceedings.mlr.press/v148/sarkar21a.html
%V 148
APA
Sarkar, M., Ghose, D., & Bala, A. (2021). Decomposing camera and object motion for an improved video sequence prediction. NeurIPS 2020 Workshop on Pre-registration in Machine Learning, in Proceedings of Machine Learning Research 148:358-374. Available from https://proceedings.mlr.press/v148/sarkar21a.html.
