Decomposing camera and object motion for an improved video sequence prediction
NeurIPS 2020 Workshop on Pre-registration in Machine Learning, PMLR 148:358-374, 2021.
Abstract
We propose a novel deep learning framework that decomposes the motion, or flow, of pixels from the background for improved, longer-horizon prediction of video sequences. The framework generates multi-timestep, pixel-level predictions and is trained to learn the temporal and spatial dependencies encoded in the video data separately. The proposed framework, called the Velocity Acceleration Network (VANet), can predict long-term video frames in static scenarios, where the camera is stationary, as well as in dynamic, partially observable cases, where the camera is mounted on a moving platform (a car or robot). The framework decomposes the flow of the image sequences into velocity and acceleration maps and learns the temporal transformations with a convolutional LSTM network. Our detailed empirical study on three datasets (BAIR, KTH, and KITTI) shows that conditioning recurrent networks such as LSTMs on higher-order optical flow maps yields improved inference capabilities for video.
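To make the decomposition concrete, the PyTorch sketch below approximates velocity maps as first-order temporal differences of the frames and acceleration maps as differences of velocities, then runs them through a standard convolutional LSTM cell to model the temporal transformations. This is a minimal illustration, not the authors' implementation: VANet computes these maps from optical flow and uses its own architecture, and all names here (ConvLSTMCell, decompose_motion) are hypothetical.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A standard convolutional LSTM cell (Shi et al., 2015)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four gate pre-activations.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

def decompose_motion(frames):
    """Approximate velocity and acceleration maps with first- and
    second-order temporal differences (a stand-in for optical flow).
    frames: (B, T, C, H, W)."""
    velocity = frames[:, 1:] - frames[:, :-1]          # (B, T-1, C, H, W)
    acceleration = velocity[:, 1:] - velocity[:, :-1]  # (B, T-2, C, H, W)
    return velocity, acceleration

# Toy usage: condition a ConvLSTM on the higher-order motion maps.
B, T, C, H, W = 2, 5, 3, 64, 64
frames = torch.rand(B, T, C, H, W)
vel, acc = decompose_motion(frames)
cell = ConvLSTMCell(in_ch=C, hid_ch=16)
h = torch.zeros(B, 16, H, W)
c = torch.zeros(B, 16, H, W)
for t in range(acc.shape[1]):
    h, c = cell(acc[:, t], (h, c))
```

In the paper's setting, the recurrent state summarizing velocity and acceleration would then drive a decoder that warps or synthesizes future frames; the sketch stops at the recurrent encoding.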