Learning Object-Centric Dynamic Modes from Video and Emerging Properties

Armand Comas, Christian Fernandez Lopez, Sandesh Ghimire, Haolin Li, Mario Sznaier, Octavia Camps
Proceedings of The 5th Annual Learning for Dynamics and Control Conference, PMLR 211:745-769, 2023.

Abstract

One of the long-term objectives of Machine Learning is to endow machines with the capacity to structure and interpret the world as we do. This is particularly challenging for time series such as video sequences, since seemingly different data can correspond to the same underlying dynamics. Recent approaches seek to decompose video sequences into their constituent objects, attributes and dynamics in a self-supervised fashion, simplifying the task of learning suitable features for analyzing each component. While existing methods can successfully disentangle dynamics from other components, relatively little effort has gone into learning parsimonious representations of those underlying dynamics. In this paper, motivated by recent advances in non-linear identification, we propose a method to decompose a video into its moving objects, their attributes, and the dynamic modes of their trajectories. We model video dynamics as the output of a Koopman operator learned from the available data. In this setting, the dynamic information contained in the scene is encapsulated in the eigenvalues and eigenvectors of the Koopman operator, providing an interpretable and parsimonious representation. We show that this decomposition can be used, for instance, to perform video analytics, predict future frames, or generate synthetic video. We test our framework on a variety of datasets covering different dynamic scenarios, and illustrate the novel capabilities that emerge from our dynamic mode decomposition: interpretation of video dynamics and user manipulation at test time. We successfully forecast challenging object trajectories from pixels, achieving competitive performance while drawing useful insights.
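
To make the Koopman framing in the abstract concrete, here is a minimal sketch, not the paper's implementation (which learns object-centric embeddings directly from pixels): fit a linear operator that advances a trajectory of states one step in time, then read the dynamics off its eigendecomposition. The 2-D states, decay rate, and rotation angle below are synthetic stand-ins for learned object embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "object trajectory": a slowly decaying 2-D rotation,
# standing in for the learned embedding of one object's motion.
theta, decay = 0.3, 0.99
A_true = decay * np.array([[np.cos(theta), -np.sin(theta)],
                           [np.sin(theta),  np.cos(theta)]])
states = [rng.normal(size=2)]
for _ in range(199):
    states.append(A_true @ states[-1])
Z = np.stack(states, axis=1)  # shape (dim, T)

# Least-squares estimate of a finite-dimensional Koopman operator K,
# i.e. K such that Z[:, t+1] ~ K @ Z[:, t] (plain dynamic mode decomposition).
X, Y = Z[:, :-1], Z[:, 1:]
K = Y @ np.linalg.pinv(X)

# The spectrum encodes the dynamics: |lambda| gives growth/decay per step,
# angle(lambda) gives the oscillation frequency; eigenvectors are the modes.
eigvals, _ = np.linalg.eig(K)
print("moduli:", np.abs(eigvals))    # ~ [0.99, 0.99]
print("angles:", np.angle(eigvals))  # ~ [+0.3, -0.3]
```

Editing the recovered eigenvalues (e.g. rescaling their moduli or angles) before rolling the operator forward is the kind of test-time manipulation of dynamics the abstract refers to.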

Cite this Paper

BibTeX
@InProceedings{pmlr-v211-comas23a,
  title     = {Learning Object-Centric Dynamic Modes from Video and Emerging Properties},
  author    = {Comas, Armand and Lopez, Christian Fernandez and Ghimire, Sandesh and Li, Haolin and Sznaier, Mario and Camps, Octavia},
  booktitle = {Proceedings of The 5th Annual Learning for Dynamics and Control Conference},
  pages     = {745--769},
  year      = {2023},
  editor    = {Matni, Nikolai and Morari, Manfred and Pappas, George J.},
  volume    = {211},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--16 Jun},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v211/comas23a/comas23a.pdf},
  url       = {https://proceedings.mlr.press/v211/comas23a.html},
  abstract  = {One of the long-term objectives of Machine Learning is to endow machines with the capacity to structure and interpret the world as we do. This is particularly challenging for time series such as video sequences, since seemingly different data can correspond to the same underlying dynamics. Recent approaches seek to decompose video sequences into their constituent objects, attributes and dynamics in a self-supervised fashion, simplifying the task of learning suitable features for analyzing each component. While existing methods can successfully disentangle dynamics from other components, relatively little effort has gone into learning parsimonious representations of those underlying dynamics. In this paper, motivated by recent advances in non-linear identification, we propose a method to decompose a video into its moving objects, their attributes, and the dynamic modes of their trajectories. We model video dynamics as the output of a Koopman operator learned from the available data. In this setting, the dynamic information contained in the scene is encapsulated in the eigenvalues and eigenvectors of the Koopman operator, providing an interpretable and parsimonious representation. We show that this decomposition can be used, for instance, to perform video analytics, predict future frames, or generate synthetic video. We test our framework on a variety of datasets covering different dynamic scenarios, and illustrate the novel capabilities that emerge from our dynamic mode decomposition: interpretation of video dynamics and user manipulation at test time. We successfully forecast challenging object trajectories from pixels, achieving competitive performance while drawing useful insights.}
}
Endnote
%0 Conference Paper
%T Learning Object-Centric Dynamic Modes from Video and Emerging Properties
%A Armand Comas
%A Christian Fernandez Lopez
%A Sandesh Ghimire
%A Haolin Li
%A Mario Sznaier
%A Octavia Camps
%B Proceedings of The 5th Annual Learning for Dynamics and Control Conference
%C Proceedings of Machine Learning Research
%D 2023
%E Nikolai Matni
%E Manfred Morari
%E George J. Pappas
%F pmlr-v211-comas23a
%I PMLR
%P 745--769
%U https://proceedings.mlr.press/v211/comas23a.html
%V 211
%X One of the long-term objectives of Machine Learning is to endow machines with the capacity to structure and interpret the world as we do. This is particularly challenging for time series such as video sequences, since seemingly different data can correspond to the same underlying dynamics. Recent approaches seek to decompose video sequences into their constituent objects, attributes and dynamics in a self-supervised fashion, simplifying the task of learning suitable features for analyzing each component. While existing methods can successfully disentangle dynamics from other components, relatively little effort has gone into learning parsimonious representations of those underlying dynamics. In this paper, motivated by recent advances in non-linear identification, we propose a method to decompose a video into its moving objects, their attributes, and the dynamic modes of their trajectories. We model video dynamics as the output of a Koopman operator learned from the available data. In this setting, the dynamic information contained in the scene is encapsulated in the eigenvalues and eigenvectors of the Koopman operator, providing an interpretable and parsimonious representation. We show that this decomposition can be used, for instance, to perform video analytics, predict future frames, or generate synthetic video. We test our framework on a variety of datasets covering different dynamic scenarios, and illustrate the novel capabilities that emerge from our dynamic mode decomposition: interpretation of video dynamics and user manipulation at test time. We successfully forecast challenging object trajectories from pixels, achieving competitive performance while drawing useful insights.
APA
Comas, A., Lopez, C.F., Ghimire, S., Li, H., Sznaier, M. & Camps, O. (2023). Learning Object-Centric Dynamic Modes from Video and Emerging Properties. Proceedings of The 5th Annual Learning for Dynamics and Control Conference, in Proceedings of Machine Learning Research 211:745-769. Available from https://proceedings.mlr.press/v211/comas23a.html.
