VIM: Variational Independent Modules for Video Prediction

Rim Assouel, Lluis Castrejon, Aaron Courville, Nicolas Ballas, Yoshua Bengio
Proceedings of the First Conference on Causal Learning and Reasoning, PMLR 177:70-89, 2022.

Abstract

We introduce a variational inference model called VIM, for Variational Independent Modules, for sequential data that learns and infers latent representations as a set of objects and discovers modular causal mechanisms over these objects. These mechanisms - which we call modules - are independently parametrized, define the stochastic transitions of entities and are shared across entities. At each time step, our model infers from a low-level input sequence a high-level sequence of categorical latent variables to select which transition modules to apply to which high-level object. We evaluate this model in video prediction tasks where the goal is to predict multi-modal future events given previous observations. We demonstrate empirically that VIM can model 2D visual sequences in an interpretable way and is able to identify the underlying dynamically instantiated mechanisms of the generation process. We additionally show that the learnt modules can be composed at test time to generalize to out-of-distribution observations.

Cite this Paper


BibTeX
@InProceedings{pmlr-v177-assouel22a,
  title     = {{VIM}: Variational Independent Modules for Video Prediction},
  author    = {Assouel, Rim and Castrejon, Lluis and Courville, Aaron and Ballas, Nicolas and Bengio, Yoshua},
  booktitle = {Proceedings of the First Conference on Causal Learning and Reasoning},
  pages     = {70--89},
  year      = {2022},
  editor    = {Schölkopf, Bernhard and Uhler, Caroline and Zhang, Kun},
  volume    = {177},
  series    = {Proceedings of Machine Learning Research},
  month     = {11--13 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v177/assouel22a/assouel22a.pdf},
  url       = {https://proceedings.mlr.press/v177/assouel22a.html},
  abstract  = {We introduce a variational inference model called VIM, for Variational Independent Modules, for sequential data that learns and infers latent representations as a set of objects and discovers modular causal mechanisms over these objects. These mechanisms - which we call modules - are independently parametrized, define the stochastic transitions of entities and are shared across entities. At each time step, our model infers from a low-level input sequence a high-level sequence of categorical latent variables to select which transition modules to apply to which high-level object. We evaluate this model in video prediction tasks where the goal is to predict multi-modal future events given previous observations. We demonstrate empirically that VIM can model 2D visual sequences in an interpretable way and is able to identify the underlying dynamically instantiated mechanisms of the generation process. We additionally show that the learnt modules can be composed at test time to generalize to out-of-distribution observations.}
}
Endnote
%0 Conference Paper
%T VIM: Variational Independent Modules for Video Prediction
%A Rim Assouel
%A Lluis Castrejon
%A Aaron Courville
%A Nicolas Ballas
%A Yoshua Bengio
%B Proceedings of the First Conference on Causal Learning and Reasoning
%C Proceedings of Machine Learning Research
%D 2022
%E Bernhard Schölkopf
%E Caroline Uhler
%E Kun Zhang
%F pmlr-v177-assouel22a
%I PMLR
%P 70--89
%U https://proceedings.mlr.press/v177/assouel22a.html
%V 177
%X We introduce a variational inference model called VIM, for Variational Independent Modules, for sequential data that learns and infers latent representations as a set of objects and discovers modular causal mechanisms over these objects. These mechanisms - which we call modules - are independently parametrized, define the stochastic transitions of entities and are shared across entities. At each time step, our model infers from a low-level input sequence a high-level sequence of categorical latent variables to select which transition modules to apply to which high-level object. We evaluate this model in video prediction tasks where the goal is to predict multi-modal future events given previous observations. We demonstrate empirically that VIM can model 2D visual sequences in an interpretable way and is able to identify the underlying dynamically instantiated mechanisms of the generation process. We additionally show that the learnt modules can be composed at test time to generalize to out-of-distribution observations.
APA
Assouel, R., Castrejon, L., Courville, A., Ballas, N. & Bengio, Y. (2022). VIM: Variational Independent Modules for Video Prediction. Proceedings of the First Conference on Causal Learning and Reasoning, in Proceedings of Machine Learning Research 177:70-89. Available from https://proceedings.mlr.press/v177/assouel22a.html.
