$∞$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation

Saul Santos, António Farinhas, Daniel C. McNamee, André Martins
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:52877-52893, 2025.

Abstract

Current video-language models struggle with long-video understanding due to limited context lengths and reliance on sparse frame subsampling, which often leads to information loss. In this paper, we introduce $\infty$-Video, which can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers, enabling them to process unbounded video contexts efficiently and without additional training. Through continuous attention, our approach dynamically allocates higher granularity to the most relevant video segments, forming "sticky" memories that evolve over time. Experiments with Video-LLaMA and VideoChat2 demonstrate improved performance on video question-answering tasks, showcasing the potential of continuous-time LTM mechanisms to enable scalable, training-free comprehension of long videos.
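The abstract compresses the mechanism into two ingredients: continuous attention over a continuous-time representation of the frame stream, and a "sticky" consolidation step that revisits high-attention regions at finer granularity. Below is a minimal NumPy sketch of how these pieces could fit together; the Gaussian RBF basis, the moment-matched Gaussian attention density, the inverse-CDF resampling, and all function names are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def rbf_basis(t, centers, width=0.05):
    """Gaussian RBF design matrix: B[i, j] = exp(-(t_i - c_j)^2 / (2 w^2))."""
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

def fit_continuous_memory(frame_feats, n_basis=64, ridge=1e-3):
    """Compress T frame features (T x d) into a continuous signal
    x(t) = B(t) @ coeffs over t in [0, 1], via ridge regression."""
    T = frame_feats.shape[0]
    t = np.linspace(0.0, 1.0, T)
    centers = np.linspace(0.0, 1.0, n_basis)
    B = rbf_basis(t, centers)                        # (T, n_basis)
    G = B.T @ B + ridge * np.eye(n_basis)
    coeffs = np.linalg.solve(G, B.T @ frame_feats)   # (n_basis, d)
    return coeffs, centers

def continuous_attention(query, coeffs, centers, grid=512):
    """Continuous attention: fit a Gaussian density over time to the
    query-signal affinities and read out a context vector under it."""
    t = np.linspace(0.0, 1.0, grid)
    X = rbf_basis(t, centers) @ coeffs               # reconstructed signal
    scores = X @ query
    p = np.exp(scores - scores.max())
    p /= p.sum()                                     # softmax over time points
    mu = p @ t
    var = p @ (t - mu) ** 2 + 1e-6                   # moment-match N(mu, var)
    density = np.exp(-((t - mu) ** 2) / (2 * var))
    density /= density.sum()
    context = density @ X                            # attention read-out
    return context, t, density

def consolidate_sticky(coeffs, centers, t, density, n_keep=64):
    """'Sticky' consolidation: resample the signal by inverse-CDF sampling
    of the attention density, so high-attention segments keep finer
    granularity in the next memory update."""
    cdf = np.cumsum(density)
    cdf /= cdf[-1]
    new_t = np.interp(np.linspace(0.0, 1.0, n_keep), cdf, t)
    return rbf_basis(new_t, centers) @ coeffs        # consolidated (n_keep, d)
```

Under these assumptions, the inverse-CDF step is what realizes "higher granularity to the most relevant video segments": uniform quantiles of the attention density land densely wherever its mass concentrates, so those time spans survive consolidation at higher resolution while the rest of the video is coarsened.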

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-santos25a,
  title     = {$\infty$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation},
  author    = {Santos, Saul and Farinhas, Ant\'{o}nio and McNamee, Daniel C. and Martins, Andr\'{e}},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {52877--52893},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/santos25a/santos25a.pdf},
  url       = {https://proceedings.mlr.press/v267/santos25a.html},
  abstract  = {Current video-language models struggle with long-video understanding due to limited context lengths and reliance on sparse frame subsampling, which often leads to information loss. In this paper, we introduce $\infty$-Video, which can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers, enabling them to process unbounded video contexts efficiently and without additional training. Through continuous attention, our approach dynamically allocates higher granularity to the most relevant video segments, forming "sticky" memories that evolve over time. Experiments with Video-LLaMA and VideoChat2 demonstrate improved performance on video question-answering tasks, showcasing the potential of continuous-time LTM mechanisms to enable scalable, training-free comprehension of long videos.}
}
Endnote
%0 Conference Paper
%T $∞$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation
%A Saul Santos
%A António Farinhas
%A Daniel C. McNamee
%A André Martins
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-santos25a
%I PMLR
%P 52877--52893
%U https://proceedings.mlr.press/v267/santos25a.html
%V 267
%X Current video-language models struggle with long-video understanding due to limited context lengths and reliance on sparse frame subsampling, which often leads to information loss. In this paper, we introduce $\infty$-Video, which can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers, enabling them to process unbounded video contexts efficiently and without additional training. Through continuous attention, our approach dynamically allocates higher granularity to the most relevant video segments, forming "sticky" memories that evolve over time. Experiments with Video-LLaMA and VideoChat2 demonstrate improved performance on video question-answering tasks, showcasing the potential of continuous-time LTM mechanisms to enable scalable, training-free comprehension of long videos.
APA
Santos, S., Farinhas, A., McNamee, D. C., & Martins, A. (2025). $∞$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:52877-52893. Available from https://proceedings.mlr.press/v267/santos25a.html.
