Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:41340-41356, 2024.

Abstract

Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these attributes to video modality, which are termed Video-LLMs. However, existing Video-LLMs can only capture the coarse-grained semantics and are unable to effectively handle tasks related to comprehension or localization of specific video segments. In light of these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained temporally grounded comprehension and localization.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-qian24a,
  title     = {Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning},
  author    = {Qian, Long and Li, Juncheng and Wu, Yu and Ye, Yaobo and Fei, Hao and Chua, Tat-Seng and Zhuang, Yueting and Tang, Siliang},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {41340--41356},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/qian24a/qian24a.pdf},
  url       = {https://proceedings.mlr.press/v235/qian24a.html},
  abstract  = {Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these attributes to video modality, which are termed Video-LLMs. However, existing Video-LLMs can only capture the coarse-grained semantics and are unable to effectively handle tasks related to comprehension or localization of specific video segments. In light of these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained temporally grounded comprehension and localization.}
}
Endnote
%0 Conference Paper
%T Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
%A Long Qian
%A Juncheng Li
%A Yu Wu
%A Yaobo Ye
%A Hao Fei
%A Tat-Seng Chua
%A Yueting Zhuang
%A Siliang Tang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-qian24a
%I PMLR
%P 41340--41356
%U https://proceedings.mlr.press/v235/qian24a.html
%V 235
%X Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these attributes to video modality, which are termed Video-LLMs. However, existing Video-LLMs can only capture the coarse-grained semantics and are unable to effectively handle tasks related to comprehension or localization of specific video segments. In light of these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained temporally grounded comprehension and localization.
APA
Qian, L., Li, J., Wu, Y., Ye, Y., Fei, H., Chua, T., Zhuang, Y. & Tang, S. (2024). Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:41340-41356. Available from https://proceedings.mlr.press/v235/qian24a.html.