Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

Long Qian; Juncheng Li; Yu Wu; Yaobo Ye; Hao Fei; Tat-Seng Chua; Yueting Zhuang; Siliang Tang

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:41340-41356, 2024.

Abstract

Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these attributes to the video modality; the resulting models are termed Video-LLMs. However, existing Video-LLMs can only capture coarse-grained semantics and are unable to effectively handle tasks related to comprehension or localization of specific video segments. In light of these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained temporally grounded comprehension and localization.

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-qian24a,
  title     = {Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning},
  author    = {Qian, Long and Li, Juncheng and Wu, Yu and Ye, Yaobo and Fei, Hao and Chua, Tat-Seng and Zhuang, Yueting and Tang, Siliang},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {41340--41356},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/qian24a/qian24a.pdf},
  url       = {https://proceedings.mlr.press/v235/qian24a.html},
  abstract  = {Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these attributes to video modality, which are termed Video-LLMs. However, existing Video-LLMs can only capture the coarse-grained semantics and are unable to effectively handle tasks related to comprehension or localization of specific video segments. In light of these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained temporally grounded comprehension and localization.}
}

Endnote

%0 Conference Paper
%T Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
%A Long Qian
%A Juncheng Li
%A Yu Wu
%A Yaobo Ye
%A Hao Fei
%A Tat-Seng Chua
%A Yueting Zhuang
%A Siliang Tang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-qian24a
%I PMLR
%P 41340--41356
%U https://proceedings.mlr.press/v235/qian24a.html
%V 235
%X Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these attributes to video modality, which are termed Video-LLMs. However, existing Video-LLMs can only capture the coarse-grained semantics and are unable to effectively handle tasks related to comprehension or localization of specific video segments. In light of these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained temporally grounded comprehension and localization.

APA


Qian, L., Li, J., Wu, Y., Ye, Y., Fei, H., Chua, T.-S., Zhuang, Y., & Tang, S. (2024). Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:41340-41356. Available from https://proceedings.mlr.press/v235/qian24a.html.
