Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

Hao Fei; Shengqiong Wu; Wei Ji; Hanwang Zhang; Meishan Zhang; Mong-Li Lee; Wynne Hsu

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:13109-13125, 2024.

Abstract

Existing research on video understanding still struggles to achieve in-depth comprehension and reasoning in complex videos, primarily due to the under-exploration of two key bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension. This paper bridges the gap by presenting a novel solution. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating video spatial-temporal scene graph (STSG) representation. Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the Chain-of-Thought (CoT) core, breaking down a complex task into simpler and manageable sub-problems, and addressing them step-by-step from low-level pixel perception to high-level cognitive interpretation. Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework strikingly boosts the existing state of the art. To our knowledge, this is the first attempt at successfully implementing the CoT technique for achieving human-level video reasoning, and we show great potential in extending it to a wider range of video understanding scenarios. Systems and code will be released later.

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-fei24a,
  title = 	 {Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition},
  author =       {Fei, Hao and Wu, Shengqiong and Ji, Wei and Zhang, Hanwang and Zhang, Meishan and Lee, Mong-Li and Hsu, Wynne},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {13109--13125},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/fei24a/fei24a.pdf},
  url = 	 {https://proceedings.mlr.press/v235/fei24a.html},
  abstract = 	 {Existing research of video understanding still struggles to achieve in-depth comprehension and reasoning in complex videos, primarily due to the under-exploration of two key bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension. This paper bridges the gap by presenting a novel solution. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating video spatial-temporal scene graph (STSG) representation. Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the Chain-of-Thought (CoT) core, breaking down a complex task into simpler and manageable sub-problems, and addressing them step-by-step from a low-level pixel perception to high-level cognitive interpretation. Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework strikingly boosts existing state-of-the-art. To our knowledge, this is the first attempt at successfully implementing the CoT technique for achieving human-level video reasoning, where we show great potential in extending it to a wider range of video understanding scenarios. Systems and codes will be open later.}
}

Endnote

%0 Conference Paper
%T Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
%A Hao Fei
%A Shengqiong Wu
%A Wei Ji
%A Hanwang Zhang
%A Meishan Zhang
%A Mong-Li Lee
%A Wynne Hsu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-fei24a
%I PMLR
%P 13109--13125
%U https://proceedings.mlr.press/v235/fei24a.html
%V 235
%X Existing research of video understanding still struggles to achieve in-depth comprehension and reasoning in complex videos, primarily due to the under-exploration of two key bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension. This paper bridges the gap by presenting a novel solution. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating video spatial-temporal scene graph (STSG) representation. Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the Chain-of-Thought (CoT) core, breaking down a complex task into simpler and manageable sub-problems, and addressing them step-by-step from a low-level pixel perception to high-level cognitive interpretation. Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework strikingly boosts existing state-of-the-art. To our knowledge, this is the first attempt at successfully implementing the CoT technique for achieving human-level video reasoning, where we show great potential in extending it to a wider range of video understanding scenarios. Systems and codes will be open later.

APA


Fei, H., Wu, S., Ji, W., Zhang, H., Zhang, M., Lee, M.-L., & Hsu, W. (2024). Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:13109-13125. Available from https://proceedings.mlr.press/v235/fei24a.html.
