LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:54582-54599, 2025.

Abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is based on leveraging cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. We then utilize a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, LongVU also scales effectively to a smaller model size with state-of-the-art video understanding performance.
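
To make the three compression stages concrete, the sketch below illustrates one plausible reading of the pipeline in PyTorch: greedy removal of temporally redundant frames via feature similarity, text-guided selection of which frames keep full spatial tokens, and per-frame spatial token pruning against the previous frame. All function names, thresholds, shapes, and pooling choices are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of LongVU-style spatiotemporal compression.
# Thresholds, shapes, and helper names are illustrative assumptions.

def temporal_reduction(frame_feats: torch.Tensor, thresh: float = 0.9) -> torch.Tensor:
    """Drop temporally redundant frames: keep a frame only if its
    (DINOv2-style) pooled feature differs enough from the last kept frame.
    frame_feats: (T, D). Returns indices of kept frames."""
    keep = [0]
    for t in range(1, frame_feats.size(0)):
        sim = F.cosine_similarity(frame_feats[t], frame_feats[keep[-1]], dim=0)
        if sim < thresh:  # sufficiently different -> keep this frame
            keep.append(t)
    return torch.tensor(keep)

def cross_modal_selection(frame_tokens: torch.Tensor, text_emb: torch.Tensor,
                          top_k: int) -> list:
    """Text-guided frame feature reduction: keep full spatial tokens only for
    the top-k frames most relevant to the text query; average-pool the rest
    down to a single token. frame_tokens: (T, N, D); text_emb: (D,)."""
    frame_repr = frame_tokens.mean(dim=1)                                 # (T, D)
    scores = F.cosine_similarity(frame_repr, text_emb[None, :], dim=-1)   # (T,)
    top = set(scores.topk(min(top_k, scores.numel())).indices.tolist())
    return [frame_tokens[t] if t in top
            else frame_tokens[t].mean(dim=0, keepdim=True)
            for t in range(frame_tokens.size(0))]

def spatial_reduction(frame_tokens: torch.Tensor, thresh: float = 0.85) -> list:
    """Spatial token reduction via temporal dependency: within each frame
    after the first, drop tokens nearly identical to the token at the same
    spatial position in the previous frame."""
    out = [frame_tokens[0]]
    for t in range(1, frame_tokens.size(0)):
        sim = F.cosine_similarity(frame_tokens[t], frame_tokens[t - 1], dim=-1)  # (N,)
        out.append(frame_tokens[t][sim < thresh])  # keep only changed tokens
    return out

# Toy usage: 64 frames, 196 spatial tokens per frame, 768-dim features.
T, N, D = 64, 196, 768
pooled, tokens, query = torch.randn(T, D), torch.randn(T, N, D), torch.randn(D)
kept = temporal_reduction(pooled)
selected = cross_modal_selection(tokens[kept], query, top_k=16)
reduced = spatial_reduction(tokens[kept])
```

Note the greedy similarity check against the last kept frame keeps the temporal pass linear in the number of frames; the actual thresholds, pooling, and ordering of stages in LongVU may differ.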

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-shen25j,
  title     = {{L}ong{VU}: Spatiotemporal Adaptive Compression for Long Video-Language Understanding},
  author    = {Shen, Xiaoqian and Xiong, Yunyang and Zhao, Changsheng and Wu, Lemeng and Chen, Jun and Zhu, Chenchen and Liu, Zechun and Xiao, Fanyi and Varadarajan, Balakrishnan and Bordes, Florian and Liu, Zhuang and Xu, Hu and Kim, Hyunwoo J. and Soran, Bilge and Krishnamoorthi, Raghuraman and Elhoseiny, Mohamed and Chandra, Vikas},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {54582--54599},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/shen25j/shen25j.pdf},
  url       = {https://proceedings.mlr.press/v267/shen25j.html},
  abstract  = {Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is based on leveraging cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. We then utilize a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, LongVU also scales effectively to a smaller model size with state-of-the-art video understanding performance.}
}
Endnote
%0 Conference Paper
%T LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
%A Xiaoqian Shen
%A Yunyang Xiong
%A Changsheng Zhao
%A Lemeng Wu
%A Jun Chen
%A Chenchen Zhu
%A Zechun Liu
%A Fanyi Xiao
%A Balakrishnan Varadarajan
%A Florian Bordes
%A Zhuang Liu
%A Hu Xu
%A Hyunwoo J. Kim
%A Bilge Soran
%A Raghuraman Krishnamoorthi
%A Mohamed Elhoseiny
%A Vikas Chandra
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-shen25j
%I PMLR
%P 54582--54599
%U https://proceedings.mlr.press/v267/shen25j.html
%V 267
%X Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is based on leveraging cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. We then utilize a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, LongVU also scales effectively to a smaller model size with state-of-the-art video understanding performance.
APA
Shen, X., Xiong, Y., Zhao, C., Wu, L., Chen, J., Zhu, C., Liu, Z., Xiao, F., Varadarajan, B., Bordes, F., Liu, Z., Xu, H., Kim, H.J., Soran, B., Krishnamoorthi, R., Elhoseiny, M. & Chandra, V. (2025). LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:54582-54599. Available from https://proceedings.mlr.press/v267/shen25j.html.
