Ex-VAD: Explainable Fine-grained Video Anomaly Detection Based on Visual-Language Models

Chao Huang; Yushu Shi; Jie Wen; Wei Wang; Yong Xu; Xiaochun Cao

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:25750-25761, 2025.

Abstract

With advancements in visual language models (VLMs) and large language models (LLMs), video anomaly detection (VAD) has progressed beyond binary classification to fine-grained categorization and multidimensional analysis. However, existing methods focus mainly on coarse-grained detection, lacking anomaly explanations. To address these challenges, we propose Ex-VAD, an Explainable Fine-grained Video Anomaly Detection approach that combines fine-grained classification with detailed explanations of anomalies. First, we use a VLM to extract frame-level captions, and an LLM converts them to video-level explanations, enhancing the model’s explainability. Second, integrating textual explanations of anomalies with visual information greatly enhances the model’s anomaly detection capability. Finally, we apply label-enhanced alignment to optimize feature fusion, enabling precise fine-grained detection. Extensive experimental results on the UCF-Crime and XD-Violence datasets demonstrate that Ex-VAD significantly outperforms existing state-of-the-art methods.

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-huang25ad,
  title = 	 {Ex-{VAD}: Explainable Fine-grained Video Anomaly Detection Based on Visual-Language Models},
  author =       {Huang, Chao and Shi, Yushu and Wen, Jie and Wang, Wei and Xu, Yong and Cao, Xiaochun},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {25750--25761},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/huang25ad/huang25ad.pdf},
  url = 	 {https://proceedings.mlr.press/v267/huang25ad.html},
  abstract = 	 {With advancements in visual language models (VLMs) and large language models (LLMs), video anomaly detection (VAD) has progressed beyond binary classification to fine-grained categorization and multidimensional analysis. However, existing methods focus mainly on coarse-grained detection, lacking anomaly explanations. To address these challenges, we propose Ex-VAD, an Explainable Fine-grained Video Anomaly Detection approach that combines fine-grained classification with detailed explanations of anomalies. First, we use a VLM to extract frame-level captions, and an LLM converts them to video-level explanations, enhancing the model’s explainability. Second, integrating textual explanations of anomalies with visual information greatly enhances the model’s anomaly detection capability. Finally, we apply label-enhanced alignment to optimize feature fusion, enabling precise fine-grained detection. Extensive experimental results on the UCF-Crime and XD-Violence datasets demonstrate that Ex-VAD significantly outperforms existing State-of-The-Art methods.}
}

Endnote

%0 Conference Paper
%T Ex-VAD: Explainable Fine-grained Video Anomaly Detection Based on Visual-Language Models
%A Chao Huang
%A Yushu Shi
%A Jie Wen
%A Wei Wang
%A Yong Xu
%A Xiaochun Cao
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-huang25ad
%I PMLR
%P 25750--25761
%U https://proceedings.mlr.press/v267/huang25ad.html
%V 267
%X With advancements in visual language models (VLMs) and large language models (LLMs), video anomaly detection (VAD) has progressed beyond binary classification to fine-grained categorization and multidimensional analysis. However, existing methods focus mainly on coarse-grained detection, lacking anomaly explanations. To address these challenges, we propose Ex-VAD, an Explainable Fine-grained Video Anomaly Detection approach that combines fine-grained classification with detailed explanations of anomalies. First, we use a VLM to extract frame-level captions, and an LLM converts them to video-level explanations, enhancing the model’s explainability. Second, integrating textual explanations of anomalies with visual information greatly enhances the model’s anomaly detection capability. Finally, we apply label-enhanced alignment to optimize feature fusion, enabling precise fine-grained detection. Extensive experimental results on the UCF-Crime and XD-Violence datasets demonstrate that Ex-VAD significantly outperforms existing State-of-The-Art methods.

APA

Huang, C., Shi, Y., Wen, J., Wang, W., Xu, Y. & Cao, X. (2025). Ex-VAD: Explainable Fine-grained Video Anomaly Detection Based on Visual-Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:25750-25761. Available from https://proceedings.mlr.press/v267/huang25ad.html.
