GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning

Zhun Mou, Bin Xia, Zhengchao Huang, Wenming Yang, Jiaya Jia
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:44971-44996, 2025.

Abstract

Recent advances in video generation models have demonstrated their potential to produce high-quality videos, which poses challenges for effective evaluation. Unlike human evaluation, existing automated evaluation metrics lack high-level semantic understanding and reasoning capabilities for video, thus making them infeasible and unexplainable. To fill this gap, we curate GRADEO-Instruct, a multi-dimensional T2V evaluation instruction tuning dataset, including 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments converted from 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. The models, datasets, and code will be released soon.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-mou25a,
  title = {{GRADEO}: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning},
  author = {Mou, Zhun and Xia, Bin and Huang, Zhengchao and Yang, Wenming and Jia, Jiaya},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {44971--44996},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/mou25a/mou25a.pdf},
  url = {https://proceedings.mlr.press/v267/mou25a.html},
  abstract = {Recent great advances in video generation models have demonstrated their potential to produce high-quality videos, bringing challenges to effective evaluation. Unlike human evaluation, existing automated evaluation metrics lack high-level semantic understanding and reasoning capabilities for video, thus making them infeasible and unexplainable. To fill this gap, we curate GRADEO-Instruct, a multi-dimensional T2V evaluation instruction tuning dataset, including 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments converted by 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. The models, datasets, and codes will be released soon.}
}
Endnote
%0 Conference Paper
%T GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning
%A Zhun Mou
%A Bin Xia
%A Zhengchao Huang
%A Wenming Yang
%A Jiaya Jia
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-mou25a
%I PMLR
%P 44971--44996
%U https://proceedings.mlr.press/v267/mou25a.html
%V 267
%X Recent great advances in video generation models have demonstrated their potential to produce high-quality videos, bringing challenges to effective evaluation. Unlike human evaluation, existing automated evaluation metrics lack high-level semantic understanding and reasoning capabilities for video, thus making them infeasible and unexplainable. To fill this gap, we curate GRADEO-Instruct, a multi-dimensional T2V evaluation instruction tuning dataset, including 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments converted by 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. The models, datasets, and codes will be released soon.
APA
Mou, Z., Xia, B., Huang, Z., Yang, W. & Jia, J. (2025). GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:44971-44996. Available from https://proceedings.mlr.press/v267/mou25a.html.
