Understanding Complexity in VideoQA via Visual Program Generation

Cristobal Eyzaguirre, Igor Vasiljevic, Achal Dave, Jiajun Wu, Rares Andrei Ambrus, Thomas Kollar, Juan Carlos Niebles, Pavel Tokmakov
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:15613-15636, 2025.

Abstract

We propose a data-driven approach to analyzing query complexity in Video Question Answering (VideoQA). Previous efforts in benchmark design have relied on human expertise to design challenging questions, yet we experimentally show that humans struggle to predict which questions are difficult for machine learning models. Our automatic approach leverages recent advances in code generation for visual question answering, using the complexity of generated code as a proxy for question difficulty. We demonstrate that this measure correlates significantly better with model performance than human estimates. To operationalize this insight, we propose an algorithm for estimating question complexity from code. It identifies fine-grained primitives that correlate with the hardest questions for any given set of models, making it easy to scale to new approaches in the future. Finally, to further illustrate the utility of our method, we extend it to automatically generate complex questions, constructing a new benchmark that is 1.9 times harder than the popular NExT-QA.
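
The core idea, scoring a question by the complexity of the visual program a code-generation model writes to answer it, is straightforward to prototype. The sketch below is a minimal illustration, not the authors' implementation: the primitive names (select_frames, trim, describe_action) and the example program are hypothetical, and the raw call count stands in for the paper's finer-grained analysis, which identifies the primitives that correlate with hard questions for a given set of models.

    import ast
    from collections import Counter

    def primitive_counts(program: str) -> Counter:
        """Count how often each callable primitive appears in a generated program."""
        counts = Counter()
        for node in ast.walk(ast.parse(program)):
            if isinstance(node, ast.Call):
                fn = node.func
                # Handle both plain calls (trim(...)) and method calls (video.trim(...))
                name = fn.attr if isinstance(fn, ast.Attribute) else getattr(fn, "id", None)
                if name is not None:
                    counts[name] += 1
        return counts

    # Hypothetical program an LLM might emit for the question
    # "What does the person do after picking up the cup?"
    generated = '''
    def execute_command(video):
        cup_frames = select_frames(video, "person picks up a cup")
        after = trim(video, start=cup_frames[-1])
        return describe_action(after, "the person")
    '''

    counts = primitive_counts(generated)
    print(counts)                # Counter({'select_frames': 1, 'trim': 1, 'describe_action': 1})
    print(sum(counts.values()))  # naive complexity score: 3

Aggregated over a dataset, such per-primitive counts can then be correlated with model accuracy to surface the question types a given set of models finds hardest.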

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-eyzaguirre25a,
  title     = {Understanding Complexity in {V}ideo{QA} via Visual Program Generation},
  author    = {Eyzaguirre, Cristobal and Vasiljevic, Igor and Dave, Achal and Wu, Jiajun and Ambrus, Rares Andrei and Kollar, Thomas and Niebles, Juan Carlos and Tokmakov, Pavel},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {15613--15636},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/eyzaguirre25a/eyzaguirre25a.pdf},
  url       = {https://proceedings.mlr.press/v267/eyzaguirre25a.html},
  abstract  = {We propose a data-driven approach to analyzing query complexity in Video Question Answering (VideoQA). Previous efforts in benchmark design have relied on human expertise to design challenging questions, yet we experimentally show that humans struggle to predict which questions are difficult for machine learning models. Our automatic approach leverages recent advances in code generation for visual question answering, using the complexity of generated code as a proxy for question difficulty. We demonstrate that this measure correlates significantly better with model performance than human estimates. To operationalize this insight, we propose an algorithm for estimating question complexity from code. It identifies fine-grained primitives that correlate with the hardest questions for any given set of models, making it easy to scale to new approaches in the future. Finally, to further illustrate the utility of our method, we extend it to automatically generate complex questions, constructing a new benchmark that is 1.9 times harder than the popular NExT-QA.}
}
Endnote
%0 Conference Paper
%T Understanding Complexity in VideoQA via Visual Program Generation
%A Cristobal Eyzaguirre
%A Igor Vasiljevic
%A Achal Dave
%A Jiajun Wu
%A Rares Andrei Ambrus
%A Thomas Kollar
%A Juan Carlos Niebles
%A Pavel Tokmakov
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-eyzaguirre25a
%I PMLR
%P 15613--15636
%U https://proceedings.mlr.press/v267/eyzaguirre25a.html
%V 267
%X We propose a data-driven approach to analyzing query complexity in Video Question Answering (VideoQA). Previous efforts in benchmark design have relied on human expertise to design challenging questions, yet we experimentally show that humans struggle to predict which questions are difficult for machine learning models. Our automatic approach leverages recent advances in code generation for visual question answering, using the complexity of generated code as a proxy for question difficulty. We demonstrate that this measure correlates significantly better with model performance than human estimates. To operationalize this insight, we propose an algorithm for estimating question complexity from code. It identifies fine-grained primitives that correlate with the hardest questions for any given set of models, making it easy to scale to new approaches in the future. Finally, to further illustrate the utility of our method, we extend it to automatically generate complex questions, constructing a new benchmark that is 1.9 times harder than the popular NExT-QA.
APA
Eyzaguirre, C., Vasiljevic, I., Dave, A., Wu, J., Ambrus, R.A., Kollar, T., Niebles, J.C. & Tokmakov, P. (2025). Understanding Complexity in VideoQA via Visual Program Generation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:15613-15636. Available from https://proceedings.mlr.press/v267/eyzaguirre25a.html.
