Multi-Factor Adaptive Vision Selection for Egocentric Video Question Answering

Haoyu Zhang; Meng Liu; Zixin Liu; Xuemeng Song; Yaowei Wang; Liqiang Nie

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:59310-59328, 2024.

Abstract

The challenge of interpreting the world from a human perspective in Artificial Intelligence (AI) is particularly evident in egocentric video question answering, which grapples with issues like small object recognition, noise suppression, and spatial-temporal reasoning. To address these challenges, we introduce the Multi-Factor Adaptive vision Selection (MFAS) framework. MFAS integrates a patch partition and merging module for enhanced small object recognition, a prior-guided patch selection module for noise suppression and focused analysis, and a hierarchical aggregation network to aggregate visual semantics guided by questions. Extensive experiments on several public egocentric datasets have validated the effectiveness and generalization of our framework. Code and data are available at https://github.com/Hyu-Zhang/EgoVideoQA.

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-zhang24aj,
  title = 	 {Multi-Factor Adaptive Vision Selection for Egocentric Video Question Answering},
  author =       {Zhang, Haoyu and Liu, Meng and Liu, Zixin and Song, Xuemeng and Wang, Yaowei and Nie, Liqiang},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {59310--59328},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/zhang24aj/zhang24aj.pdf},
  url = 	 {https://proceedings.mlr.press/v235/zhang24aj.html},
  abstract = 	 {The challenge of interpreting the world from a human perspective in Artificial Intelligence (AI) is particularly evident in egocentric video question answering, which grapples with issues like small object recognition, noise suppression, and spatial-temporal reasoning. To address these challenges, we introduce the Multi-Factor Adaptive vision Selection (MFAS) framework. MFAS integrates a patch partition and merging module for enhanced small object recognition, a prior-guided patch selection module for noise suppression and focused analysis, and a hierarchical aggregation network to aggregate visual semantics guided by questions. Extensive experiments on several public egocentric datasets have validated the effectiveness and generalization of our framework. Code and data are available in https://github.com/Hyu-Zhang/EgoVideoQA.}
}

Endnote

%0 Conference Paper
%T Multi-Factor Adaptive Vision Selection for Egocentric Video Question Answering
%A Haoyu Zhang
%A Meng Liu
%A Zixin Liu
%A Xuemeng Song
%A Yaowei Wang
%A Liqiang Nie
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-zhang24aj
%I PMLR
%P 59310--59328
%U https://proceedings.mlr.press/v235/zhang24aj.html
%V 235
%X The challenge of interpreting the world from a human perspective in Artificial Intelligence (AI) is particularly evident in egocentric video question answering, which grapples with issues like small object recognition, noise suppression, and spatial-temporal reasoning. To address these challenges, we introduce the Multi-Factor Adaptive vision Selection (MFAS) framework. MFAS integrates a patch partition and merging module for enhanced small object recognition, a prior-guided patch selection module for noise suppression and focused analysis, and a hierarchical aggregation network to aggregate visual semantics guided by questions. Extensive experiments on several public egocentric datasets have validated the effectiveness and generalization of our framework. Code and data are available in https://github.com/Hyu-Zhang/EgoVideoQA.

APA


Zhang, H., Liu, M., Liu, Z., Song, X., Wang, Y. & Nie, L. (2024). Multi-Factor Adaptive Vision Selection for Egocentric Video Question Answering. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:59310-59328. Available from https://proceedings.mlr.press/v235/zhang24aj.html.
