Interaction-aware Dynamic 3D Gaze Estimation in Videos
Proceedings of The 2nd Gaze Meets ML workshop, PMLR 226:107-124, 2024.
Abstract
Human gaze during in-the-wild and outdoor human activities is a continuous and dynamic process, driven by anatomical eye movements such as fixations, saccades, and smooth pursuit. However, learning gaze dynamics in videos remains a challenging task, since annotating human gaze in videos is labor-intensive. In this paper, we propose a novel method for dynamic 3D gaze estimation in videos that utilizes human interaction labels. Our model contains a temporal gaze estimator built upon an autoregressive Transformer structure. In addition, our model learns the spatial relationship of gaze among multiple subjects by constructing a Human Interaction Graph from predicted gaze and updating the gaze features with a structure-aware Transformer. The model predicts future gaze conditioned on historical gaze and gaze interactions in an autoregressive manner. We propose a multi-state training algorithm that alternately updates the interaction module and the dynamic gaze estimation module when training on a mixture of labeled and unlabeled sequences. We show significant improvements in both within-domain gaze estimation accuracy and cross-domain generalization on the physically-unconstrained gaze estimation benchmark.
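To make the two architectural ideas in the abstract concrete, the following is a minimal sketch (not the authors' released code) of an autoregressive Transformer over per-subject gaze histories combined with a cross-subject attention update standing in for the Human Interaction Graph. All module names, dimensions, and the use of PyTorch's built-in Transformer layers are assumptions made for illustration only.

```python
# Illustrative sketch, assuming: (1) a causal Transformer encoder models each
# subject's gaze history autoregressively, and (2) attention across subjects
# at each time step approximates the interaction-graph feature update.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicGazeSketch(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)  # embed a 3D gaze direction
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)
        # Cross-subject attention as a stand-in for the interaction graph.
        self.interaction = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 3)   # regress the next gaze direction

    def forward(self, gaze_hist: torch.Tensor) -> torch.Tensor:
        # gaze_hist: (num_subjects, seq_len, 3) unit gaze vectors per subject.
        s, t, _ = gaze_hist.shape
        x = self.embed(gaze_hist)
        # Causal mask so each step attends only to past gaze (autoregressive).
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        x = self.temporal(x, mask=mask)
        # Interaction update: at each time step, every subject's feature
        # attends to all subjects' features (a fully connected graph).
        x_t = x.transpose(0, 1)                    # (seq_len, subjects, d_model)
        x_t, _ = self.interaction(x_t, x_t, x_t)
        x = x + x_t.transpose(0, 1)                # residual fusion
        return F.normalize(self.head(x), dim=-1)   # unit 3D gaze per step


# Usage: predict next-step gaze for 2 subjects from 10 observed frames.
model = DynamicGazeSketch()
hist = F.normalize(torch.randn(2, 10, 3), dim=-1)
pred = model(hist)   # (2, 10, 3); pred[:, -1] is the next-gaze estimate
```

In this sketch, predicted gaze at each step could be fed back as input to roll the model forward, and a training loop could alternate gradient updates between the interaction module and the temporal estimator, echoing the multi-state training the abstract describes; the exact alternation schedule and graph construction from predicted gaze are details the abstract does not specify.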