Leveraging Multi-Modal Saliency and Fusion for Gaze Target Detection

Athul Mathew, Arshad Khan, Thariq Khalid, AL-Tam Faroq, Riad Souissi
Proceedings of The 2nd Gaze Meets ML workshop, PMLR 226:161-180, 2024.

Abstract

Gaze target detection (GTD) is the task of predicting where a person in an image is looking. This is a challenging task, as it requires understanding the relationship between the person’s head, body, and eyes, as well as the surrounding environment. In this paper, we propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the 2D image into a 3D representation using monocular depth estimation. We then extract a depth-infused saliency map, which highlights the most salient (attention-grabbing) regions in the image for the subject under consideration. We also extract face and depth modalities from the image, and finally fuse all the extracted modalities to identify the gaze target. We quantitatively evaluated our method, including an ablation analysis, on three publicly available datasets, namely VideoAttentionTarget, GazeFollow, and GOO-Real, and showed that it outperforms other state-of-the-art methods. This suggests that our method is a promising new approach to GTD.
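To make the fusion step concrete, here is a minimal PyTorch sketch of the kind of pipeline the abstract describes: separate encoders for the scene image, the monocular depth map, the subject's face crop, and the depth-infused saliency map, fused by channel concatenation and decoded into a gaze-target heatmap. All module names, channel widths, and the late-fusion design are illustrative assumptions, not the authors' actual architecture.

    # Illustrative sketch only: the real model's encoders, fusion strategy,
    # and output head are described in the paper, not reproduced here.
    import torch
    import torch.nn as nn

    def conv_encoder(in_ch: int, out_ch: int) -> nn.Sequential:
        """Small convolutional encoder, shared design across modalities."""
        return nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, out_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    class GazeFusionSketch(nn.Module):
        """Fuses scene, depth, face, and saliency features into a gaze heatmap."""

        def __init__(self, feat_ch: int = 64):
            super().__init__()
            self.scene_enc = conv_encoder(3, feat_ch)     # RGB scene image
            self.depth_enc = conv_encoder(1, feat_ch)     # monocular depth map
            self.face_enc = conv_encoder(3, feat_ch)      # subject's face crop
            self.saliency_enc = conv_encoder(1, feat_ch)  # depth-infused saliency map
            # Late fusion by channel concatenation, then decode to a heatmap.
            self.decoder = nn.Sequential(
                nn.Conv2d(4 * feat_ch, feat_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
                nn.Conv2d(feat_ch, 1, kernel_size=1),     # per-pixel gaze-target score
            )

        def forward(self, scene, depth, face, saliency):
            fused = torch.cat(
                [
                    self.scene_enc(scene),
                    self.depth_enc(depth),
                    self.face_enc(face),  # face crop resized to the scene size
                    self.saliency_enc(saliency),
                ],
                dim=1,
            )
            return self.decoder(fused)  # (B, 1, H, W) gaze-target heatmap logits

    if __name__ == "__main__":
        model = GazeFusionSketch()
        b, h, w = 2, 224, 224
        heatmap = model(
            torch.randn(b, 3, h, w),  # scene
            torch.randn(b, 1, h, w),  # depth
            torch.randn(b, 3, h, w),  # face crop (resized)
            torch.randn(b, 1, h, w),  # saliency
        )
        print(heatmap.shape)  # torch.Size([2, 1, 224, 224])

Concatenation-based late fusion is just one plausible reading of "fuse all the extracted modalities"; attention-weighted fusion over the per-modality features would be an equally reasonable sketch.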

Cite this Paper


BibTeX
@InProceedings{pmlr-v226-mathew24a,
  title     = {Leveraging Multi-Modal Saliency and Fusion for Gaze Target Detection},
  author    = {Mathew, Athul and Khan, Arshad and Khalid, Thariq and Faroq, AL-Tam and Souissi, Riad},
  booktitle = {Proceedings of The 2nd Gaze Meets ML workshop},
  pages     = {161--180},
  year      = {2024},
  editor    = {Madu Blessing, Amarachi and Wu, Joy and Zanca, Dario and Krupinski, Elizabeth and Kashyap, Satyananda and Karargyris, Alexandros},
  volume    = {226},
  series    = {Proceedings of Machine Learning Research},
  month     = {16 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v226/mathew24a/mathew24a.pdf},
  url       = {https://proceedings.mlr.press/v226/mathew24a.html},
  abstract  = {Gaze target detection (GTD) is the task of predicting where a person in an image is looking. This is a challenging task, as it requires understanding the relationship between the person’s head, body, and eyes, as well as the surrounding environment. In this paper, we propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the 2D image into a 3D representation using monocular depth estimation. We then extract a depth-infused saliency map, which highlights the most salient (attention-grabbing) regions in the image for the subject under consideration. We also extract face and depth modalities from the image, and finally fuse all the extracted modalities to identify the gaze target. We quantitatively evaluated our method, including an ablation analysis, on three publicly available datasets, namely VideoAttentionTarget, GazeFollow, and GOO-Real, and showed that it outperforms other state-of-the-art methods. This suggests that our method is a promising new approach to GTD.}
}
Endnote
%0 Conference Paper
%T Leveraging Multi-Modal Saliency and Fusion for Gaze Target Detection
%A Athul Mathew
%A Arshad Khan
%A Thariq Khalid
%A AL-Tam Faroq
%A Riad Souissi
%B Proceedings of The 2nd Gaze Meets ML workshop
%C Proceedings of Machine Learning Research
%D 2024
%E Amarachi Madu Blessing
%E Joy Wu
%E Dario Zanca
%E Elizabeth Krupinski
%E Satyananda Kashyap
%E Alexandros Karargyris
%F pmlr-v226-mathew24a
%I PMLR
%P 161--180
%U https://proceedings.mlr.press/v226/mathew24a.html
%V 226
%X Gaze target detection (GTD) is the task of predicting where a person in an image is looking. This is a challenging task, as it requires understanding the relationship between the person’s head, body, and eyes, as well as the surrounding environment. In this paper, we propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the 2D image into a 3D representation using monocular depth estimation. We then extract a depth-infused saliency map, which highlights the most salient (attention-grabbing) regions in the image for the subject under consideration. We also extract face and depth modalities from the image, and finally fuse all the extracted modalities to identify the gaze target. We quantitatively evaluated our method, including an ablation analysis, on three publicly available datasets, namely VideoAttentionTarget, GazeFollow, and GOO-Real, and showed that it outperforms other state-of-the-art methods. This suggests that our method is a promising new approach to GTD.
APA
Mathew, A., Khan, A., Khalid, T., Faroq, A. & Souissi, R. (2024). Leveraging Multi-Modal Saliency and Fusion for Gaze Target Detection. Proceedings of The 2nd Gaze Meets ML workshop, in Proceedings of Machine Learning Research 226:161-180. Available from https://proceedings.mlr.press/v226/mathew24a.html.
