Learning to count visual objects by combining “what” and “where” in recurrent memory

Jessica A.F. Thompson, Hannah Sheahan, Christopher Summerfield
Proceedings of The 1st Gaze Meets ML workshop, PMLR 210:199-218, 2023.

Abstract

Counting the number of objects in a visual scene is easy for humans but challenging for modern deep neural networks. Here we explore what makes this problem hard and study the neural computations that allow transfer of counting ability to new objects and contexts. Previous work has implicated posterior parietal cortex (PPC) in numerosity perception and in visual scene understanding more broadly. It has been proposed that action-related saccadic signals computed in PPC provide object-invariant information about the number and arrangement of scene elements, and may contribute to relational reasoning in visual displays. Here, we built a glimpsing recurrent neural network that combines gaze contents (“what”) and gaze location (“where”) to count the number of items in a visual array. The network successfully learns to count and generalizes to several out-of-distribution test sets, including images with novel items. Through ablations and comparison to control models, we establish the contribution of brain-inspired computational principles to this generalization ability. This work provides a proof-of-principle demonstration that a neural network that combines “what” and “where” can learn a generalizable concept of numerosity and points to a promising approach for other visual reasoning tasks.

Cite this Paper


BibTeX
@InProceedings{pmlr-v210-thompson23a,
  title     = {Learning to count visual objects by combining “what” and “where” in recurrent memory},
  author    = {Thompson, Jessica A.F. and Sheahan, Hannah and Summerfield, Christopher},
  booktitle = {Proceedings of The 1st Gaze Meets ML workshop},
  pages     = {199--218},
  year      = {2023},
  editor    = {Lourentzou, Ismini and Wu, Joy and Kashyap, Satyananda and Karargyris, Alexandros and Celi, Leo Anthony and Kawas, Ban and Talathi, Sachin},
  volume    = {210},
  series    = {Proceedings of Machine Learning Research},
  month     = {03 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v210/thompson23a/thompson23a.pdf},
  url       = {https://proceedings.mlr.press/v210/thompson23a.html},
  abstract  = {Counting the number of objects in a visual scene is easy for humans but challenging for modern deep neural networks. Here we explore what makes this problem hard and study the neural computations that allow transfer of counting ability to new objects and contexts. Previous work has implicated posterior parietal cortex (PPC) in numerosity perception and in visual scene understanding more broadly. It has been proposed that action-related saccadic signals computed in PPC provide object-invariant information about the number and arrangement of scene elements, and may contribute to relational reasoning in visual displays. Here, we built a glimpsing recurrent neural network that combines gaze contents (“what”) and gaze location (“where”) to count the number of items in a visual array. The network successfully learns to count and generalizes to several out-of-distribution test sets, including images with novel items. Through ablations and comparison to control models, we establish the contribution of brain-inspired computational principles to this generalization ability. This work provides a proof-of-principle demonstration that a neural network that combines “what” and “where” can learn a generalizable concept of numerosity and points to a promising approach for other visual reasoning tasks.}
}
Endnote
%0 Conference Paper
%T Learning to count visual objects by combining “what” and “where” in recurrent memory
%A Jessica A.F. Thompson
%A Hannah Sheahan
%A Christopher Summerfield
%B Proceedings of The 1st Gaze Meets ML workshop
%C Proceedings of Machine Learning Research
%D 2023
%E Ismini Lourentzou
%E Joy Wu
%E Satyananda Kashyap
%E Alexandros Karargyris
%E Leo Anthony Celi
%E Ban Kawas
%E Sachin Talathi
%F pmlr-v210-thompson23a
%I PMLR
%P 199--218
%U https://proceedings.mlr.press/v210/thompson23a.html
%V 210
%X Counting the number of objects in a visual scene is easy for humans but challenging for modern deep neural networks. Here we explore what makes this problem hard and study the neural computations that allow transfer of counting ability to new objects and contexts. Previous work has implicated posterior parietal cortex (PPC) in numerosity perception and in visual scene understanding more broadly. It has been proposed that action-related saccadic signals computed in PPC provide object-invariant information about the number and arrangement of scene elements, and may contribute to relational reasoning in visual displays. Here, we built a glimpsing recurrent neural network that combines gaze contents (“what”) and gaze location (“where”) to count the number of items in a visual array. The network successfully learns to count and generalizes to several out-of-distribution test sets, including images with novel items. Through ablations and comparison to control models, we establish the contribution of brain-inspired computational principles to this generalization ability. This work provides a proof-of-principle demonstration that a neural network that combines “what” and “where” can learn a generalizable concept of numerosity and points to a promising approach for other visual reasoning tasks.
APA
Thompson, J.A.F., Sheahan, H. & Summerfield, C. (2023). Learning to count visual objects by combining “what” and “where” in recurrent memory. Proceedings of The 1st Gaze Meets ML workshop, in Proceedings of Machine Learning Research 210:199-218. Available from https://proceedings.mlr.press/v210/thompson23a.html.
