SimGroupAttn: Similarity-Guided Group Attention for Vision Transformer to Incorporate Population Information in Plant Disease Detection

Wangyang Wu, Ribana Roscher, Niklas Tötsch
Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL), PMLR 307:496-507, 2026.

Abstract

In this paper, we address the problem that Vision Transformer (ViT) models are limited to intra-image attention, which prevents them from leveraging cross-sample information. This is highly relevant for agricultural applications such as plant disease detection, where early and reliable diagnosis helps protect yields and food security. Yet existing methods often fail to capture subtle or overlapping symptoms that only become evident in a population context. Our approach $\textit{SimGroupAttn}$ extends masked image modeling by enabling image patches to attend not only within their own image but also to similar regions across other images in the same batch. Guided by a cosine similarity score that is trained jointly with the model weights, $\textit{SimGroupAttn}$ incorporates population-level context into the learned representations, making them more robust and discriminative. Extensive experiments on the PlantPathology dataset demonstrate that our approach outperforms Simple Masked Image Modeling (SimMIM) and Masked Autoencoders (MAE) in linear probing and classification tasks. It improves top-1 accuracy by up to 6.5% in linear probing for complex classes and by 3.5% in classification compared with the best baseline under the same settings.
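The core idea of the abstract can be illustrated with a toy sketch: pool the patch embeddings of an entire batch, and let each patch attend to its most cosine-similar neighbours regardless of which image they come from. This is not the authors' implementation; the function name, the hard top-k neighbour selection, and all shapes are illustrative assumptions (the paper instead learns the similarity score jointly with the model weights).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sim_group_attention(patches, top_k=4):
    """Toy cross-image attention: each patch attends to the top-k most
    cosine-similar patches drawn from the whole batch, not just its own image.

    patches: (B, N, D) array of patch embeddings, B images with N patches each.
    Returns a (B, N, D) array of population-aware patch representations.
    """
    B, N, D = patches.shape
    flat = patches.reshape(B * N, D)                     # pool all patches in the batch
    norm = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = norm @ norm.T                                  # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)                       # exclude self-matches
    idx = np.argsort(sim, axis=1)[:, -top_k:]            # top-k neighbours per patch
    weights = softmax(np.take_along_axis(sim, idx, axis=1), axis=1)
    out = (weights[..., None] * flat[idx]).sum(axis=1)   # similarity-weighted mix
    return out.reshape(B, N, D)
```

In a real ViT the similarity-guided neighbour set would gate the attention computed from learned query/key projections; the sketch above collapses that to raw cosine similarity purely to show where the cross-image information enters.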

Cite this Paper

BibTeX
@InProceedings{pmlr-v307-wu26a,
  title     = {SimGroupAttn: Similarity-Guided Group Attention for Vision Transformer to Incorporate Population Information in Plant Disease Detection},
  author    = {Wu, Wangyang and Roscher, Ribana and T{\"o}tsch, Niklas},
  booktitle = {Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL)},
  pages     = {496--507},
  year      = {2026},
  editor    = {Kim, Hyeongji and Ram{\'i}rez Rivera, Ad{\'i}n and Ricaud, Benjamin},
  volume    = {307},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--08 Jan},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v307/main/assets/wu26a/wu26a.pdf},
  url       = {https://proceedings.mlr.press/v307/wu26a.html},
  abstract  = {In this paper, we address the problem that Vision Transformer (ViT) models are limited to intra-image attention, which prevents them from leveraging cross-sample information. This is highly relevant for agricultural applications such as plant disease detection, where early and reliable diagnosis helps protect yields and food security. Yet existing methods often fail to capture subtle or overlapping symptoms that only become evident in a population context. Our approach $\textit{SimGroupAttn}$ extends masked image modeling by enabling image patches to attend not only within their own image but also to similar regions across other images in the same batch. Guided by a cosine similarity score that is trained jointly with the model weights, $\textit{SimGroupAttn}$ incorporates population-level context into the learned representations, making them more robust and discriminative. Extensive experiments on the PlantPathology dataset demonstrate that our approach outperforms Simple Masked Image Modeling (SimMIM) and Masked Autoencoders (MAE) in linear probing and classification tasks. It improves top-1 accuracy by up to 6.5% in linear probing for complex classes and by 3.5% in classification compared with the best baseline under the same settings.}
}
Endnote
%0 Conference Paper %T SimGroupAttn: Similarity-Guided Group Attention for Vision Transformer to Incorporate Population Information in Plant Disease Detection %A Wangyang Wu %A Ribana Roscher %A Niklas Tötsch %B Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL) %C Proceedings of Machine Learning Research %D 2026 %E Hyeongji Kim %E Adín Ramírez Rivera %E Benjamin Ricaud %F pmlr-v307-wu26a %I PMLR %P 496--507 %U https://proceedings.mlr.press/v307/wu26a.html %V 307 %X In this paper, we address the problem that Vision Transformer (ViT) models are limited to intra-image attention, which prevents them from leveraging cross-sample information. This is highly relevant for agricultural applications such as plant disease detection, where early and reliable diagnosis helps protect yields and food security. Yet existing methods often fail to capture subtle or overlapping symptoms that only become evident in a population context. Our approach $\textit{SimGroupAttn}$ extends masked image modeling by enabling image patches to attend not only within their own image but also to similar regions across other images in the same batch. Guided by a cosine similarity score that is trained jointly with the model weights, $\textit{SimGroupAttn}$ incorporates population-level context into the learned representations, making them more robust and discriminative. Extensive experiments on the PlantPathology dataset demonstrate that our approach outperforms Simple Masked Image Modeling (SimMIM) and Masked Autoencoders (MAE) in linear probing and classification tasks. It improves top-1 accuracy by up to 6.5% in linear probing for complex classes and by 3.5% in classification compared with the best baseline under the same settings.
APA
Wu, W., Roscher, R. & Tötsch, N. (2026). SimGroupAttn: Similarity-Guided Group Attention for Vision Transformer to Incorporate Population Information in Plant Disease Detection. Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL), in Proceedings of Machine Learning Research 307:496-507. Available from https://proceedings.mlr.press/v307/wu26a.html.