FIFA-RS: Fine-grained Image-Feature Alignment for Structural Anomaly Reasoning in Remote Sensing
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:552-563, 2026.
Abstract
Traditional remote sensing change detection paradigms typically rely on bi-temporal image pairs to identify surface variations. However, in time-critical scenarios such as post-disaster assessment, pre-event images may be unavailable or subject to severe registration errors. To address this limitation, we propose FIFA-RS, a zero-shot framework that formulates change detection as a single-temporal structural anomaly reasoning problem. FIFA-RS enhances the ability of vision–language models to characterize anthropogenic structures without relying on temporal references. Built upon a frozen CLIP backbone, the proposed framework adopts a lightweight two-stage adaptation strategy that combines token-level high-pass adaptation with an image-only 2D spatial high-pass enhancement branch. The former suppresses token-level common bias and emphasizes relative feature differences, while the latter sharpens local geometric structures such as building contours and boundaries. These structurally enhanced features are further aggregated through learnable multi-scale fusion for dense pixel-level anomaly localization. Extensive experiments indicate that FIFA-RS exhibits strong cross-dataset generalization across diverse remote sensing scenarios. When trained on LEVIR-CD using only post-event images and evaluated on the WHU Building Dataset in a zero-shot setting, the proposed method achieves a 95.07% Pixel AUC and a 58.51% F1-score. These results suggest that lightweight structural adaptation provides an effective and efficient solution for single-temporal remote sensing analysis.
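The abstract does not specify how the two high-pass operations are computed. As a rough illustration only, the sketch below assumes that the token-level high-pass subtracts the mean token (the shared "common bias") from every token, and that the 2D spatial high-pass subtracts a local box-filter average from each cell of a feature map. The function names `token_high_pass` and `spatial_high_pass` and the `radius` parameter are hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def token_high_pass(tokens: Matrix) -> Matrix:
    """Assumed form of a token-level high-pass: remove the mean token.

    `tokens` is a list of N token vectors of equal length. Subtracting the
    per-dimension mean removes the component shared by all tokens and keeps
    only their relative differences.
    """
    n = len(tokens)
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / n for d in range(dim)]
    return [[tok[d] - mean[d] for d in range(dim)] for tok in tokens]


def spatial_high_pass(fmap: Matrix, radius: int = 1) -> Matrix:
    """Assumed form of a 2D spatial high-pass: value minus local average.

    `fmap` is an H x W map. Each output cell is the input value minus the
    mean of its (2 * radius + 1)^2 neighbourhood; at the borders only the
    neighbours that fall inside the map are averaged.
    """
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            rows = range(max(0, i - radius), min(h, i + radius + 1))
            cols = range(max(0, j - radius), min(w, j + radius + 1))
            window = [fmap[r][c] for r in rows for c in cols]
            out[i][j] = fmap[i][j] - sum(window) / len(window)
    return out
```

Under these assumptions, a set of identical tokens maps to all zeros, and a flat feature map produces no response, while an isolated edge or corner produces a strong one. This is the sense in which a high-pass step emphasizes boundaries over smooth regions.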