FIFA-RS: Fine-grained Image-Feature Alignment for Structural Anomaly Reasoning in Remote Sensing

SHIH-CHIH LIN; Jia-Xian Jian; YunTung Chu; Wei-Chieh Sun

Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:552-563, 2026.

Abstract

Traditional remote sensing change detection paradigms typically rely on bi-temporal image pairs to identify surface variations. However, in time-critical scenarios such as post-disaster assessment, pre-event images may be unavailable or subject to severe registration errors. To address this limitation, we propose FIFA-RS, a zero-shot framework that formulates change detection as a single-temporal structural anomaly reasoning problem. FIFA-RS enhances the ability of vision–language models to characterize anthropogenic structures without relying on temporal references. Built upon a frozen CLIP backbone, the proposed framework adopts a lightweight two-stage adaptation strategy that combines token-level high-pass adaptation with an image-only 2D spatial high-pass enhancement branch. The former suppresses token-level common bias and emphasizes relative feature differences, while the latter sharpens local geometric structures such as building contours and boundaries. These structurally enhanced features are further aggregated through learnable multi-scale fusion for dense pixel-level anomaly localization. Extensive experiments indicate that FIFA-RS exhibits strong cross-dataset generalization across diverse remote sensing scenarios. When trained on LEVIR-CD using only post-event images and evaluated on the WHU Building Dataset in a zero-shot setting, the proposed method achieves a 95.07% Pixel AUC and a 58.51% F1-score. These results suggest that lightweight structural adaptation provides an effective and efficient solution for single-temporal remote sensing analysis.
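
The two high-pass ideas named in the abstract can be illustrated with a generic sketch. This is not the authors' implementation: the abstract does not specify the filters, so the code below assumes the simplest possible forms (mean-token subtraction for the token-level step, and subtraction of a 3x3 box blur for the 2D spatial step). The function names `token_high_pass` and `spatial_high_pass` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def token_high_pass(tokens: Matrix) -> Matrix:
    """Assumed token-level high-pass: subtract the mean token from every token.

    The mean token stands in for the "common bias" shared by all tokens;
    what remains are the relative differences between tokens.
    """
    n = len(tokens)
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / n for d in range(dim)]
    return [[tok[d] - mean[d] for d in range(dim)] for tok in tokens]


def spatial_high_pass(image: Matrix) -> Matrix:
    """Assumed 2D high-pass: subtract a 3x3 box-blurred copy from the image.

    Border pixels average only over the neighbours that exist. Flat regions
    map to zero, while edges and isolated structures keep a strong response.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            neigh = [image[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = image[y][x] - sum(neigh) / len(neigh)
    return out
```

In this sketch a constant image produces an all-zero response, which is the property that makes a high-pass filter emphasize contours and boundaries rather than uniform surfaces.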

Cite this Paper

BibTeX

@InProceedings{pmlr-v318-lin26a,
  title = 	 {FIFA-RS: Fine-grained Image-Feature Alignment for Structural Anomaly Reasoning in Remote Sensing},
  author =       {LIN, SHIH-CHIH and Jian, Jia-Xian and Chu, YunTung and Sun, Wei-Chieh},
  booktitle = 	 {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = 	 {552--563},
  year = 	 {2026},
  editor = 	 {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = 	 {318},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--29 May},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v318/main/assets/lin26a/lin26a.pdf},
  url = 	 {https://proceedings.mlr.press/v318/lin26a.html},
  abstract = 	 {Traditional remote sensing change detection paradigms typically rely on bi-temporal image pairs to identify surface variations. However, in time-critical scenarios such as post-disaster assessment, pre-event images may be unavailable or subject to severe registration errors. To address this limitation, we propose \textbf{FIFA-RS}, a zero-shot framework that formulates change detection as a \textbf{single-temporal structural anomaly reasoning} problem. FIFA-RS enhances the ability of vision–language models to characterize anthropogenic structures without relying on temporal references. Built upon a frozen CLIP backbone, the proposed framework adopts a lightweight two-stage adaptation strategy that combines token-level high-pass adaptation with an image-only 2D spatial high-pass enhancement branch. The former suppresses token-level common bias and emphasizes relative feature differences, while the latter sharpens local geometric structures such as building contours and boundaries. These structurally enhanced features are further aggregated through learnable multi-scale fusion for dense pixel-level anomaly localization. Extensive experiments indicate that FIFA-RS exhibits strong cross-dataset generalization across diverse remote sensing scenarios. When trained on LEVIR-CD using only post-event images and evaluated on the WHU Building Dataset in a zero-shot setting, the proposed method achieves a \textbf{95.07\% Pixel AUC} and a \textbf{58.51\% F1-score}. These results suggest that lightweight structural adaptation provides an effective and efficient solution for single-temporal remote sensing analysis.}
}

Endnote

%0 Conference Paper
%T FIFA-RS: Fine-grained Image-Feature Alignment for Structural Anomaly Reasoning in Remote Sensing
%A SHIH-CHIH LIN
%A Jia-Xian Jian
%A YunTung Chu
%A Wei-Chieh Sun
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-lin26a
%I PMLR
%P 552--563
%U https://proceedings.mlr.press/v318/lin26a.html
%V 318
%X Traditional remote sensing change detection paradigms typically rely on bi-temporal image pairs to identify surface variations. However, in time-critical scenarios such as post-disaster assessment, pre-event images may be unavailable or subject to severe registration errors. To address this limitation, we propose FIFA-RS, a zero-shot framework that formulates change detection as a single-temporal structural anomaly reasoning problem. FIFA-RS enhances the ability of vision–language models to characterize anthropogenic structures without relying on temporal references. Built upon a frozen CLIP backbone, the proposed framework adopts a lightweight two-stage adaptation strategy that combines token-level high-pass adaptation with an image-only 2D spatial high-pass enhancement branch. The former suppresses token-level common bias and emphasizes relative feature differences, while the latter sharpens local geometric structures such as building contours and boundaries. These structurally enhanced features are further aggregated through learnable multi-scale fusion for dense pixel-level anomaly localization. Extensive experiments indicate that FIFA-RS exhibits strong cross-dataset generalization across diverse remote sensing scenarios. When trained on LEVIR-CD using only post-event images and evaluated on the WHU Building Dataset in a zero-shot setting, the proposed method achieves a 95.07% Pixel AUC and a 58.51% F1-score. These results suggest that lightweight structural adaptation provides an effective and efficient solution for single-temporal remote sensing analysis.

APA

LIN, S., Jian, J., Chu, Y. & Sun, W. (2026). FIFA-RS: Fine-grained Image-Feature Alignment for Structural Anomaly Reasoning in Remote Sensing. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:552-563. Available from https://proceedings.mlr.press/v318/lin26a.html.
