Real-time Breast Lesion Detection in Videos via Spatial-temporal Feature Aggregation
Proceedings of The 8th International Conference on Medical Imaging with Deep Learning, PMLR 301:1372-1383, 2026.
Abstract
Recently, transformer-based detectors have shown impressive performance for breast lesion detection in ultrasound videos. However, these methods often require substantial computational resources and exhibit low inference speed, which poses challenges for real-time applicability. To address this issue, we introduce a fast yet accurate spatial-temporal transformer, named FA-DETR, to efficiently aggregate multi-scale spatial-temporal features for breast lesion detection in ultrasound videos. Our FA-DETR is based on a lightweight spatial-temporal self-attention module, which seamlessly fuses spatial and temporal features extracted from each video frame. In the decoding phase, we employ IoU-aware query selection to generate independent queries for each frame. These queries gain access to rich spatial-temporal information through the encoder embeddings' cross-attention and frame-aware cross-attention mechanisms. Experiments conducted on a public breast lesion ultrasound video dataset demonstrate that our FA-DETR achieves state-of-the-art performance with an absolute gain of 3.8% in terms of overall AP while being 2.5 times faster, compared to the best existing approach in the literature. Our code and models will be publicly released.