Real-time Breast Lesion Detection in Videos via Spatial-temporal Feature Aggregation

Chao Qin, Jiale Cao, Fahad Shahbaz Khan, Salman Khan, Huazhu Fu, Ehud Ahissar, Rao Muhammad Anwer
Proceedings of The 8th International Conference on Medical Imaging with Deep Learning, PMLR 301:1372-1383, 2026.

Abstract

Recently, transformer-based detectors have shown impressive performance for breast lesion detection in ultrasound videos. However, these methods often require substantial computational resource and exhibit low inference speed, which poses challenges towards real-time applicability. To address this issue, we introduce a fast yet accurate spatial-temporal transformer, named FA-DETR, to efficiently aggregate multi-scale spatial-temporal features for breast lesion detection in ultrasound videos. Our FA-DETR is based on a lightweight spatial-temporal self-attention module, which seamlessly fuses spatial and temporal features extracted from each video frame. In the decoding phase, we employ IoU-aware query selection to generate independent queries for each frame. These queries gain access to rich spatial-temporal information through the encoder embeddings’ cross-attention and frame-aware cross-attention mechanisms. Experiments conducted on a public breast lesion ultrasound video dataset demonstrate that our FA-DETR achieves state-of-the-art performance with an absolute gain of 3.8% in terms of overall AP while being 2.5 times faster, compared to the best existing approach in the literature. Our code and models will be publicly released.

Cite this Paper


BibTeX
@InProceedings{pmlr-v301-qin26a,
  title = {Real-time Breast Lesion Detection in Videos via Spatial-temporal Feature Aggregation},
  author = {Qin, Chao and Cao, Jiale and Khan, Fahad Shahbaz and Khan, Salman and Fu, Huazhu and Ahissar, Ehud and Anwer, Rao Muhammad},
  booktitle = {Proceedings of The 8th International Conference on Medical Imaging with Deep Learning},
  pages = {1372--1383},
  year = {2026},
  editor = {Tasdizen, Tolga and Elhabian, Shireen and Summers, Ronald and Chen, Chen and Koch, Lisa and Zhuang, Yan},
  volume = {301},
  series = {Proceedings of Machine Learning Research},
  month = {09--11 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v301/main/assets/qin26a/qin26a.pdf},
  url = {https://proceedings.mlr.press/v301/qin26a.html},
  abstract = {Recently, transformer-based detectors have shown impressive performance for breast lesion detection in ultrasound videos. However, these methods often require substantial computational resource and exhibit low inference speed, which poses challenges towards real-time applicability. To address this issue, we introduce a fast yet accurate spatial-temporal transformer, named FA-DETR, to efficiently aggregate multi-scale spatial-temporal features for breast lesion detection in ultrasound videos. Our FA-DETR is based on a lightweight spatial-temporal self-attention module, which seamlessly fuses spatial and temporal features extracted from each video frame. In the decoding phase, we employ IoU-aware query selection to generate independent queries for each frame. These queries gain access to rich spatial-temporal information through the encoder embeddings’ cross-attention and frame-aware cross-attention mechanisms. Experiments conducted on a public breast lesion ultrasound video dataset demonstrate that our FA-DETR achieves state-of-the-art performance with an absolute gain of 3.8% in terms of overall AP while being 2.5 times faster, compared to the best existing approach in the literature. Our code and models will be publicly released.}
}
Endnote
%0 Conference Paper
%T Real-time Breast Lesion Detection in Videos via Spatial-temporal Feature Aggregation
%A Chao Qin
%A Jiale Cao
%A Fahad Shahbaz Khan
%A Salman Khan
%A Huazhu Fu
%A Ehud Ahissar
%A Rao Muhammad Anwer
%B Proceedings of The 8th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Tolga Tasdizen
%E Shireen Elhabian
%E Ronald Summers
%E Chen Chen
%E Lisa Koch
%E Yan Zhuang
%F pmlr-v301-qin26a
%I PMLR
%P 1372--1383
%U https://proceedings.mlr.press/v301/qin26a.html
%V 301
%X Recently, transformer-based detectors have shown impressive performance for breast lesion detection in ultrasound videos. However, these methods often require substantial computational resource and exhibit low inference speed, which poses challenges towards real-time applicability. To address this issue, we introduce a fast yet accurate spatial-temporal transformer, named FA-DETR, to efficiently aggregate multi-scale spatial-temporal features for breast lesion detection in ultrasound videos. Our FA-DETR is based on a lightweight spatial-temporal self-attention module, which seamlessly fuses spatial and temporal features extracted from each video frame. In the decoding phase, we employ IoU-aware query selection to generate independent queries for each frame. These queries gain access to rich spatial-temporal information through the encoder embeddings’ cross-attention and frame-aware cross-attention mechanisms. Experiments conducted on a public breast lesion ultrasound video dataset demonstrate that our FA-DETR achieves state-of-the-art performance with an absolute gain of 3.8% in terms of overall AP while being 2.5 times faster, compared to the best existing approach in the literature. Our code and models will be publicly released.
APA
Qin, C., Cao, J., Khan, F.S., Khan, S., Fu, H., Ahissar, E. & Anwer, R.M. (2026). Real-time Breast Lesion Detection in Videos via Spatial-temporal Feature Aggregation. Proceedings of The 8th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 301:1372-1383. Available from https://proceedings.mlr.press/v301/qin26a.html.
