Attention Distillation for Detection Transformers: Application to Real-Time Video Object Detection in Ultrasound

Jonathan Rubin, Ramon Erkamp, Ragha Srinivasa Naidu, Anumod Odungatta Thodiyil, Alvin Chen
Proceedings of Machine Learning for Health, PMLR 158:26-37, 2021.

Abstract

We introduce a method for efficient knowledge distillation of transformer-based object detectors. The proposed “attention distillation” makes use of the self-attention matrices generated within the layers of the state-of-the-art detection transformer (DETR) model. Localization information from the attention maps of a large teacher network is distilled into smaller student networks capable of running at much higher speeds. We further investigate distilling spatio-temporal information captured by 3D detection transformer networks into 2D object detectors that only process single frames. We apply the approach to the clinically important problem of detecting medical instruments in real time from ultrasound video sequences, where inference speed is critical on computationally resource-limited hardware. We observe that, via attention distillation, student networks are able to approach the detection performance of larger teacher networks, while meeting strict computational requirements. Experiments demonstrate notable gains in accuracy and speed compared to detection transformer models trained without attention distillation.
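The abstract describes matching a student's self-attention maps against those of a larger teacher. The paper's exact loss is not given on this page; a common formulation is a regression loss between the teacher and student attention matrices. The NumPy sketch below illustrates that idea only as an assumption: the function names, the MSE choice, and the dimensions are hypothetical, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_map(q, k):
    # Scaled dot-product attention weights: softmax(Q K^T / sqrt(d)).
    # This matrix is the quantity being distilled.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d))

def attention_distillation_loss(teacher_attn, student_attn):
    # Hypothetical distillation term: MSE between the two attention maps.
    return float(np.mean((teacher_attn - student_attn) ** 2))

rng = np.random.default_rng(0)
tokens, d_teacher, d_student = 16, 64, 32

# Teacher and student attend over the same token grid, so their attention
# maps have the same (tokens x tokens) shape even though head dims differ.
A_t = self_attention_map(rng.normal(size=(tokens, d_teacher)),
                         rng.normal(size=(tokens, d_teacher)))
A_s = self_attention_map(rng.normal(size=(tokens, d_student)),
                         rng.normal(size=(tokens, d_student)))

loss = attention_distillation_loss(A_t, A_s)
```

In training, this term would be added to the usual detection losses so the student's attention is pulled toward the teacher's localization pattern.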

Cite this Paper


BibTeX
@InProceedings{pmlr-v158-rubin21a,
  title     = {Attention Distillation for Detection Transformers: Application to Real-Time Video Object Detection in Ultrasound},
  author    = {Rubin, Jonathan and Erkamp, Ramon and Naidu, Ragha Srinivasa and Thodiyil, Anumod Odungatta and Chen, Alvin},
  booktitle = {Proceedings of Machine Learning for Health},
  pages     = {26--37},
  year      = {2021},
  editor    = {Roy, Subhrajit and Pfohl, Stephen and Rocheteau, Emma and Tadesse, Girmaw Abebe and Oala, Luis and Falck, Fabian and Zhou, Yuyin and Shen, Liyue and Zamzmi, Ghada and Mugambi, Purity and Zirikly, Ayah and McDermott, Matthew B. A. and Alsentzer, Emily},
  volume    = {158},
  series    = {Proceedings of Machine Learning Research},
  month     = {04 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v158/rubin21a/rubin21a.pdf},
  url       = {https://proceedings.mlr.press/v158/rubin21a.html},
  abstract  = {We introduce a method for efficient knowledge distillation of transformer-based object detectors. The proposed “attention distillation” makes use of the self-attention matrices generated within the layers of the state-of-the-art detection transformer (DETR) model. Localization information from the attention maps of a large teacher network is distilled into smaller student networks capable of running at much higher speeds. We further investigate distilling spatio-temporal information captured by 3D detection transformer networks into 2D object detectors that only process single frames. We apply the approach to the clinically important problem of detecting medical instruments in real time from ultrasound video sequences, where inference speed is critical on computationally resource-limited hardware. We observe that, via attention distillation, student networks are able to approach the detection performance of larger teacher networks, while meeting strict computational requirements. Experiments demonstrate notable gains in accuracy and speed compared to detection transformer models trained without attention distillation.}
}
Endnote
%0 Conference Paper %T Attention Distillation for Detection Transformers: Application to Real-Time Video Object Detection in Ultrasound %A Jonathan Rubin %A Ramon Erkamp %A Ragha Srinivasa Naidu %A Anumod Odungatta Thodiyil %A Alvin Chen %B Proceedings of Machine Learning for Health %C Proceedings of Machine Learning Research %D 2021 %E Subhrajit Roy %E Stephen Pfohl %E Emma Rocheteau %E Girmaw Abebe Tadesse %E Luis Oala %E Fabian Falck %E Yuyin Zhou %E Liyue Shen %E Ghada Zamzmi %E Purity Mugambi %E Ayah Zirikly %E Matthew B. A. McDermott %E Emily Alsentzer %F pmlr-v158-rubin21a %I PMLR %P 26--37 %U https://proceedings.mlr.press/v158/rubin21a.html %V 158 %X We introduce a method for efficient knowledge distillation of transformer-based object detectors. The proposed “attention distillation” makes use of the self-attention matrices generated within the layers of the state-of-the-art detection transformer (DETR) model. Localization information from the attention maps of a large teacher network is distilled into smaller student networks capable of running at much higher speeds. We further investigate distilling spatio-temporal information captured by 3D detection transformer networks into 2D object detectors that only process single frames. We apply the approach to the clinically important problem of detecting medical instruments in real time from ultrasound video sequences, where inference speed is critical on computationally resource-limited hardware. We observe that, via attention distillation, student networks are able to approach the detection performance of larger teacher networks, while meeting strict computational requirements. Experiments demonstrate notable gains in accuracy and speed compared to detection transformer models trained without attention distillation.
APA
Rubin, J., Erkamp, R., Naidu, R.S., Thodiyil, A.O. & Chen, A. (2021). Attention Distillation for Detection Transformers: Application to Real-Time Video Object Detection in Ultrasound. Proceedings of Machine Learning for Health, in Proceedings of Machine Learning Research 158:26-37. Available from https://proceedings.mlr.press/v158/rubin21a.html.