Learning Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking

Yongxin Li, Mengyuan Liu, You Wu, Xucheng Wang, Xiangyang Yang, Shuiwang Li
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:28403-28420, 2024.

Abstract

Harnessing transformer-based models, visual tracking has made substantial strides. However, the slow speed of current trackers limits their practicality on devices with constrained computational capabilities, especially for real-time unmanned aerial vehicle (UAV) tracking. To address this challenge, we introduce AVTrack, an adaptive computation framework that selectively activates transformer blocks for real-time UAV tracking. Our novel Activation Module (AM) dynamically optimizes the ViT architecture, engaging only the relevant components and improving inference efficiency without significantly compromising tracking performance. Moreover, we bolster the effectiveness of ViTs, particularly against the extreme changes in viewing angle commonly encountered in UAV tracking, by learning view-invariant representations through mutual information maximization. Extensive experiments on five tracking benchmarks confirm the effectiveness and versatility of our approach, positioning it as a state-of-the-art solution in visual tracking. Code is released at: https://github.com/wuyou3474/AVTrack.
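
To make the adaptive-computation idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of a ViT whose blocks are paired with small gating heads that decide, per input, whether each block runs. All names here (GatedBlock, AdaptiveViT, gate_threshold) and the soft gating used during training are illustrative assumptions; see the linked repository for the actual AVTrack implementation.

```python
# Hypothetical sketch of per-block adaptive activation in a ViT backbone.
import torch
import torch.nn as nn


class GatedBlock(nn.Module):
    """A standard pre-norm ViT block wrapped with a learned activation gate."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        # Gating head: pooled tokens -> probability that this block is worth running.
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 1)
        )

    def forward(self, x: torch.Tensor, gate_threshold: float = 0.5):
        p = torch.sigmoid(self.gate(x.mean(dim=1)))  # (B, 1) activation probability
        if not self.training and p.mean() < gate_threshold:
            return x, p  # bypass the block entirely at inference time
        y = self.norm1(x)
        h = x + self.attn(y, y, y, need_weights=False)[0]
        out = h + self.mlp(self.norm2(h))
        if self.training:
            # Soft gating keeps the skip decision differentiable during training.
            out = p.unsqueeze(-1) * out + (1.0 - p.unsqueeze(-1)) * x
        return out, p


class AdaptiveViT(nn.Module):
    """Stack of gated blocks; uninformative blocks are skipped at inference."""

    def __init__(self, depth: int = 12, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(GatedBlock(dim, num_heads) for _ in range(depth))

    def forward(self, tokens: torch.Tensor):
        gate_probs = []
        for blk in self.blocks:
            tokens, p = blk(tokens)
            gate_probs.append(p)
        return tokens, torch.cat(gate_probs, dim=1)  # (B, depth) gate probabilities


if __name__ == "__main__":
    model = AdaptiveViT().eval()
    feats, probs = model(torch.randn(1, 196, 256))
    print(feats.shape, probs.shape)
```

At inference, blocks whose predicted activation probability falls below the threshold are bypassed entirely, which is where the speed-up on resource-constrained UAV platforms would come from; the collected probabilities could additionally feed a sparsity regularizer during training.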

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-li24ax,
  title     = {Learning Adaptive and View-Invariant Vision Transformer for Real-Time {UAV} Tracking},
  author    = {Li, Yongxin and Liu, Mengyuan and Wu, You and Wang, Xucheng and Yang, Xiangyang and Li, Shuiwang},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {28403--28420},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/li24ax/li24ax.pdf},
  url       = {https://proceedings.mlr.press/v235/li24ax.html},
  abstract  = {Harnessing transformer-based models, visual tracking has made substantial strides. However, the sluggish performance of current trackers limits their practicality on devices with constrained computational capabilities, especially for real-time unmanned aerial vehicle (UAV) tracking. Addressing this challenge, we introduce AVTrack, an adaptive computation framework tailored to selectively activate transformer blocks for real-time UAV tracking in this work. Our novel Activation Module (AM) dynamically optimizes ViT architecture, selectively engaging relevant components and enhancing inference efficiency without compromising much tracking performance. Moreover, we bolster the effectiveness of ViTs, particularly in addressing challenges arising from extreme changes in viewing angles commonly encountered in UAV tracking, by learning view-invariant representations through mutual information maximization. Extensive experiments on five tracking benchmarks affirm the effectiveness and versatility of our approach, positioning it as a state-of-the-art solution in visual tracking. Code is released at: https://github.com/wuyou3474/AVTrack.}
}
Endnote
%0 Conference Paper
%T Learning Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking
%A Yongxin Li
%A Mengyuan Liu
%A You Wu
%A Xucheng Wang
%A Xiangyang Yang
%A Shuiwang Li
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-li24ax
%I PMLR
%P 28403--28420
%U https://proceedings.mlr.press/v235/li24ax.html
%V 235
%X Harnessing transformer-based models, visual tracking has made substantial strides. However, the sluggish performance of current trackers limits their practicality on devices with constrained computational capabilities, especially for real-time unmanned aerial vehicle (UAV) tracking. Addressing this challenge, we introduce AVTrack, an adaptive computation framework tailored to selectively activate transformer blocks for real-time UAV tracking in this work. Our novel Activation Module (AM) dynamically optimizes ViT architecture, selectively engaging relevant components and enhancing inference efficiency without compromising much tracking performance. Moreover, we bolster the effectiveness of ViTs, particularly in addressing challenges arising from extreme changes in viewing angles commonly encountered in UAV tracking, by learning view-invariant representations through mutual information maximization. Extensive experiments on five tracking benchmarks affirm the effectiveness and versatility of our approach, positioning it as a state-of-the-art solution in visual tracking. Code is released at: https://github.com/wuyou3474/AVTrack.
APA
Li, Y., Liu, M., Wu, Y., Wang, X., Yang, X. & Li, S. (2024). Learning Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:28403-28420. Available from https://proceedings.mlr.press/v235/li24ax.html.