ViBid: Linear Vision Transformer with Bidirectional Normalization
Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:1996-2005, 2023.
Abstract
The vision transformer has achieved state-of-the-art performance on various vision tasks; however, its memory consumption is larger than that of previous convolutional neural network based models because of the $O(N^2)$ time and memory complexity of general self-attention. Many approaches reduce this complexity to $O(N)$; however, to compensate for the resulting performance degradation, they stack deep convolutional layers to retain locality or complicate the architecture, as in window attention. To solve these problems, we propose the ViBid algorithm, which resolves the $O(N^2)$ complexity problem by replacing Softmax with bidirectional normalization (BiNorm). In addition, it has a much simpler architecture than existing transformer models with $O(N)$ complexity. Owing to this simple architecture, we were able to train at larger resolutions and obtain a lighter model with superior GPU throughput and competitive performance. Because of BiNorm, ViBid can be used with any transformer method that relies on queries, keys, and values (QKV), and its simple architecture makes it broadly applicable.
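The abstract does not spell out the BiNorm formula, but the general mechanism it describes is well known: if the queries and keys are normalized separately (rather than applying Softmax to the full $N \times N$ score matrix), attention can be computed by associativity as $Q(K^\top V)$, which is linear in the number of tokens $N$. Below is a minimal PyTorch sketch of this idea, assuming a hypothetical bidirectional normalization that applies Softmax to $Q$ over the feature axis and to $K$ over the token axis; the paper's actual BiNorm may differ in detail.

```python
import torch
import torch.nn.functional as F

def bidirectional_linear_attention(q, k, v):
    """Linear-complexity attention sketch.

    Instead of softmax(Q @ K^T / sqrt(d)) @ V, which builds an
    N x N score matrix, normalize Q and K along complementary
    axes and compute Q @ (K^T @ V), which costs O(N d^2) time
    and memory in the sequence length N.

    q, k, v: tensors of shape (batch, heads, N, d)
    """
    # Assumed "bidirectional" normalization: each query vector is
    # normalized over its features, and each key feature is
    # normalized over the tokens (an assumption for illustration).
    q = F.softmax(q, dim=-1)  # normalize each query over features
    k = F.softmax(k, dim=-2)  # normalize keys across tokens

    # (d x d) context summary; the N x N matrix is never formed.
    context = torch.einsum('bhnd,bhne->bhde', k, v)
    return torch.einsum('bhnd,bhde->bhne', q, context)

# Example: 14x14 = 196 patch tokens, 4 heads, 64-dim heads.
q = k = v = torch.randn(2, 4, 196, 64)
out = bidirectional_linear_attention(q, k, v)  # shape (2, 4, 196, 64)
```

Because the normalization touches only $Q$ and $K$ individually, a drop-in replacement of this kind is what allows the method to plug into any QKV-based transformer, as the abstract claims.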