Multi-Dimensional Hyena for Spatial Inductive Bias

Itamar Zimerman, Lior Wolf
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:973-981, 2024.

Abstract

The advantage of Vision Transformers over CNNs is only fully manifested when training on large datasets, mainly due to the reduced inductive bias towards spatial locality within the transformer’s self-attention mechanism. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel multi-axis generalization of the recent Hyena layer. We propose several alternative approaches for obtaining this generalization and delve into their unique distinctions and considerations from both empirical and theoretical perspectives. The proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT, across multiple datasets. Furthermore, in the small-dataset regime, our Hyena-based ViT compares favorably to ViT variants from the recent literature that are specifically designed to address the same challenge. Finally, we show that a hybrid approach that uses Hyena N-D for the first layers of the ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures. Our code is attached as supplementary material.
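To make the abstract's core idea concrete, the operator it describes replaces self-attention with a gated, FFT-based global convolution applied along both spatial axes of the image. The sketch below is a minimal, hedged illustration of that kind of 2-D Hyena-style block, not the authors' implementation: the names `fft_conv2d` and `Hyena2DSketch`, the explicit per-channel filter parameter (the actual Hyena layer parameterizes its long filters implicitly), and the exact gating order are assumptions made for illustration only.

```python
import torch


def fft_conv2d(x, h):
    # Global (circular) 2-D convolution via FFT: O(HW log HW) instead of the
    # O((HW)^2) cost of dense self-attention over H*W tokens.
    # x: (B, C, H, W) input features; h: (C, H, W) per-channel filter.
    H, W = x.shape[-2:]
    X = torch.fft.rfft2(x, s=(H, W))
    Hf = torch.fft.rfft2(h, s=(H, W))
    return torch.fft.irfft2(X * Hf, s=(H, W))


class Hyena2DSketch(torch.nn.Module):
    # Hypothetical 2-D analogue of one Hyena step: two elementwise gates
    # wrapped around a single FFT-based global convolution.
    def __init__(self, dim, height, width):
        super().__init__()
        self.proj_in = torch.nn.Conv2d(dim, 3 * dim, kernel_size=1)  # q, k, v branches
        # Explicit filter as a stand-in for Hyena's implicitly parameterized long filter.
        self.filt = torch.nn.Parameter(0.02 * torch.randn(dim, height, width))
        self.proj_out = torch.nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):                      # x: (B, dim, H, W)
        q, k, v = self.proj_in(x).chunk(3, dim=1)
        y = fft_conv2d(q * k, self.filt)       # gate, then global 2-D convolution
        return self.proj_out(v * y)            # second gate and output projection
```

In a ViT-style block, such an operator could stand in for the self-attention sub-layer on the 2-D grid of patch embeddings (e.g., for the first layers, with conventional attention layers afterwards, as in the hybrid approach the abstract mentions).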

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-zimerman24a,
  title     = {Multi-Dimensional {H}yena for Spatial Inductive Bias},
  author    = {Zimerman, Itamar and Wolf, Lior},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {973--981},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/zimerman24a/zimerman24a.pdf},
  url       = {https://proceedings.mlr.press/v238/zimerman24a.html},
  abstract  = {The advantage of Vision Transformers over CNNs is only fully manifested when trained over a large dataset, mainly due to the reduced inductive bias towards spatial locality within the transformer’s self-attention mechanism. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel generalization to multiple axes of the very recent Hyena layer. We propose several alternative approaches for obtaining this generalization and delve into their unique distinctions and considerations from both empirical and theoretical perspectives. The proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT across multiple datasets. Furthermore, in the small dataset regime, our Hyena-based ViT is favorable to ViT variants from the recent literature that are specifically designed for solving the same challenge. Finally, we show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures. Our code is attached as supplementary.}
}
Endnote
%0 Conference Paper
%T Multi-Dimensional Hyena for Spatial Inductive Bias
%A Itamar Zimerman
%A Lior Wolf
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-zimerman24a
%I PMLR
%P 973--981
%U https://proceedings.mlr.press/v238/zimerman24a.html
%V 238
%X The advantage of Vision Transformers over CNNs is only fully manifested when trained over a large dataset, mainly due to the reduced inductive bias towards spatial locality within the transformer’s self-attention mechanism. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel generalization to multiple axes of the very recent Hyena layer. We propose several alternative approaches for obtaining this generalization and delve into their unique distinctions and considerations from both empirical and theoretical perspectives. The proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT across multiple datasets. Furthermore, in the small dataset regime, our Hyena-based ViT is favorable to ViT variants from the recent literature that are specifically designed for solving the same challenge. Finally, we show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures. Our code is attached as supplementary.
APA
Zimerman, I. & Wolf, L. (2024). Multi-Dimensional Hyena for Spatial Inductive Bias. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:973-981. Available from https://proceedings.mlr.press/v238/zimerman24a.html.