Stochastic Layer-Wise Shuffle for Improving Vision Mamba Training

Zizheng Huang, Haoxing Chen, Jiaqi Li, Jun Lan, Huijia Zhu, Weiqiang Wang, Limin Wang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:25127-25140, 2025.

Abstract

Recent Vision Mamba (Vim) models exhibit nearly linear complexity in sequence length, making them highly attractive for processing visual data. However, their training methodologies and potential remain insufficiently explored. In this paper, we investigate training strategies for Vim and propose Stochastic Layer-Wise Shuffle (SLWS), a novel regularization method that effectively improves Vim training. Without architectural modifications, this approach enables the non-hierarchical Vim to achieve leading performance on ImageNet-1K compared with counterparts of the same type. Our method operates through four simple steps per layer: probability allocation to assign layer-dependent shuffle rates, operation sampling via Bernoulli trials, sequence shuffling of input tokens, and order restoration of outputs. SLWS distinguishes itself through three principles: (1) Plug-and-play: no architectural modifications are needed, and it is deactivated during inference. (2) Simple but effective: the four-step process introduces only random permutations and negligible overhead. (3) Intuitive design: shuffling probabilities grow linearly with layer depth, aligning with the hierarchical semantic abstraction in vision models. Our work underscores the importance of tailored training strategies for Vim models and provides a helpful way to explore their scalability. Code and models are available at https://github.com/huangzizheng01/ShuffleMamba
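To make the four-step procedure concrete, the following is a minimal PyTorch sketch of one SLWS-wrapped layer forward pass. The function name slws_layer_forward, the p_max hyperparameter, and the exact linear schedule p = p_max * (l+1)/L are illustrative assumptions inferred from the abstract, not the paper's precise specification; consult the linked repository for the authors' implementation.

import torch

def slws_layer_forward(layer, x, layer_idx, num_layers, p_max=0.5, training=True):
    """Stochastic Layer-Wise Shuffle around a single Vim layer.

    x: (B, N, D) token sequence. Sketch only: the schedule and
    p_max value are assumptions, not the paper's exact settings.
    """
    if not training:
        return layer(x)  # SLWS is deactivated during inference

    # 1) Probability allocation: shuffle rate grows linearly with depth.
    p = p_max * (layer_idx + 1) / num_layers

    # 2) Operation sampling: one Bernoulli trial decides whether to shuffle.
    if torch.rand(()) >= p:
        return layer(x)

    # 3) Sequence shuffling: apply a random permutation to the input tokens.
    n = x.size(1)
    perm = torch.randperm(n, device=x.device)
    out = layer(x[:, perm, :])

    # 4) Order restoration: invert the permutation on the layer's output.
    inv = torch.empty_like(perm)
    inv[perm] = torch.arange(n, device=x.device)
    return out[:, inv, :]

At inference (training=False) the wrapper reduces to a plain layer call, consistent with the plug-and-play principle, and the only training-time overhead is the random permutation and its inverse.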

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-huang25d,
  title     = {Stochastic Layer-Wise Shuffle for Improving Vision Mamba Training},
  author    = {Huang, Zizheng and Chen, Haoxing and Li, Jiaqi and Lan, Jun and Zhu, Huijia and Wang, Weiqiang and Wang, Limin},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {25127--25140},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/huang25d/huang25d.pdf},
  url       = {https://proceedings.mlr.press/v267/huang25d.html},
  abstract  = {Recent Vision Mamba (Vim) models exhibit nearly linear complexity in sequence length, making them highly attractive for processing visual data. However, the training methodologies and their potential are still not sufficiently explored. In this paper, we investigate strategies for Vim and propose Stochastic Layer-Wise Shuffle (SLWS), a novel regularization method that can effectively improve the Vim training. Without architectural modifications, this approach enables the non-hierarchical Vim to get leading performance on ImageNet-1K compared with the similar type counterparts. Our method operates through four simple steps per layer: probability allocation to assign layer-dependent shuffle rates, operation sampling via Bernoulli trials, sequence shuffling of input tokens, and order restoration of outputs. SLWS distinguishes itself through three principles: (1) Plug-and-play: No architectural modifications are needed, and it is deactivated during inference. (2) Simple but effective: The four-step process introduces only random permutations and negligible overhead. (3) Intuitive design: Shuffling probabilities grow linearly with layer depth, aligning with the hierarchical semantic abstraction in vision models. Our work underscores the importance of tailored training strategies for Vim models and provides a helpful way to explore their scalability. Code and models are available at https://github.com/huangzizheng01/ShuffleMamba}
}
Endnote
%0 Conference Paper
%T Stochastic Layer-Wise Shuffle for Improving Vision Mamba Training
%A Zizheng Huang
%A Haoxing Chen
%A Jiaqi Li
%A Jun Lan
%A Huijia Zhu
%A Weiqiang Wang
%A Limin Wang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-huang25d
%I PMLR
%P 25127--25140
%U https://proceedings.mlr.press/v267/huang25d.html
%V 267
%X Recent Vision Mamba (Vim) models exhibit nearly linear complexity in sequence length, making them highly attractive for processing visual data. However, the training methodologies and their potential are still not sufficiently explored. In this paper, we investigate strategies for Vim and propose Stochastic Layer-Wise Shuffle (SLWS), a novel regularization method that can effectively improve the Vim training. Without architectural modifications, this approach enables the non-hierarchical Vim to get leading performance on ImageNet-1K compared with the similar type counterparts. Our method operates through four simple steps per layer: probability allocation to assign layer-dependent shuffle rates, operation sampling via Bernoulli trials, sequence shuffling of input tokens, and order restoration of outputs. SLWS distinguishes itself through three principles: (1) Plug-and-play: No architectural modifications are needed, and it is deactivated during inference. (2) Simple but effective: The four-step process introduces only random permutations and negligible overhead. (3) Intuitive design: Shuffling probabilities grow linearly with layer depth, aligning with the hierarchical semantic abstraction in vision models. Our work underscores the importance of tailored training strategies for Vim models and provides a helpful way to explore their scalability. Code and models are available at https://github.com/huangzizheng01/ShuffleMamba
APA
Huang, Z., Chen, H., Li, J., Lan, J., Zhu, H., Wang, W. & Wang, L. (2025). Stochastic Layer-Wise Shuffle for Improving Vision Mamba Training. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:25127-25140. Available from https://proceedings.mlr.press/v267/huang25d.html.