How rotation invariant algorithms are fooled by noise on sparse targets

Manfred K. Warmuth, Wojciech Kotłowski, Matt Jones, Ehsan Amid
Proceedings of The 36th International Conference on Algorithmic Learning Theory, PMLR 272:1316-1360, 2025.

Abstract

It is well known that rotation invariant algorithms are sub-optimal for learning sparse linear problems, when the number of examples is below the input dimension. This includes any gradient descent trained neural net with a fully-connected input layer initialized with a rotationally symmetric distribution. The simplest sparse problem is learning a single feature out of d features. In that case the classification error or regression loss of rotation invariant algorithms grows with 1−n/d, where n is the number of examples seen. These lower bounds become vacuous when the number of examples n reaches the dimension d. After d examples, the gradient space has full rank and any weight vector can be expressed, including the unit vector that determines the target feature. In this work, we show that when noise is added to this sparse linear problem, rotation invariant algorithms are still sub-optimal after seeing d or more examples. We prove this via a lower bound for the Bayes optimal algorithm on a rotationally symmetrized problem. We then prove much better upper bounds on the same problem for a large variety of algorithms that are non-invariant by rotations. Finally, we analyze the gradient flow trajectories of many standard optimization algorithms (such as AdaGrad) on the same noisy feature learning problem, and show how they veer away from the noisy sparse targets. We then contrast them with a group of non-rotation invariant algorithms that veer towards the sparse targets. We believe that the lower bounds method and trajectory categorization will be crucial for analyzing other families of algorithms with different classes of invariances.
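As an illustrative sketch (not code from the paper) of the noiseless lower bound the abstract recalls: with n < d rotationally symmetric examples, a rotation invariant method can only produce weights in the span of the inputs, so the minimum-norm least-squares solution misses a 1-sparse target by roughly 1 − n/d in squared distance. The dimensions d and n below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 100                       # input dimension, number of examples (n < d)
target = np.zeros(d)
target[0] = 1.0                       # 1-sparse target: a single feature out of d

X = rng.standard_normal((n, d))       # rotationally symmetric inputs
y = X @ target                        # noiseless labels

# Minimum-norm solution of X w = y: lies in the row space of X,
# i.e. the span of the gradients seen so far.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

err = np.sum((w - target) ** 2)       # squared distance to the sparse target
print(f"squared error {err:.3f}  vs  1 - n/d = {1 - n / d:.3f}")
```

Since the minimum-norm solution is the projection of the target onto the row space of X, the residual concentrates around 1 − n/d; the lower bound becomes vacuous once n reaches d, which is exactly where the paper's noisy analysis takes over.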

Cite this Paper


BibTeX
@InProceedings{pmlr-v272-warmuth25a,
  title     = {How rotation invariant algorithms are fooled by noise on sparse targets},
  author    = {Warmuth, Manfred K. and Kot{\l}owski, Wojciech and Jones, Matt and Amid, Ehsan},
  booktitle = {Proceedings of The 36th International Conference on Algorithmic Learning Theory},
  pages     = {1316--1360},
  year      = {2025},
  editor    = {Kamath, Gautam and Loh, Po-Ling},
  volume    = {272},
  series    = {Proceedings of Machine Learning Research},
  month     = {24--27 Feb},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v272/main/assets/warmuth25a/warmuth25a.pdf},
  url       = {https://proceedings.mlr.press/v272/warmuth25a.html},
  abstract  = {It is well known that rotation invariant algorithms are sub-optimal for learning sparse linear problems, when the number of examples is below the input dimension. This includes any gradient descent trained neural net with a fully-connected input layer initialized with a rotationally symmetric distribution. The simplest sparse problem is learning a single feature out of d features. In that case the classification error or regression loss of rotation invariant algorithms grows with 1−n/d, where n is the number of examples seen. These lower bounds become vacuous when the number of examples n reaches the dimension d. After d examples, the gradient space has full rank and any weight vector can be expressed, including the unit vector that determines the target feature. In this work, we show that when noise is added to this sparse linear problem, rotation invariant algorithms are still sub-optimal after seeing d or more examples. We prove this via a lower bound for the Bayes optimal algorithm on a rotationally symmetrized problem. We then prove much better upper bounds on the same problem for a large variety of algorithms that are non-invariant by rotations. Finally, we analyze the gradient flow trajectories of many standard optimization algorithms (such as AdaGrad) on the same noisy feature learning problem, and show how they veer away from the noisy sparse targets. We then contrast them with a group of non-rotation invariant algorithms that veer towards the sparse targets. We believe that the lower bounds method and trajectory categorization will be crucial for analyzing other families of algorithms with different classes of invariances.}
}
Endnote
%0 Conference Paper
%T How rotation invariant algorithms are fooled by noise on sparse targets
%A Manfred K. Warmuth
%A Wojciech Kotłowski
%A Matt Jones
%A Ehsan Amid
%B Proceedings of The 36th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2025
%E Gautam Kamath
%E Po-Ling Loh
%F pmlr-v272-warmuth25a
%I PMLR
%P 1316--1360
%U https://proceedings.mlr.press/v272/warmuth25a.html
%V 272
%X It is well known that rotation invariant algorithms are sub-optimal for learning sparse linear problems, when the number of examples is below the input dimension. This includes any gradient descent trained neural net with a fully-connected input layer initialized with a rotationally symmetric distribution. The simplest sparse problem is learning a single feature out of d features. In that case the classification error or regression loss of rotation invariant algorithms grows with 1−n/d, where n is the number of examples seen. These lower bounds become vacuous when the number of examples n reaches the dimension d. After d examples, the gradient space has full rank and any weight vector can be expressed, including the unit vector that determines the target feature. In this work, we show that when noise is added to this sparse linear problem, rotation invariant algorithms are still sub-optimal after seeing d or more examples. We prove this via a lower bound for the Bayes optimal algorithm on a rotationally symmetrized problem. We then prove much better upper bounds on the same problem for a large variety of algorithms that are non-invariant by rotations. Finally, we analyze the gradient flow trajectories of many standard optimization algorithms (such as AdaGrad) on the same noisy feature learning problem, and show how they veer away from the noisy sparse targets. We then contrast them with a group of non-rotation invariant algorithms that veer towards the sparse targets. We believe that the lower bounds method and trajectory categorization will be crucial for analyzing other families of algorithms with different classes of invariances.
APA
Warmuth, M. K., Kotłowski, W., Jones, M., & Amid, E. (2025). How rotation invariant algorithms are fooled by noise on sparse targets. Proceedings of The 36th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 272:1316-1360. Available from https://proceedings.mlr.press/v272/warmuth25a.html.
