How catastrophic can catastrophic forgetting be in linear regression?

Itay Evron, Edward Moroshko, Rachel Ward, Nathan Srebro, Daniel Soudry
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:4028-4079, 2022.

Abstract

To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas – alternating projections and the Kaczmarz method. In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas. In particular, when $T$ tasks in $d$ dimensions are presented cyclically for $k$ iterations, we prove an upper bound of $T^2\min\{1/\sqrt{k},d/k\}$ on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the $T^2$ factor can be lifted when tasks are presented in a random ordering.
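To make the setting concrete, the following is a minimal numerical sketch of the problem the abstract describes; it is not the authors' code, and all parameter choices and names are illustrative. Each task is an underdetermined (overparameterized) linear regression problem, all tasks share a realizable solution, and training each task to zero loss starting from the current weights gives the minimum-change update, i.e., a projection onto that task's solution set. This is the alternating-projections / Kaczmarz view mentioned above. Forgetting at the end of a cycle is measured here as the average squared error over the tasks seen so far.

    import numpy as np

    rng = np.random.default_rng(0)

    d, T, n_per_task, k = 50, 5, 10, 20   # dimension, tasks, samples per task, cycles

    # Shared realizable ground truth; n_per_task < d makes each task overparameterized.
    w_star = rng.standard_normal(d)
    tasks = []
    for _ in range(T):
        X = rng.standard_normal((n_per_task, d))   # task-specific input distribution
        y = X @ w_star                             # noiseless, jointly realizable labels
        tasks.append((X, y))

    def project(w, X, y):
        """Minimum-change update: project w onto {v : X v = y}.
        This is what fitting the task to zero loss from w yields in the linear setting."""
        return w + np.linalg.pinv(X) @ (y - X @ w)

    def forgetting(w, seen_tasks):
        """Average squared error of w over previously seen tasks."""
        return np.mean([np.mean((X @ w - y) ** 2) for X, y in seen_tasks])

    w = np.zeros(d)
    for cycle in range(k):
        for t in range(T):                         # cyclic task ordering
            X, y = tasks[t]
            w = project(w, X, y)
        print(f"cycle {cycle + 1:2d}: forgetting = {forgetting(w, tasks):.3e}")

Replacing the inner loop's ordering with a random permutation (e.g., iterating over rng.permutation(T)) gives the random-ordering variant contrasted with the cyclic one in the abstract.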

Cite this Paper


BibTeX
@InProceedings{pmlr-v178-evron22a,
  title     = {How catastrophic can catastrophic forgetting be in linear regression?},
  author    = {Evron, Itay and Moroshko, Edward and Ward, Rachel and Srebro, Nathan and Soudry, Daniel},
  booktitle = {Proceedings of Thirty Fifth Conference on Learning Theory},
  pages     = {4028--4079},
  year      = {2022},
  editor    = {Loh, Po-Ling and Raginsky, Maxim},
  volume    = {178},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--05 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v178/evron22a/evron22a.pdf},
  url       = {https://proceedings.mlr.press/v178/evron22a.html}
}
Endnote
%0 Conference Paper
%T How catastrophic can catastrophic forgetting be in linear regression?
%A Itay Evron
%A Edward Moroshko
%A Rachel Ward
%A Nathan Srebro
%A Daniel Soudry
%B Proceedings of Thirty Fifth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2022
%E Po-Ling Loh
%E Maxim Raginsky
%F pmlr-v178-evron22a
%I PMLR
%P 4028--4079
%U https://proceedings.mlr.press/v178/evron22a.html
%V 178
APA
Evron, I., Moroshko, E., Ward, R., Srebro, N. & Soudry, D. (2022). How catastrophic can catastrophic forgetting be in linear regression?. Proceedings of Thirty Fifth Conference on Learning Theory, in Proceedings of Machine Learning Research 178:4028-4079. Available from https://proceedings.mlr.press/v178/evron22a.html.