Does Continual Learning Equally Forget All Parameters?

Haiyan Zhao; Tianyi Zhou; Guodong Long; Jing Jiang; Chengqi Zhang

Does Continual Learning Equally Forget All Parameters?

Haiyan Zhao, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:42280-42303, 2023.

Abstract

Distribution shift (e.g., task or domain shift) in continual learning (CL) usually results in catastrophic forgetting of previously learned knowledge. Although it can be alleviated by repeatedly replaying buffered data, the every-step replay is time-consuming. In this paper, we study which modules in neural networks are more prone to forgetting by investigating their training dynamics during CL. Our proposed metrics show that only a few modules are more task-specific and sensitive to task change, while others can be shared across tasks as common knowledge. Hence, we attribute forgetting mainly to the former and find that finetuning them only on a small buffer at the end of any CL method can bring non-trivial improvement. Due to the small number of finetuned parameters, such ”Forgetting Prioritized Finetuning (FPF)” is efficient in computation. We further propose a more efficient and simpler method that entirely removes the every-step replay and replaces them by only

$k$ -times of FPF periodically triggered during CL. Surprisingly, this ”

$k$ -FPF” performs comparably to FPF and outperforms the SOTA CL methods but significantly reduces their computational overhead and cost. In experiments on several benchmarks of class- and domain-incremental CL, FPF consistently improves existing CL methods by a large margin, and

$k$ -FPF further excels in efficiency without degrading the accuracy. We also empirically studied the impact of buffer size, epochs per task, and finetuning modules on the cost and accuracy of our methods.

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-zhao23n,
  title = 	 {Does Continual Learning Equally Forget All Parameters?},
  author =       {Zhao, Haiyan and Zhou, Tianyi and Long, Guodong and Jiang, Jing and Zhang, Chengqi},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {42280--42303},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/zhao23n/zhao23n.pdf},
  url = 	 {https://proceedings.mlr.press/v202/zhao23n.html},
  abstract = 	 {Distribution shift (e.g., task or domain shift) in continual learning (CL) usually results in catastrophic forgetting of previously learned knowledge. Although it can be alleviated by repeatedly replaying buffered data, the every-step replay is time-consuming. In this paper, we study which modules in neural networks are more prone to forgetting by investigating their training dynamics during CL. Our proposed metrics show that only a few modules are more task-specific and sensitive to task change, while others can be shared across tasks as common knowledge. Hence, we attribute forgetting mainly to the former and find that finetuning them only on a small buffer at the end of any CL method can bring non-trivial improvement. Due to the small number of finetuned parameters, such ”Forgetting Prioritized Finetuning (FPF)” is efficient in computation. We further propose a more efficient and simpler method that entirely removes the every-step replay and replaces them by only $k$-times of FPF periodically triggered during CL. Surprisingly, this ”$k$-FPF” performs comparably to FPF and outperforms the SOTA CL methods but significantly reduces their computational overhead and cost. In experiments on several benchmarks of class- and domain-incremental CL, FPF consistently improves existing CL methods by a large margin, and $k$-FPF further excels in efficiency without degrading the accuracy. We also empirically studied the impact of buffer size, epochs per task, and finetuning modules on the cost and accuracy of our methods.}
}

Endnote

%0 Conference Paper
%T Does Continual Learning Equally Forget All Parameters?
%A Haiyan Zhao
%A Tianyi Zhou
%A Guodong Long
%A Jing Jiang
%A Chengqi Zhang
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett	
%F pmlr-v202-zhao23n
%I PMLR
%P 42280--42303
%U https://proceedings.mlr.press/v202/zhao23n.html
%V 202
%X Distribution shift (e.g., task or domain shift) in continual learning (CL) usually results in catastrophic forgetting of previously learned knowledge. Although it can be alleviated by repeatedly replaying buffered data, the every-step replay is time-consuming. In this paper, we study which modules in neural networks are more prone to forgetting by investigating their training dynamics during CL. Our proposed metrics show that only a few modules are more task-specific and sensitive to task change, while others can be shared across tasks as common knowledge. Hence, we attribute forgetting mainly to the former and find that finetuning them only on a small buffer at the end of any CL method can bring non-trivial improvement. Due to the small number of finetuned parameters, such ”Forgetting Prioritized Finetuning (FPF)” is efficient in computation. We further propose a more efficient and simpler method that entirely removes the every-step replay and replaces them by only $k$-times of FPF periodically triggered during CL. Surprisingly, this ”$k$-FPF” performs comparably to FPF and outperforms the SOTA CL methods but significantly reduces their computational overhead and cost. In experiments on several benchmarks of class- and domain-incremental CL, FPF consistently improves existing CL methods by a large margin, and $k$-FPF further excels in efficiency without degrading the accuracy. We also empirically studied the impact of buffer size, epochs per task, and finetuning modules on the cost and accuracy of our methods.

APA


Zhao, H., Zhou, T., Long, G., Jiang, J. & Zhang, C.. (2023). Does Continual Learning Equally Forget All Parameters?. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:42280-42303 Available from https://proceedings.mlr.press/v202/zhao23n.html.

Does Continual Learning Equally Forget All Parameters?

Abstract

Cite this Paper

Related Material