Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning

Changsheng Wang, Yihua Zhang, Jinghan Jia, Parikshit Ram, Dennis Wei, Yuguang Yao, Soumyadeep Pal, Nathalie Baracaldo, Sijia Liu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:65464-65479, 2025.

Abstract

Machine unlearning presents a promising approach to mitigating privacy and safety concerns in large language models (LLMs) by enabling the selective removal of targeted data or knowledge while preserving model utility. However, existing unlearning methods remain over-sensitive to downstream fine-tuning, which can rapidly recover what is supposed to be unlearned information even when the fine-tuning task is entirely unrelated to the unlearning objective. To enhance robustness, we introduce the concept of ‘invariance’ into unlearning for the first time from the perspective of invariant risk minimization (IRM), a principle for environment-agnostic training. By leveraging IRM, we develop a new invariance-regularized LLM unlearning framework, termed invariant LLM unlearning (ILU). We show that the proposed invariance regularization, even using only a single fine-tuning dataset during ILU training, can enable unlearning robustness to generalize effectively across diverse and new fine-tuning tasks at test time. A task vector analysis is also provided to further elucidate the rationale behind ILU’s effectiveness. Extensive experiments on the WMDP benchmark, which focuses on removing an LLM’s hazardous knowledge generation capabilities, reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). Notably, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving the fine-tuning performance.
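To make the abstract's IRM ingredient concrete, the following is a minimal, hedged sketch of what an invariance-regularized unlearning objective can look like; it is not the paper's exact ILU formulation. It assumes an IRMv1-style gradient penalty on the loss of a single auxiliary fine-tuning set, added to generic forget/retain unlearning terms; the symbols ℓ_forget, ℓ_retain, ℓ_ft, D_ft, λ, and γ are illustrative placeholders.

% Hedged sketch only: generic unlearning terms plus an IRMv1-style invariance
% penalty evaluated on one auxiliary fine-tuning set D_ft (as the abstract notes,
% a single such dataset suffices during ILU training). Not the paper's exact objective.
\begin{equation}
\min_{\theta}\;
\ell_{\mathrm{forget}}(\theta;\mathcal{D}_{\mathrm{forget}})
\;+\;\lambda\,\ell_{\mathrm{retain}}(\theta;\mathcal{D}_{\mathrm{retain}})
\;+\;\gamma\,\Bigl\|\nabla_{w}\,\ell_{\mathrm{ft}}\bigl(w\cdot f_{\theta};\,\mathcal{D}_{\mathrm{ft}}\bigr)\Big|_{w=1}\Bigr\|_{2}^{2}
\end{equation}

Here f_θ denotes the model being unlearned and w is the IRMv1 dummy scalar applied to its outputs; the penalty favors parameters at which the fine-tuning loss is already stationary, which is one plausible reading of how invariance regularization makes the unlearned state resilient to later fine-tuning. The paper's actual formulation and task-vector analysis are given in the full text.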

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-wang25en,
  title     = {Invariance Makes {LLM} Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning},
  author    = {Wang, Changsheng and Zhang, Yihua and Jia, Jinghan and Ram, Parikshit and Wei, Dennis and Yao, Yuguang and Pal, Soumyadeep and Baracaldo, Nathalie and Liu, Sijia},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {65464--65479},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/wang25en/wang25en.pdf},
  url       = {https://proceedings.mlr.press/v267/wang25en.html},
  abstract  = {Machine unlearning presents a promising approach to mitigating privacy and safety concerns in large language models (LLMs) by enabling the selective removal of targeted data or knowledge while preserving model utility. However, existing unlearning methods remain over-sensitive to downstream fine-tuning, which can rapidly recover what is supposed to be unlearned information even when the fine-tuning task is entirely unrelated to the unlearning objective. To enhance robustness, we introduce the concept of ‘invariance’ into unlearning for the first time from the perspective of invariant risk minimization (IRM), a principle for environment-agnostic training. By leveraging IRM, we develop a new invariance-regularized LLM unlearning framework, termed invariant LLM unlearning (ILU). We show that the proposed invariance regularization, even using only a single fine-tuning dataset during ILU training, can enable unlearning robustness to generalize effectively across diverse and new fine-tuning tasks at test time. A task vector analysis is also provided to further elucidate the rationale behind ILU’s effectiveness. Extensive experiments on the WMDP benchmark, which focuses on removing an LLM’s hazardous knowledge generation capabilities, reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). Notably, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving the fine-tuning performance.}
}
Endnote
%0 Conference Paper
%T Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning
%A Changsheng Wang
%A Yihua Zhang
%A Jinghan Jia
%A Parikshit Ram
%A Dennis Wei
%A Yuguang Yao
%A Soumyadeep Pal
%A Nathalie Baracaldo
%A Sijia Liu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-wang25en
%I PMLR
%P 65464--65479
%U https://proceedings.mlr.press/v267/wang25en.html
%V 267
%X Machine unlearning presents a promising approach to mitigating privacy and safety concerns in large language models (LLMs) by enabling the selective removal of targeted data or knowledge while preserving model utility. However, existing unlearning methods remain over-sensitive to downstream fine-tuning, which can rapidly recover what is supposed to be unlearned information even when the fine-tuning task is entirely unrelated to the unlearning objective. To enhance robustness, we introduce the concept of ‘invariance’ into unlearning for the first time from the perspective of invariant risk minimization (IRM), a principle for environment-agnostic training. By leveraging IRM, we develop a new invariance-regularized LLM unlearning framework, termed invariant LLM unlearning (ILU). We show that the proposed invariance regularization, even using only a single fine-tuning dataset during ILU training, can enable unlearning robustness to generalize effectively across diverse and new fine-tuning tasks at test time. A task vector analysis is also provided to further elucidate the rationale behind ILU’s effectiveness. Extensive experiments on the WMDP benchmark, which focuses on removing an LLM’s hazardous knowledge generation capabilities, reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). Notably, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving the fine-tuning performance.
APA
Wang, C., Zhang, Y., Jia, J., Ram, P., Wei, D., Yao, Y., Pal, S., Baracaldo, N. & Liu, S. (2025). Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:65464-65479. Available from https://proceedings.mlr.press/v267/wang25en.html.
