Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning

Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:8172-8183, 2025.

Abstract

Harmful fine-tuning (HFT), performed directly on open-source LLMs or through Fine-tuning-as-a-Service, breaks safety alignment and poses significant threats. Existing methods aim to mitigate HFT risks by learning robust representations on alignment data or making harmful data unlearnable, but they treat each data sample equally, leaving data vulnerability patterns understudied. In this work, we reveal that certain subsets of alignment data are consistently more prone to forgetting during HFT across different fine-tuning tasks and exhibit lower robustness compared to other subsets. Inspired by these findings, we propose Vulnerability-Aware Alignment (VAA), which calculates data vulnerability, partitions data into "vulnerable" and "invulnerable" groups, and encourages balanced learning using a group distributionally robust optimization (Group DRO) framework. Specifically, VAA learns an adversarial sampler that samples examples from the currently underperforming group and then applies group-dependent adversarial perturbations to the data during training, aiming to encourage a balanced learning process across groups. Experiments across four fine-tuning tasks demonstrate that VAA significantly reduces harmful scores while preserving downstream task performance, outperforming state-of-the-art baselines.
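
To make the Group DRO sampling idea in the abstract concrete, the following is a minimal, hypothetical sketch of an online exponentiated-gradient group-weight update with group-proportional sampling. It is not the authors' released implementation; the toy group labels, the placeholder loss function, and the step size eta are illustrative assumptions, and the group-dependent adversarial perturbation step is only indicated by a comment.

# Sketch of a Group DRO-style adversarial sampler: keep weights over the
# "vulnerable" and "invulnerable" groups, upweight whichever group currently
# incurs higher loss, and sample training examples in proportion to those
# weights. All specifics below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Toy alignment set: each example carries a group id (0 = invulnerable, 1 = vulnerable).
examples = [{"id": i, "group": int(i % 3 == 0)} for i in range(30)]
groups = [[e for e in examples if e["group"] == g] for g in (0, 1)]

q = np.array([0.5, 0.5])   # sampler's distribution over the two groups
eta = 0.1                  # exponentiated-gradient step size (assumed)

def group_loss(example):
    """Placeholder for the per-example alignment loss; higher = worse."""
    return 1.0 + 0.5 * example["group"] + 0.1 * rng.standard_normal()

for step in range(100):
    # 1) Adversarial sampler: draw a group according to q, then an example from it.
    g = rng.choice(2, p=q)
    x = groups[g][rng.integers(len(groups[g]))]

    # 2) Evaluate the loss; in VAA a group-dependent adversarial perturbation
    #    would be applied before this evaluation.
    loss = group_loss(x)

    # 3) Exponentiated-gradient ascent on the group weights: the currently
    #    underperforming (higher-loss) group is sampled more often next time.
    q = q * np.exp(eta * np.eye(2)[g] * loss)
    q = q / q.sum()

print("final group sampling distribution:", q)

Because the higher-loss group receives a larger multiplicative update, the sampler concentrates on whichever group is currently being forgotten, which is the balancing behavior the abstract describes.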

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chen25w,
  title     = {Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning},
  author    = {Chen, Liang and Han, Xueting and Shen, Li and Bai, Jing and Wong, Kam-Fai},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {8172--8183},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chen25w/chen25w.pdf},
  url       = {https://proceedings.mlr.press/v267/chen25w.html}
}
Endnote
%0 Conference Paper
%T Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning
%A Liang Chen
%A Xueting Han
%A Li Shen
%A Jing Bai
%A Kam-Fai Wong
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chen25w
%I PMLR
%P 8172--8183
%U https://proceedings.mlr.press/v267/chen25w.html
%V 267
APA
Chen, L., Han, X., Shen, L., Bai, J. & Wong, K.-F. (2025). Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:8172-8183. Available from https://proceedings.mlr.press/v267/chen25w.html.
