Activation Space Interventions Can Be Transferred Between Large Language Models

Narmeen Fatimah Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amir Abdullah
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:47212-47286, 2025.

Abstract

The study of representation universality in AI models reveals growing convergence across domains, modalities, and architectures. However, the practical applications of representation universality remain largely unexplored. We bridge this gap by demonstrating that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the models’ outputs in a predictable way. Additionally, we propose a new task, corrupted capabilities, where models are fine-tuned to embed knowledge tied to a backdoor. This tests their ability to separate useful skills from backdoors, reflecting real-world challenges. Extensive experiments across Llama, Qwen and Gemma model families show that our method enables using smaller models to efficiently align larger ones. Furthermore, we demonstrate that autoencoder mappings between base and fine-tuned models can serve as reliable "lightweight safety switches", allowing dynamic toggling between model behaviors.
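To make the high-level description concrete, the following is a minimal sketch (not the authors' released code) of the core idea: collect paired activations from a source and a target model on the same prompts, learn an autoencoder-style mapping between their activation spaces, and push a steering vector found in the source model through that mapping before adding it to the target model's activations. All dimensions, hyperparameters, and variable names below are illustrative assumptions.

# Sketch: transfer a steering vector between models via a learned activation-space mapping.
# Assumes `acts_src` / `acts_tgt` are paired residual-stream activations gathered by
# running both models on the same prompts at a chosen layer (placeholders here).

import torch
import torch.nn as nn

d_src, d_tgt = 2048, 4096                      # hypothetical hidden sizes
acts_src = torch.randn(10_000, d_src)          # placeholder paired activations
acts_tgt = torch.randn(10_000, d_tgt)

# Small autoencoder-style mapper from the source space to the target space.
mapper = nn.Sequential(
    nn.Linear(d_src, d_tgt),
    nn.ReLU(),
    nn.Linear(d_tgt, d_tgt),
)
opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)

for step in range(1_000):
    idx = torch.randint(0, acts_src.shape[0], (256,))
    pred = mapper(acts_src[idx])
    loss = nn.functional.mse_loss(pred, acts_tgt[idx])   # match paired target activations
    opt.zero_grad()
    loss.backward()
    opt.step()

# A steering direction found in the source model (e.g., a refusal direction) is
# mapped into the target model's space and added to its residual stream via a hook.
steer_src = torch.randn(d_src)                 # placeholder source-model direction
steer_tgt = mapper(steer_src).detach()

def steering_hook(module, inputs, output, alpha=4.0):
    # Add the mapped steering direction to the target layer's output.
    return output + alpha * steer_tgt

In the same spirit, a mapping learned between a base model and its fine-tuned (e.g., backdoored) counterpart can be switched on or off at inference time, which is the sense in which the abstract calls such mappings "lightweight safety switches".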

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-oozeer25a,
  title = {Activation Space Interventions Can Be Transferred Between Large Language Models},
  author = {Oozeer, Narmeen Fatimah and Nathawani, Dhruv and Prakash, Nirmalendu and Lan, Michael and Harrasse, Abir and Abdullah, Amir},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {47212--47286},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/oozeer25a/oozeer25a.pdf},
  url = {https://proceedings.mlr.press/v267/oozeer25a.html},
  abstract = {The study of representation universality in AI models reveals growing convergence across domains, modalities, and architectures. However, the practical applications of representation universality remain largely unexplored. We bridge this gap by demonstrating that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the models’ outputs in a predictable way. Additionally, we propose a new task, corrupted capabilities, where models are fine-tuned to embed knowledge tied to a backdoor. This tests their ability to separate useful skills from backdoors, reflecting real-world challenges. Extensive experiments across Llama, Qwen and Gemma model families show that our method enables using smaller models to efficiently align larger ones. Furthermore, we demonstrate that autoencoder mappings between base and fine-tuned models can serve as reliable "lightweight safety switches", allowing dynamic toggling between model behaviors.}
}
Endnote
%0 Conference Paper
%T Activation Space Interventions Can Be Transferred Between Large Language Models
%A Narmeen Fatimah Oozeer
%A Dhruv Nathawani
%A Nirmalendu Prakash
%A Michael Lan
%A Abir Harrasse
%A Amir Abdullah
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-oozeer25a
%I PMLR
%P 47212--47286
%U https://proceedings.mlr.press/v267/oozeer25a.html
%V 267
%X The study of representation universality in AI models reveals growing convergence across domains, modalities, and architectures. However, the practical applications of representation universality remain largely unexplored. We bridge this gap by demonstrating that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the models’ outputs in a predictable way. Additionally, we propose a new task, corrupted capabilities, where models are fine-tuned to embed knowledge tied to a backdoor. This tests their ability to separate useful skills from backdoors, reflecting real-world challenges. Extensive experiments across Llama, Qwen and Gemma model families show that our method enables using smaller models to efficiently align larger ones. Furthermore, we demonstrate that autoencoder mappings between base and fine-tuned models can serve as reliable "lightweight safety switches", allowing dynamic toggling between model behaviors.
APA
Oozeer, N.F., Nathawani, D., Prakash, N., Lan, M., Harrasse, A. & Abdullah, A. (2025). Activation Space Interventions Can Be Transferred Between Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:47212-47286. Available from https://proceedings.mlr.press/v267/oozeer25a.html.