Resolving Lexical Bias in Model Editing

Hammad Rizwan, Domenic Rosati, Ga Wu, Hassan Sajjad
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:51747-51769, 2025.

Abstract

Model editing aims to modify the outputs of large language models after they are trained. Previous approaches have often involved direct alterations to model weights, which can result in model degradation. Recent techniques avoid making modifications to the model’s weights by using an adapter that applies edits to the model when triggered by semantic similarity in the representation space. We demonstrate that current adapter methods are critically vulnerable to strong lexical biases, leading to issues such as applying edits to irrelevant prompts with overlapping words. This paper presents a principled approach to learning a disentangled representation space that facilitates precise localization of edits by maintaining distance between irrelevant prompts while preserving proximity among paraphrases. In our empirical study, we show that our method (Projector Editor Networks for Model Editing - PENME) achieves state-of-the-art model editing results while being more computationally efficient during inference than previous methods and adaptable across different architectures.

Cite this Paper
BibTeX
@InProceedings{pmlr-v267-rizwan25a,
  title     = {Resolving Lexical Bias in Model Editing},
  author    = {Rizwan, Hammad and Rosati, Domenic and Wu, Ga and Sajjad, Hassan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {51747--51769},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/rizwan25a/rizwan25a.pdf},
  url       = {https://proceedings.mlr.press/v267/rizwan25a.html},
  abstract  = {Model editing aims to modify the outputs of large language models after they are trained. Previous approaches have often involved direct alterations to model weights, which can result in model degradation. Recent techniques avoid making modifications to the model’s weights by using an adapter that applies edits to the model when triggered by semantic similarity in the representation space. We demonstrate that current adapter methods are critically vulnerable to strong lexical biases, leading to issues such as applying edits to irrelevant prompts with overlapping words. This paper presents a principled approach to learning a disentangled representation space that facilitates precise localization of edits by maintaining distance between irrelevant prompts while preserving proximity among paraphrases. In our empirical study, we show that our method (Projector Editor Networks for Model Editing - PENME) achieves state-of-the-art model editing results while being more computationally efficient during inference than previous methods and adaptable across different architectures.}
}
Endnote
%0 Conference Paper
%T Resolving Lexical Bias in Model Editing
%A Hammad Rizwan
%A Domenic Rosati
%A Ga Wu
%A Hassan Sajjad
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-rizwan25a
%I PMLR
%P 51747--51769
%U https://proceedings.mlr.press/v267/rizwan25a.html
%V 267
%X Model editing aims to modify the outputs of large language models after they are trained. Previous approaches have often involved direct alterations to model weights, which can result in model degradation. Recent techniques avoid making modifications to the model’s weights by using an adapter that applies edits to the model when triggered by semantic similarity in the representation space. We demonstrate that current adapter methods are critically vulnerable to strong lexical biases, leading to issues such as applying edits to irrelevant prompts with overlapping words. This paper presents a principled approach to learning a disentangled representation space that facilitates precise localization of edits by maintaining distance between irrelevant prompts while preserving proximity among paraphrases. In our empirical study, we show that our method (Projector Editor Networks for Model Editing - PENME) achieves state-of-the-art model editing results while being more computationally efficient during inference than previous methods and adaptable across different architectures.
APA
Rizwan, H., Rosati, D., Wu, G. & Sajjad, H. (2025). Resolving Lexical Bias in Model Editing. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:51747-51769. Available from https://proceedings.mlr.press/v267/rizwan25a.html.