Towards the Anonymization of Masked Language Modeling
Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:985-1002, 2026.
Abstract
Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when pre-trained models that are fine-tuned and specialized on sensitive data memorize and later regurgitate personal information. This paper presents a privacy-preserving language modeling approach that addresses the anonymization of language models and thus promotes their sharing. Specifically, we propose a Masked Language Modeling (MLM) methodology for specializing a BERT-like language model while preventing it from memorizing direct and indirect identifying information present in the training data. We comprehensively evaluate our approach on several models using a medical dataset and a corpus of legal texts, and compare it to different baselines. Our results indicate that, by avoiding the memorization of both direct and indirect identifiers during model specialization, our masked language modeling schemes offer a good tradeoff between high privacy and high utility.