Towards the Anonymization of Masked Language Modeling

Antoine Boutet, Lucas Magnana, Juliette Senechal
Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:985-1002, 2026.

Abstract

Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when pre-trained models, fine-tuned and specialized on sensitive data, can memorize and then expose and regurgitate personal information. This paper presents a privacy-preserving language modeling approach to address the problem of anonymization of language models, and thus promote their sharing. Specifically, we propose a Masked Language Modeling (MLM) methodology to specialize a BERT-like language model that prevents the model from memorizing direct and indirect identifying information present in the training data. We comprehensively evaluated our approach on several models using a medical dataset and a corpus of legal texts, and compared it to different baselines. Our results indicate that by avoiding memorizing both direct and indirect identifiers during model specialization, our masked language modeling schemes offer a good tradeoff for maintaining high privacy while retaining high utility.

Cite this Paper


BibTeX
@InProceedings{pmlr-v333-boutet26a,
  title = {Towards the Anonymization of Masked Language Modeling},
  author = {Boutet, Antoine and Magnana, Lucas and Senechal, Juliette},
  booktitle = {Proceedings of the 7th Conference on Health, Inference, and Learning},
  pages = {985--1002},
  year = {2026},
  editor = {Healey, Elizabeth and Fries, Jason and Pollard, Tom and Tang, Shengpu and Zink, Anna and Hartvigsen, Tom and Agrawal, Monica and Finlayson, Sam and Glicksberg, Benjamin and Beaulieu-Jones, Brett and Wang, Kai and Fontalvo, Daseyra and Sarker, Tasmie and Chen, Irene and Alsentzer, Emily},
  volume = {333},
  series = {Proceedings of Machine Learning Research},
  month = {29--30 Jun},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v333/main/assets/boutet26a/boutet26a.pdf},
  url = {https://proceedings.mlr.press/v333/boutet26a.html},
  abstract = {Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when pre-trained models, fine-tuned and specialized on sensitive data, can memorize and then expose and regurgitate personal information. This paper presents a privacy-preserving language modeling approach to address the problem of anonymization of language models, and thus promote their sharing. Specifically, we propose a Masked Language Modeling (MLM) methodology to specialize a BERT-like language model that prevents the model from memorizing direct and indirect identifying information present in the training data. We comprehensively evaluated our approach on several models using a medical dataset and a corpus of legal texts, and compared it to different baselines. Our results indicate that by avoiding memorizing both direct and indirect identifiers during model specialization, our masked language modeling schemes offer a good tradeoff for maintaining high privacy while retaining high utility.}
}
Endnote
%0 Conference Paper
%T Towards the Anonymization of Masked Language Modeling
%A Antoine Boutet
%A Lucas Magnana
%A Juliette Senechal
%B Proceedings of the 7th Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Elizabeth Healey
%E Jason Fries
%E Tom Pollard
%E Shengpu Tang
%E Anna Zink
%E Tom Hartvigsen
%E Monica Agrawal
%E Sam Finlayson
%E Benjamin Glicksberg
%E Brett Beaulieu-Jones
%E Kai Wang
%E Daseyra Fontalvo
%E Tasmie Sarker
%E Irene Chen
%E Emily Alsentzer
%F pmlr-v333-boutet26a
%I PMLR
%P 985--1002
%U https://proceedings.mlr.press/v333/boutet26a.html
%V 333
%X Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when pre-trained models, fine-tuned and specialized on sensitive data, can memorize and then expose and regurgitate personal information. This paper presents a privacy-preserving language modeling approach to address the problem of anonymization of language models, and thus promote their sharing. Specifically, we propose a Masked Language Modeling (MLM) methodology to specialize a BERT-like language model that prevents the model from memorizing direct and indirect identifying information present in the training data. We comprehensively evaluated our approach on several models using a medical dataset and a corpus of legal texts, and compared it to different baselines. Our results indicate that by avoiding memorizing both direct and indirect identifiers during model specialization, our masked language modeling schemes offer a good tradeoff for maintaining high privacy while retaining high utility.
APA
Boutet, A., Magnana, L., & Senechal, J. (2026). Towards the Anonymization of Masked Language Modeling. Proceedings of the 7th Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 333:985-1002. Available from https://proceedings.mlr.press/v333/boutet26a.html.
