EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption

Leo De Castro, Daniel Escudero, Adya Agrawal, Antigoni Polychroniadou, Manuela Veloso
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:12677-12688, 2025.

Abstract

As large language models (LLMs) become more powerful, the computation required to run these models is increasingly outsourced to a third-party cloud. While this saves clients’ computation, it risks leaking the clients’ LLM queries to the cloud provider. Fully homomorphic encryption (FHE) presents a natural solution to this problem: simply encrypt the query and evaluate the LLM homomorphically on the cloud machine. The result remains encrypted and can only be learned by the client who holds the secret key. In this work, we present a GPU-accelerated implementation of FHE and use this implementation to benchmark an encrypted GPT-2 forward pass, with runtimes over $200\times$ faster than the CPU baseline. We also present novel and extensive experimental analysis of approximations of LLM activation functions to maintain accuracy while achieving this performance.
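
To make the activation-function approximation concrete: FHE schemes such as CKKS natively evaluate only additions and multiplications, so non-polynomial activations like GELU must be replaced with low-degree polynomials before encrypted evaluation. The sketch below is illustrative only, not the paper's specific construction; the interval [-6, 6] and degree 8 are arbitrary choices for the example, and it runs entirely in the clear to show how such an approximation can be fitted and checked.

```python
# Illustrative sketch (not the paper's method): fit a low-degree polynomial
# to GELU, the kind of replacement an FHE evaluation of a transformer needs,
# since CKKS-style schemes only support additions and multiplications.
import numpy as np

def gelu(x):
    # tanh-based GELU approximation used in GPT-2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Fit a degree-8 polynomial on a bounded interval; under FHE, inputs must be
# scaled into such an interval before the polynomial is evaluated encrypted.
xs = np.linspace(-6.0, 6.0, 4001)
coeffs = np.polyfit(xs, gelu(xs), deg=8)
poly_gelu = np.poly1d(coeffs)

# Measure the worst-case approximation error over the fitting interval.
max_err = np.max(np.abs(poly_gelu(xs) - gelu(xs)))
print(f"max |poly - GELU| on [-6, 6]: {max_err:.4f}")
```

The trade-off the paper's experimental analysis addresses is visible even in this toy setting: a higher polynomial degree reduces approximation error but increases the multiplicative depth of the encrypted circuit, and hence the runtime.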

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-de-castro25a,
  title     = {{E}ncrypted{LLM}: Privacy-Preserving Large Language Model Inference via {GPU}-Accelerated Fully Homomorphic Encryption},
  author    = {De Castro, Leo and Escudero, Daniel and Agrawal, Adya and Polychroniadou, Antigoni and Veloso, Manuela},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {12677--12688},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/de-castro25a/de-castro25a.pdf},
  url       = {https://proceedings.mlr.press/v267/de-castro25a.html},
  abstract  = {As large language models (LLMs) become more powerful, the computation required to run these models is increasingly outsourced to a third-party cloud. While this saves clients’ computation, it risks leaking the clients’ LLM queries to the cloud provider. Fully homomorphic encryption (FHE) presents a natural solution to this problem: simply encrypt the query and evaluate the LLM homomorphically on the cloud machine. The result remains encrypted and can only be learned by the client who holds the secret key. In this work, we present a GPU-accelerated implementation of FHE and use this implementation to benchmark an encrypted GPT-2 forward pass, with runtimes over $200\times$ faster than the CPU baseline. We also present novel and extensive experimental analysis of approximations of LLM activation functions to maintain accuracy while achieving this performance.}
}
Endnote
%0 Conference Paper
%T EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption
%A Leo De Castro
%A Daniel Escudero
%A Adya Agrawal
%A Antigoni Polychroniadou
%A Manuela Veloso
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-de-castro25a
%I PMLR
%P 12677--12688
%U https://proceedings.mlr.press/v267/de-castro25a.html
%V 267
%X As large language models (LLMs) become more powerful, the computation required to run these models is increasingly outsourced to a third-party cloud. While this saves clients’ computation, it risks leaking the clients’ LLM queries to the cloud provider. Fully homomorphic encryption (FHE) presents a natural solution to this problem: simply encrypt the query and evaluate the LLM homomorphically on the cloud machine. The result remains encrypted and can only be learned by the client who holds the secret key. In this work, we present a GPU-accelerated implementation of FHE and use this implementation to benchmark an encrypted GPT-2 forward pass, with runtimes over $200\times$ faster than the CPU baseline. We also present novel and extensive experimental analysis of approximations of LLM activation functions to maintain accuracy while achieving this performance.
APA
De Castro, L., Escudero, D., Agrawal, A., Polychroniadou, A. & Veloso, M. (2025). EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:12677-12688. Available from https://proceedings.mlr.press/v267/de-castro25a.html.