Cell2Sentence: Teaching Large Language Models the Language of Biology

Daniel Levine, Syed A Rizvi, Sacha Lévy, Nazreen Pallikkavaliyaveetil, David Zhang, Xingyu Chen, Sina Ghadermarzi, Ruiming Wu, Zihe Zheng, Ivan Vrkic, Anna Zhong, Daphne Raskin, Insu Han, Antonio Henrique De Oliveira Fonseca, Josue Ortega Caro, Amin Karbasi, Rahul Madhav Dhodapkar, David Van Dijk
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:27299-27325, 2024.

Abstract

We introduce Cell2Sentence (C2S), a novel method to directly adapt large language models to a biological context, specifically single-cell transcriptomics. By transforming gene expression data into "cell sentences," C2S bridges the gap between natural language processing and biology. We demonstrate cell sentences enable the fine-tuning of language models for diverse tasks in biology, including cell generation, complex cell-type annotation, and direct data-driven text generation. Our experiments reveal that GPT-2, when fine-tuned with C2S, can generate biologically valid cells based on cell type inputs, and accurately predict cell types from cell sentences. This illustrates that language models, through C2S fine-tuning, can acquire a significant understanding of single-cell biology while maintaining robust text generation capabilities. C2S offers a flexible, accessible framework to integrate natural language processing with transcriptomics, utilizing existing models and libraries for a wide range of biological applications.
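The abstract does not spell out how a "cell sentence" is built. The sketch below assumes the transformation is a rank ordering: gene names are listed from highest to lowest expression and genes with zero counts are dropped. The function name `cell_to_sentence`, the `top_k` parameter, and the toy gene counts are hypothetical and serve only as illustration; this is not the authors' implementation.

```python
def cell_to_sentence(expression, gene_names, top_k=None):
    """Order gene names by descending expression and join them with spaces.

    Assumption: a cell sentence is a rank-ordered list of expressed genes.
    Genes with zero expression are omitted; ties keep their input order.
    """
    if len(expression) != len(gene_names):
        raise ValueError("expression and gene_names must have the same length")
    # sorted() is stable, so equally expressed genes stay in input order.
    order = sorted(range(len(expression)), key=lambda i: -expression[i])
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    if top_k is not None:
        ranked = ranked[:top_k]  # optionally keep only the top-ranked genes
    return " ".join(ranked)


# Toy example with made-up counts for four genes in one cell.
genes = ["CD3E", "MS4A1", "LYZ", "NKG7"]
counts = [5, 0, 12, 2]
sentence = cell_to_sentence(counts, genes)
print(sentence)
```

A string of this kind can be tokenized like ordinary text, which is what would allow an existing language model such as GPT-2 to be fine-tuned on it without architectural changes.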

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-levine24a,
  title     = {{C}ell2{S}entence: Teaching Large Language Models the Language of Biology},
  author    = {Levine, Daniel and Rizvi, Syed A and L\'{e}vy, Sacha and Pallikkavaliyaveetil, Nazreen and Zhang, David and Chen, Xingyu and Ghadermarzi, Sina and Wu, Ruiming and Zheng, Zihe and Vrkic, Ivan and Zhong, Anna and Raskin, Daphne and Han, Insu and De Oliveira Fonseca, Antonio Henrique and Ortega Caro, Josue and Karbasi, Amin and Dhodapkar, Rahul Madhav and Dijk, David Van},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {27299--27325},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/levine24a/levine24a.pdf},
  url       = {https://proceedings.mlr.press/v235/levine24a.html},
  abstract  = {We introduce Cell2Sentence (C2S), a novel method to directly adapt large language models to a biological context, specifically single-cell transcriptomics. By transforming gene expression data into "cell sentences," C2S bridges the gap between natural language processing and biology. We demonstrate cell sentences enable the fine-tuning of language models for diverse tasks in biology, including cell generation, complex cell-type annotation, and direct data-driven text generation. Our experiments reveal that GPT-2, when fine-tuned with C2S, can generate biologically valid cells based on cell type inputs, and accurately predict cell types from cell sentences. This illustrates that language models, through C2S fine-tuning, can acquire a significant understanding of single-cell biology while maintaining robust text generation capabilities. C2S offers a flexible, accessible framework to integrate natural language processing with transcriptomics, utilizing existing models and libraries for a wide range of biological applications.}
}
Endnote
%0 Conference Paper
%T Cell2Sentence: Teaching Large Language Models the Language of Biology
%A Daniel Levine
%A Syed A Rizvi
%A Sacha Lévy
%A Nazreen Pallikkavaliyaveetil
%A David Zhang
%A Xingyu Chen
%A Sina Ghadermarzi
%A Ruiming Wu
%A Zihe Zheng
%A Ivan Vrkic
%A Anna Zhong
%A Daphne Raskin
%A Insu Han
%A Antonio Henrique De Oliveira Fonseca
%A Josue Ortega Caro
%A Amin Karbasi
%A Rahul Madhav Dhodapkar
%A David Van Dijk
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-levine24a
%I PMLR
%P 27299--27325
%U https://proceedings.mlr.press/v235/levine24a.html
%V 235
%X We introduce Cell2Sentence (C2S), a novel method to directly adapt large language models to a biological context, specifically single-cell transcriptomics. By transforming gene expression data into "cell sentences," C2S bridges the gap between natural language processing and biology. We demonstrate cell sentences enable the fine-tuning of language models for diverse tasks in biology, including cell generation, complex cell-type annotation, and direct data-driven text generation. Our experiments reveal that GPT-2, when fine-tuned with C2S, can generate biologically valid cells based on cell type inputs, and accurately predict cell types from cell sentences. This illustrates that language models, through C2S fine-tuning, can acquire a significant understanding of single-cell biology while maintaining robust text generation capabilities. C2S offers a flexible, accessible framework to integrate natural language processing with transcriptomics, utilizing existing models and libraries for a wide range of biological applications.
APA
Levine, D., Rizvi, S.A., Lévy, S., Pallikkavaliyaveetil, N., Zhang, D., Chen, X., Ghadermarzi, S., Wu, R., Zheng, Z., Vrkic, I., Zhong, A., Raskin, D., Han, I., De Oliveira Fonseca, A.H., Ortega Caro, J., Karbasi, A., Dhodapkar, R.M. & Dijk, D.V. (2024). Cell2Sentence: Teaching Large Language Models the Language of Biology. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:27299-27325. Available from https://proceedings.mlr.press/v235/levine24a.html.
