BulkRNABert: Cancer prognosis from bulk RNA-seq based language models

Maxence Gélard, Guillaume Richard, Thomas Pierrot, Paul-Henry Cournède
Proceedings of the 4th Machine Learning for Health Symposium, PMLR 259:384-400, 2025.

Abstract

RNA sequencing (RNA-seq) has become a key technology in precision medicine, especially for cancer prognosis. However, the high dimensionality of such data may restrict classic statistical methods, raising the need to learn dense representations from them. Transformer models have proven capable of providing representations for long sequences and are thus well suited to transcriptomics data. In this paper, we develop a pre-trained transformer-based language model through self-supervised learning using bulk RNA-seq from both non-cancer and cancer tissues, following BERT's masking method. By probing the model's learned embeddings or using parameter-efficient fine-tuning, we then build downstream models for cancer-type classification and survival-time prediction. Leveraging the TCGA dataset, we demonstrate the performance of our method, BulkRNABert, on both tasks, with significant improvement over state-of-the-art methods in the pan-cancer setting for both classification and survival analysis. We also show the model's transfer-learning capabilities in the survival-analysis setting on unseen cohorts.
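
To make the BERT-style masking objective concrete, below is a minimal sketch in PyTorch, assuming expression values have already been discretized into bins so that each gene becomes one token. All names and hyperparameters here (BulkRNAEncoder, N_GENES, N_BINS, MASK_RATE, the layer sizes) are illustrative placeholders, not the paper's actual architecture or configuration.

    # Minimal sketch of BERT-style masked pretraining on bulk RNA-seq,
    # assuming expression values are discretized into bin ids (tokens).
    # Names and hyperparameters are illustrative, not the paper's.
    import torch
    import torch.nn as nn

    N_GENES, N_BINS, D_MODEL, MASK_RATE = 256, 64, 128, 0.15
    MASK_ID = N_BINS  # extra id reserved for the [MASK] token

    class BulkRNAEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.tok = nn.Embedding(N_BINS + 1, D_MODEL)  # bin tokens + [MASK]
            self.pos = nn.Embedding(N_GENES, D_MODEL)     # one position per gene
            layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(D_MODEL, N_BINS)        # predict the original bin

        def forward(self, ids):
            pos = torch.arange(N_GENES, device=ids.device)
            h = self.encoder(self.tok(ids) + self.pos(pos))
            return self.head(h)

    def masked_lm_step(model, ids, loss_fn=nn.CrossEntropyLoss()):
        # Randomly mask a fraction of genes and score only those positions.
        mask = torch.rand_like(ids, dtype=torch.float) < MASK_RATE
        corrupted = ids.masked_fill(mask, MASK_ID)
        logits = model(corrupted)
        return loss_fn(logits[mask], ids[mask])

    model = BulkRNAEncoder()
    batch = torch.randint(0, N_BINS, (8, N_GENES))  # fake binned expression profiles
    loss = masked_lm_step(model, batch)
    loss.backward()

After pretraining with such an objective, the per-gene hidden states (or a pooled version of them) can be probed or fine-tuned for the downstream classification and survival tasks described in the abstract.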

Cite this Paper


BibTeX
@InProceedings{pmlr-v259-gelard25a,
  title     = {BulkRNABert: Cancer prognosis from bulk RNA-seq based language models},
  author    = {G{\'{e}}lard, Maxence and Richard, Guillaume and Pierrot, Thomas and Courn{\`{e}}de, Paul-Henry},
  booktitle = {Proceedings of the 4th Machine Learning for Health Symposium},
  pages     = {384--400},
  year      = {2025},
  editor    = {Hegselmann, Stefan and Zhou, Helen and Healey, Elizabeth and Chang, Trenton and Ellington, Caleb and Mhasawade, Vishwali and Tonekaboni, Sana and Argaw, Peniel and Zhang, Haoran},
  volume    = {259},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--16 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v259/main/assets/gelard25a/gelard25a.pdf},
  url       = {https://proceedings.mlr.press/v259/gelard25a.html},
  abstract  = {RNA sequencing (RNA-seq) has become a key technology in precision medicine, especially for cancer prognosis. However, the high dimensionality of such data may restrict classic statistical methods, raising the need to learn dense representations from them. Transformer models have proven capable of providing representations for long sequences and are thus well suited to transcriptomics data. In this paper, we develop a pre-trained transformer-based language model through self-supervised learning using bulk RNA-seq from both non-cancer and cancer tissues, following BERT's masking method. By probing the model's learned embeddings or using parameter-efficient fine-tuning, we then build downstream models for cancer-type classification and survival-time prediction. Leveraging the TCGA dataset, we demonstrate the performance of our method, BulkRNABert, on both tasks, with significant improvement over state-of-the-art methods in the pan-cancer setting for both classification and survival analysis. We also show the model's transfer-learning capabilities in the survival-analysis setting on unseen cohorts.}
}
Endnote
%0 Conference Paper
%T BulkRNABert: Cancer prognosis from bulk RNA-seq based language models
%A Maxence Gélard
%A Guillaume Richard
%A Thomas Pierrot
%A Paul-Henry Cournède
%B Proceedings of the 4th Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2025
%E Stefan Hegselmann
%E Helen Zhou
%E Elizabeth Healey
%E Trenton Chang
%E Caleb Ellington
%E Vishwali Mhasawade
%E Sana Tonekaboni
%E Peniel Argaw
%E Haoran Zhang
%F pmlr-v259-gelard25a
%I PMLR
%P 384--400
%U https://proceedings.mlr.press/v259/gelard25a.html
%V 259
%X RNA sequencing (RNA-seq) has become a key technology in precision medicine, especially for cancer prognosis. However, the high dimensionality of such data may restrict classic statistical methods, raising the need to learn dense representations from them. Transformer models have proven capable of providing representations for long sequences and are thus well suited to transcriptomics data. In this paper, we develop a pre-trained transformer-based language model through self-supervised learning using bulk RNA-seq from both non-cancer and cancer tissues, following BERT's masking method. By probing the model's learned embeddings or using parameter-efficient fine-tuning, we then build downstream models for cancer-type classification and survival-time prediction. Leveraging the TCGA dataset, we demonstrate the performance of our method, BulkRNABert, on both tasks, with significant improvement over state-of-the-art methods in the pan-cancer setting for both classification and survival analysis. We also show the model's transfer-learning capabilities in the survival-analysis setting on unseen cohorts.
APA
Gélard, M., Richard, G., Pierrot, T. & Cournède, P.-H. (2025). BulkRNABert: Cancer prognosis from bulk RNA-seq based language models. Proceedings of the 4th Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 259:384-400. Available from https://proceedings.mlr.press/v259/gelard25a.html.