Modelling Technical and Biological Effects in scRNA-seq data with Scalable GPLVMs
Proceedings of the 17th Machine Learning in Computational Biology meeting, PMLR 200:46-60, 2022.
Abstract
Single-cell RNA-seq datasets are growing in size and complexity, enabling the study of cellular composition changes in various biological and clinical contexts. Scalable dimensionality reduction techniques are needed to disentangle the biological variation in these datasets while accounting for technical and biological confounders. In this work, we extend a popular approach for probabilistic non-linear dimensionality reduction, the Gaussian process latent variable model, to scale to massive single-cell datasets while explicitly accounting for technical and biological confounders. The key idea is to use an augmented kernel that preserves the factorisability of the lower bound, allowing for fast stochastic variational inference. We demonstrate its ability to reconstruct previously described latent signatures of innate immunity with a 9x speed-up in training time. We further analyse a dataset of blood cells from COVID-19 patients and demonstrate that this framework captures interpretable signatures of infection while integrating data across individuals and technical batches. Specifically, we explore COVID-19 severity as a latent dimension to refine patient stratification and capture disease-specific gene expression signatures.
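To make the augmented-kernel idea concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes an additive form in which an RBF kernel acts on the latent coordinates and a linear kernel acts on known covariates (here a one-hot batch label); the abstract does not specify the actual kernel composition, and all function names, hyperparameters, and dimensions below are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def linear_kernel(a, b, variance=1.0):
    """Linear kernel between the rows of a and the rows of b."""
    return variance * a @ b.T

def augmented_kernel(z_a, c_a, z_b, c_b):
    """Assumed additive kernel on the augmented input [z, c].

    z_*: latent coordinates (biological variation to be learned).
    c_*: known covariates such as a one-hot technical batch label.
    The additive form is an illustrative assumption only.
    """
    return rbf_kernel(z_a, z_b) + linear_kernel(c_a, c_b)

# Toy data: 6 cells, 2 latent dimensions, 3 technical batches (all hypothetical).
rng = np.random.default_rng(0)
n_cells, latent_dim, n_batches = 6, 2, 3
z = rng.normal(size=(n_cells, latent_dim))
batch = rng.integers(0, n_batches, size=n_cells)
c = np.eye(n_batches)[batch]  # one-hot encoding of the batch label

K = augmented_kernel(z, c, z, c)
```

Because the covariates enter only through the kernel input, each cell still contributes its own term to the objective, which is the property that makes mini-batch stochastic optimisation possible in this kind of model.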