Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra

Alan Nawzad Amin, Andres Potapczynski, Andrew Gordon Wilson
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:1401-1418, 2025.

Abstract

To understand how genetic variants in human genomes manifest in phenotypes - traits like height or diseases like asthma - geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound proteins. As more data and features become available, one might expect predictive models to improve. Unfortunately, training these models is bottlenecked by the need to solve expensive linear algebra problems: because variants in the genome are correlated with nearby variants, training requires inverting large matrices. Previous methods have therefore been restricted to fitting small models, and to fitting simplified summary statistics rather than the full likelihood of the statistical model. In this paper, we leverage modern fast linear algebra techniques to develop DeepWAS (Deep genome Wide Association Studies), a method to train large and flexible neural network predictive models by optimizing the likelihood. Surprisingly, we find that larger models only improve performance when using our full likelihood approach; when trained by fitting traditional summary statistics, larger models perform no better than small ones. We find that larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.
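The bottleneck described above arises from Gaussian likelihoods whose covariance is built from the linkage-disequilibrium (LD) correlation structure among variants; evaluating them naively requires inverting matrices with one row per variant. The sketch below is a minimal, hypothetical illustration of the matrix-free alternative, not the DeepWAS implementation: assuming a covariance of the form Sigma = R diag(tau) R + R, with R an LD-like matrix and tau per-variant variances predicted from annotations, it evaluates the quadratic term z^T Sigma^{-1} z with conjugate gradients using only matrix-vector products. The covariance form, the softplus link, and all variable names are assumptions made purely for illustration.

# Minimal, hypothetical sketch (not the authors' code or the paper's exact model):
# it only illustrates the kind of matrix-free linear algebra the abstract refers to.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n_variants, n_features = 1000, 50

# Assumed inputs: functional annotations per variant and observed association z-scores.
X = rng.normal(size=(n_variants, n_features))
z = rng.normal(size=n_variants)

# A symmetric positive-definite stand-in for an LD (correlation) matrix; real LD
# matrices are far larger, which is why explicit inversion becomes the bottleneck.
A = rng.normal(size=(n_variants, n_variants)) / np.sqrt(n_variants)
R = A @ A.T + np.eye(n_variants)

# A toy predictor mapping annotations to per-variant effect-size variances;
# in DeepWAS this role is played by a neural network.
w = 0.1 * rng.normal(size=n_features)
tau = np.log1p(np.exp(X @ w))  # softplus keeps variances positive

def sigma_matvec(v):
    # Matrix-vector product with the assumed covariance Sigma = R diag(tau) R + R,
    # computed without ever forming Sigma or its inverse explicitly.
    return R @ (tau * (R @ v)) + R @ v

Sigma = LinearOperator((n_variants, n_variants), matvec=sigma_matvec)

# Conjugate gradients solves Sigma x = z using matrix-vector products only,
# replacing an O(n^3) inversion with a modest number of O(n^2) (or cheaper) products.
x, info = cg(Sigma, z, maxiter=1000)
assert info == 0, "CG did not converge"

quad_term = z @ x  # z^T Sigma^{-1} z, the quadratic term of a Gaussian log-likelihood
print(f"quadratic term of the log-likelihood: {quad_term:.3f}")

A full likelihood evaluation would also need the log-determinant of Sigma, which fast-linear-algebra methods typically estimate with stochastic, matrix-free techniques rather than a dense factorization; that piece is omitted here to keep the sketch short.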

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-amin25a,
  title     = {Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra},
  author    = {Amin, Alan Nawzad and Potapczynski, Andres and Wilson, Andrew Gordon},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {1401--1418},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/amin25a/amin25a.pdf},
  url       = {https://proceedings.mlr.press/v267/amin25a.html},
  abstract  = {To understand how genetic variants in human genomes manifest in phenotypes - traits like height or diseases like asthma - geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound proteins. As more data and features become available, one might expect predictive models to improve. Unfortunately, training these models is bottlenecked by the need to solve expensive linear algebra problems because variants in the genome are correlated with nearby variants, requiring inversion of large matrices. Previous methods have therefore been restricted to fitting small models, and fitting simplified summary statistics, rather than the full likelihood of the statistical model. In this paper, we leverage modern fast linear algebra techniques to develop DeepWAS (Deep genome Wide Association Studies), a method to train large and flexible neural network predictive models to optimize likelihood. Surprisingly, we find that larger models only improve performance when using our full likelihood approach; when trained by fitting traditional summary statistics, larger models perform no better than small ones. We find larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.}
}
Endnote
%0 Conference Paper
%T Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra
%A Alan Nawzad Amin
%A Andres Potapczynski
%A Andrew Gordon Wilson
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-amin25a
%I PMLR
%P 1401--1418
%U https://proceedings.mlr.press/v267/amin25a.html
%V 267
%X To understand how genetic variants in human genomes manifest in phenotypes - traits like height or diseases like asthma - geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound proteins. As more data and features become available, one might expect predictive models to improve. Unfortunately, training these models is bottlenecked by the need to solve expensive linear algebra problems because variants in the genome are correlated with nearby variants, requiring inversion of large matrices. Previous methods have therefore been restricted to fitting small models, and fitting simplified summary statistics, rather than the full likelihood of the statistical model. In this paper, we leverage modern fast linear algebra techniques to develop DeepWAS (Deep genome Wide Association Studies), a method to train large and flexible neural network predictive models to optimize likelihood. Surprisingly, we find that larger models only improve performance when using our full likelihood approach; when trained by fitting traditional summary statistics, larger models perform no better than small ones. We find larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.
APA
Amin, A.N., Potapczynski, A. & Wilson, A.G. (2025). Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:1401-1418. Available from https://proceedings.mlr.press/v267/amin25a.html.
