A Structured Observation Distribution for Generative Biological Sequence Prediction and Forecasting

Eli N Weinstein, Debora Marks
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11068-11079, 2021.

Abstract

Generative probabilistic modeling of biological sequences has widespread existing and potential application across biology and biomedicine, from evolutionary biology to epidemiology to protein design. Many standard sequence analysis methods preprocess data using a multiple sequence alignment (MSA) algorithm, one of the most widely used computational methods in all of science. However, as we show in this article, training generative probabilistic models with MSA preprocessing leads to statistical pathologies in the context of sequence prediction and forecasting. To address these problems, we propose a principled drop-in alternative to MSA preprocessing in the form of a structured observation distribution (the "MuE" distribution). We prove theoretically that the MuE distribution comprehensively generalizes popular methods for inferring biological sequence alignments, and provide a precise characterization of how such biological models have differed from natural language latent alignment models. We show empirically that models that use the MuE as an observation distribution outperform comparable methods across a variety of datasets, and apply MuE models to a novel problem for generative probabilistic sequence models: forecasting pathogen evolution.

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-weinstein21a,
  title     = {A Structured Observation Distribution for Generative Biological Sequence Prediction and Forecasting},
  author    = {Weinstein, Eli N and Marks, Debora},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {11068--11079},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/weinstein21a/weinstein21a.pdf},
  url       = {https://proceedings.mlr.press/v139/weinstein21a.html},
  abstract  = {Generative probabilistic modeling of biological sequences has widespread existing and potential application across biology and biomedicine, from evolutionary biology to epidemiology to protein design. Many standard sequence analysis methods preprocess data using a multiple sequence alignment (MSA) algorithm, one of the most widely used computational methods in all of science. However, as we show in this article, training generative probabilistic models with MSA preprocessing leads to statistical pathologies in the context of sequence prediction and forecasting. To address these problems, we propose a principled drop-in alternative to MSA preprocessing in the form of a structured observation distribution (the "MuE" distribution). We prove theoretically that the MuE distribution comprehensively generalizes popular methods for inferring biological sequence alignments, and provide a precise characterization of how such biological models have differed from natural language latent alignment models. We show empirically that models that use the MuE as an observation distribution outperform comparable methods across a variety of datasets, and apply MuE models to a novel problem for generative probabilistic sequence models: forecasting pathogen evolution.}
}
Endnote
%0 Conference Paper
%T A Structured Observation Distribution for Generative Biological Sequence Prediction and Forecasting
%A Eli N Weinstein
%A Debora Marks
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-weinstein21a
%I PMLR
%P 11068--11079
%U https://proceedings.mlr.press/v139/weinstein21a.html
%V 139
%X Generative probabilistic modeling of biological sequences has widespread existing and potential application across biology and biomedicine, from evolutionary biology to epidemiology to protein design. Many standard sequence analysis methods preprocess data using a multiple sequence alignment (MSA) algorithm, one of the most widely used computational methods in all of science. However, as we show in this article, training generative probabilistic models with MSA preprocessing leads to statistical pathologies in the context of sequence prediction and forecasting. To address these problems, we propose a principled drop-in alternative to MSA preprocessing in the form of a structured observation distribution (the "MuE" distribution). We prove theoretically that the MuE distribution comprehensively generalizes popular methods for inferring biological sequence alignments, and provide a precise characterization of how such biological models have differed from natural language latent alignment models. We show empirically that models that use the MuE as an observation distribution outperform comparable methods across a variety of datasets, and apply MuE models to a novel problem for generative probabilistic sequence models: forecasting pathogen evolution.
APA
Weinstein, E.N. & Marks, D. (2021). A Structured Observation Distribution for Generative Biological Sequence Prediction and Forecasting. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:11068-11079. Available from https://proceedings.mlr.press/v139/weinstein21a.html.
