Optimal Design of Stochastic DNA Synthesis Protocols based on Generative Sequence Models

Eli N. Weinstein, Alan N. Amin, Will S. Grathwohl, Daniel Kassler, Jean Disset, Debora Marks
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:7450-7482, 2022.

Abstract

Generative probabilistic models of biological sequences have widespread existing and potential applications in analyzing, predicting and designing proteins, RNA and genomes. To test the predictions of such a model experimentally, the standard approach is to draw samples, and then synthesize each sample individually in the laboratory. However, often orders of magnitude more sequences can be experimentally assayed than can be affordably synthesized individually. In this article, we propose instead to use stochastic synthesis methods, such as mixed nucleotides or trimers. We describe a black-box algorithm for optimizing stochastic synthesis protocols to produce approximate samples from any target generative model. We establish theoretical bounds on the method’s performance, and validate it in simulation using held-out sequence-to-function predictors trained on real experimental data. We show that using optimized stochastic synthesis protocols in place of individual synthesis can increase the number of hits in protein engineering efforts by orders of magnitude, e.g. from zero to a thousand.

Cite this Paper


BibTeX
@InProceedings{pmlr-v151-weinstein22a,
  title     = {Optimal Design of Stochastic DNA Synthesis Protocols based on Generative Sequence Models},
  author    = {Weinstein, Eli N. and Amin, Alan N. and Grathwohl, Will S. and Kassler, Daniel and Disset, Jean and Marks, Debora},
  booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages     = {7450--7482},
  year      = {2022},
  editor    = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume    = {151},
  series    = {Proceedings of Machine Learning Research},
  month     = {28--30 Mar},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v151/weinstein22a/weinstein22a.pdf},
  url       = {https://proceedings.mlr.press/v151/weinstein22a.html},
  abstract  = {Generative probabilistic models of biological sequences have widespread existing and potential applications in analyzing, predicting and designing proteins, RNA and genomes. To test the predictions of such a model experimentally, the standard approach is to draw samples, and then synthesize each sample individually in the laboratory. However, often orders of magnitude more sequences can be experimentally assayed than can be affordably synthesized individually. In this article, we propose instead to use stochastic synthesis methods, such as mixed nucleotides or trimers. We describe a black-box algorithm for optimizing stochastic synthesis protocols to produce approximate samples from any target generative model. We establish theoretical bounds on the method’s performance, and validate it in simulation using held-out sequence-to-function predictors trained on real experimental data. We show that using optimized stochastic synthesis protocols in place of individual synthesis can increase the number of hits in protein engineering efforts by orders of magnitude, e.g. from zero to a thousand.}
}
Endnote
%0 Conference Paper
%T Optimal Design of Stochastic DNA Synthesis Protocols based on Generative Sequence Models
%A Eli N. Weinstein
%A Alan N. Amin
%A Will S. Grathwohl
%A Daniel Kassler
%A Jean Disset
%A Debora Marks
%B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2022
%E Gustau Camps-Valls
%E Francisco J. R. Ruiz
%E Isabel Valera
%F pmlr-v151-weinstein22a
%I PMLR
%P 7450--7482
%U https://proceedings.mlr.press/v151/weinstein22a.html
%V 151
%X Generative probabilistic models of biological sequences have widespread existing and potential applications in analyzing, predicting and designing proteins, RNA and genomes. To test the predictions of such a model experimentally, the standard approach is to draw samples, and then synthesize each sample individually in the laboratory. However, often orders of magnitude more sequences can be experimentally assayed than can be affordably synthesized individually. In this article, we propose instead to use stochastic synthesis methods, such as mixed nucleotides or trimers. We describe a black-box algorithm for optimizing stochastic synthesis protocols to produce approximate samples from any target generative model. We establish theoretical bounds on the method’s performance, and validate it in simulation using held-out sequence-to-function predictors trained on real experimental data. We show that using optimized stochastic synthesis protocols in place of individual synthesis can increase the number of hits in protein engineering efforts by orders of magnitude, e.g. from zero to a thousand.
APA
Weinstein, E. N., Amin, A. N., Grathwohl, W. S., Kassler, D., Disset, J., & Marks, D. (2022). Optimal Design of Stochastic DNA Synthesis Protocols based on Generative Sequence Models. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:7450-7482. Available from https://proceedings.mlr.press/v151/weinstein22a.html.
