Forecasting labels under distribution-shift for machine-guided sequence design

Lauren B Wheelock, Stephen Malina, Jeffrey Gerold, Sam Sinai
Proceedings of the 17th Machine Learning in Computational Biology meeting, PMLR 200:166-180, 2022.

Abstract

The ability to design and optimize biological sequences with specific functionalities would unlock enormous value in technology and healthcare. In recent years, machine learning-guided sequence design has progressed this goal significantly, though validating designed sequences in the lab or clinic takes many months and substantial labor. It is therefore valuable to assess the likelihood that a designed set contains sequences of the desired quality (which often lies outside the label distribution in our training data) before committing resources to an experiment. Forecasting, a prominent concept in many domains where feedback can be delayed (e.g. elections), has not been used or studied in the context of sequence design. Here we propose a method to guide decision-making that forecasts the performance of high-throughput libraries (e.g. containing $10^5$ unique variants) based on estimates provided by models, providing a posterior for the distribution of labels in the library. We show that our method outperforms baselines that naively use model scores to estimate library performance, which are the only tool available today for this purpose.

Cite this Paper


BibTeX
@InProceedings{pmlr-v200-wheelock22a,
  title     = {Forecasting labels under distribution-shift for machine-guided sequence design},
  author    = {Wheelock, Lauren B and Malina, Stephen and Gerold, Jeffrey and Sinai, Sam},
  booktitle = {Proceedings of the 17th Machine Learning in Computational Biology meeting},
  pages     = {166--180},
  year      = {2022},
  editor    = {Knowles, David A and Mostafavi, Sara and Lee, Su-In},
  volume    = {200},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--22 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v200/wheelock22a/wheelock22a.pdf},
  url       = {https://proceedings.mlr.press/v200/wheelock22a.html},
  abstract  = {The ability to design and optimize biological sequences with specific functionalities would unlock enormous value in technology and healthcare. In recent years, machine learning-guided sequence design has progressed this goal significantly, though validating designed sequences in the lab or clinic takes many months and substantial labor. It is therefore valuable to assess the likelihood that a designed set contains sequences of the desired quality (which often lies outside the label distribution in our training data) before committing resources to an experiment. Forecasting, a prominent concept in many domains where feedback can be delayed (e.g. elections), has not been used or studied in the context of sequence design. Here we propose a method to guide decision-making that forecasts the performance of high-throughput libraries (e.g. containing $10^5$ unique variants) based on estimates provided by models, providing a posterior for the distribution of labels in the library. We show that our method outperforms baselines that naively use model scores to estimate library performance, which are the only tool available today for this purpose.}
}
Endnote
%0 Conference Paper
%T Forecasting labels under distribution-shift for machine-guided sequence design
%A Lauren B Wheelock
%A Stephen Malina
%A Jeffrey Gerold
%A Sam Sinai
%B Proceedings of the 17th Machine Learning in Computational Biology meeting
%C Proceedings of Machine Learning Research
%D 2022
%E David A Knowles
%E Sara Mostafavi
%E Su-In Lee
%F pmlr-v200-wheelock22a
%I PMLR
%P 166--180
%U https://proceedings.mlr.press/v200/wheelock22a.html
%V 200
%X The ability to design and optimize biological sequences with specific functionalities would unlock enormous value in technology and healthcare. In recent years, machine learning-guided sequence design has progressed this goal significantly, though validating designed sequences in the lab or clinic takes many months and substantial labor. It is therefore valuable to assess the likelihood that a designed set contains sequences of the desired quality (which often lies outside the label distribution in our training data) before committing resources to an experiment. Forecasting, a prominent concept in many domains where feedback can be delayed (e.g. elections), has not been used or studied in the context of sequence design. Here we propose a method to guide decision-making that forecasts the performance of high-throughput libraries (e.g. containing $10^5$ unique variants) based on estimates provided by models, providing a posterior for the distribution of labels in the library. We show that our method outperforms baselines that naively use model scores to estimate library performance, which are the only tool available today for this purpose.
APA
Wheelock, L. B., Malina, S., Gerold, J., & Sinai, S. (2022). Forecasting labels under distribution-shift for machine-guided sequence design. Proceedings of the 17th Machine Learning in Computational Biology meeting, in Proceedings of Machine Learning Research 200:166-180. Available from https://proceedings.mlr.press/v200/wheelock22a.html.
