Kernel-Based Evaluation of Conditional Biological Sequence Models

Pierre Glaser, Steffanie Paul, Alissa M Hummer, Charlotte Deane, Debora Susan Marks, Alan Nawzad Amin
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:15678-15705, 2024.

Abstract

We propose a set of kernel-based tools to evaluate the designs and tune the hyperparameters of conditional sequence models, with a focus on problems in computational biology. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model’s estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). Provided that the model can be sampled from, the ACMMD can be estimated unbiasedly from data to quantify absolute model fit, integrated within hypothesis tests, and used to evaluate model reliability. We demonstrate the utility of our approach by analyzing a popular protein design model, ProteinMPNN. We are able to reject the hypothesis that ProteinMPNN fits its data for various protein families, and tune the model’s temperature hyperparameter to achieve a better fit.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-glaser24a,
  title = {Kernel-Based Evaluation of Conditional Biological Sequence Models},
  author = {Glaser, Pierre and Paul, Steffanie and Hummer, Alissa M and Deane, Charlotte and Marks, Debora Susan and Amin, Alan Nawzad},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {15678--15705},
  year = {2024},
  editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  month = {21--27 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/glaser24a/glaser24a.pdf},
  url = {https://proceedings.mlr.press/v235/glaser24a.html},
  abstract = {We propose a set of kernel-based tools to evaluate the designs and tune the hyperparameters of conditional sequence models, with a focus on problems in computational biology. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model’s estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). Provided that the model can be sampled from, the ACMMD can be estimated unbiasedly from data to quantify absolute model fit, integrated within hypothesis tests, and used to evaluate model reliability. We demonstrate the utility of our approach by analyzing a popular protein design model, ProteinMPNN. We are able to reject the hypothesis that ProteinMPNN fits its data for various protein families, and tune the model’s temperature hyperparameter to achieve a better fit.}
}
Endnote
%0 Conference Paper
%T Kernel-Based Evaluation of Conditional Biological Sequence Models
%A Pierre Glaser
%A Steffanie Paul
%A Alissa M Hummer
%A Charlotte Deane
%A Debora Susan Marks
%A Alan Nawzad Amin
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-glaser24a
%I PMLR
%P 15678--15705
%U https://proceedings.mlr.press/v235/glaser24a.html
%V 235
%X We propose a set of kernel-based tools to evaluate the designs and tune the hyperparameters of conditional sequence models, with a focus on problems in computational biology. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model’s estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). Provided that the model can be sampled from, the ACMMD can be estimated unbiasedly from data to quantify absolute model fit, integrated within hypothesis tests, and used to evaluate model reliability. We demonstrate the utility of our approach by analyzing a popular protein design model, ProteinMPNN. We are able to reject the hypothesis that ProteinMPNN fits its data for various protein families, and tune the model’s temperature hyperparameter to achieve a better fit.
APA
Glaser, P., Paul, S., Hummer, A.M., Deane, C., Marks, D.S. & Amin, A.N. (2024). Kernel-Based Evaluation of Conditional Biological Sequence Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:15678-15705. Available from https://proceedings.mlr.press/v235/glaser24a.html.