GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models

eyes s. robson, Nilah Ioannidis
Proceedings of the 18th Machine Learning in Computational Biology meeting, PMLR 240:250-266, 2024.

Abstract

Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous model specification and controlled evaluation, problems familiar to other fields of AI. Research strategies that have greatly benefited other fields — including benchmarking, auditing, and algorithmic fairness — are also needed to advance the field of genomic AI and to facilitate model development. Here we propose a genomic AI benchmark, GUANinE, for evaluating model generalization across a number of distinct genomic tasks. Compared to existing task formulations in computational genomics, GUANinE is large-scale, de-noised, and suitable for evaluating pretrained models. GUANinE v1.0 primarily focuses on functional genomics tasks such as functional element annotation and gene expression prediction, and it also draws upon connections to evolutionary biology through sequence conservation tasks. The current GUANinE tasks provide insight into the performance of existing genomic AI models and non-neural baselines, with opportunities to be refined, revisited, and broadened as the field matures. Finally, the GUANinE benchmark allows us to evaluate new self-supervised T5 models and explore the tradeoffs between tokenization and model performance, while showcasing the potential for self-supervision to complement existing pretraining procedures.

Cite this Paper

BibTeX
@InProceedings{pmlr-v240-robson24a,
  title     = {GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models},
  author    = {robson, eyes s. and Ioannidis, Nilah},
  booktitle = {Proceedings of the 18th Machine Learning in Computational Biology meeting},
  pages     = {250--266},
  year      = {2024},
  editor    = {Knowles, David A. and Mostafavi, Sara},
  volume    = {240},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Nov--01 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v240/robson24a/robson24a.pdf},
  url       = {https://proceedings.mlr.press/v240/robson24a.html}
}
APA
robson, e.s. & Ioannidis, N. (2024). GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models. Proceedings of the 18th Machine Learning in Computational Biology meeting, in Proceedings of Machine Learning Research 240:250-266. Available from https://proceedings.mlr.press/v240/robson24a.html.
