Towards More Realistic Simulated Datasets for Benchmarking Deep Learning Models in Regulatory Genomics

Eva I. Prakash, Avanti Shrikumar, Anshul Kundaje
Proceedings of the 16th Machine Learning in Computational Biology meeting, PMLR 165:58-77, 2022.

Abstract

Deep neural networks and support vector machines have been shown to accurately predict genome-wide signals of regulatory activity from raw DNA sequences. These models are appealing in part because they can learn predictive DNA sequence features without prior assumptions. Several methods such as in-silico mutagenesis, GradCAM, DeepLIFT, Integrated Gradients and GkmExplain have been developed to reveal these learned features. However, the behavior of these methods on regulatory genomic data remains an area of active research. Although prior work has benchmarked these methods on simulated datasets with known ground-truth motifs, these simulations employed highly simplified regulatory logic that is not representative of the genome. In this work, we propose a novel pipeline for designing simulated data that comes closer to modeling the complexity of regulatory genomic DNA. We apply the pipeline to build simulated datasets based on publicly-available chromatin accessibility experiments and use these datasets to benchmark different interpretation methods based on their ability to identify ground-truth motifs. We find that a GradCAM-based method, which was reported to perform well on a more simplified dataset, does not do well on this dataset (particularly when using an architecture with shorter convolutional kernels in the first layer), and we theoretically show that this is expected based on the nature of regulatory genomic data. We also show that Integrated Gradients sometimes performs worse than gradient-times-input, likely owing to its linear interpolation path. We additionally explore the impact of user-defined settings on the interpretation methods, such as the choice of “reference”/“baseline”, and identify recommended settings for genomics. Our analysis suggests several promising directions for future research on these model interpretation methods. Code and links to data are available at https://github.com/kundajelab/interpret-benchmark.
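The abstract contrasts Integrated Gradients, which averages gradients along a linear interpolation path from a user-chosen "baseline" to the input, with the simpler gradient-times-input. A minimal numerical sketch of the two attributions on a toy one-layer model may help; the model, weights, and function names here are illustrative only and are not taken from the paper:

```python
import numpy as np

def model(x, w):
    # Toy "network": ReLU of a linear score over an input vector
    # (e.g. a flattened one-hot DNA encoding).
    return max(0.0, float(w @ x))

def grad(x, w):
    # Analytic gradient of the toy model w.r.t. the input.
    return w if w @ x > 0 else np.zeros_like(w)

def grad_times_input(x, w):
    # Gradient-times-input attribution.
    return grad(x, w) * x

def integrated_gradients(x, w, baseline, steps=50):
    # Riemann approximation of Integrated Gradients: average the gradient
    # at points along the straight line from baseline to x, then scale
    # elementwise by (x - baseline).
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [grad(baseline + a * (x - baseline), w) for a in alphas], axis=0
    )
    return avg_grad * (x - baseline)
```

With a zero baseline and this toy model active along the whole path, the two attributions coincide and IG's completeness property holds (attributions sum to the change in output); the paper's observation is that for real trained networks the linear interpolation path can make IG behave worse than gradient-times-input.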

Cite this Paper

BibTeX
@InProceedings{pmlr-v165-prakash22a,
  title = {Towards More Realistic Simulated Datasets for Benchmarking Deep Learning Models in Regulatory Genomics},
  author = {Prakash, Eva I. and Shrikumar, Avanti and Kundaje, Anshul},
  booktitle = {Proceedings of the 16th Machine Learning in Computational Biology meeting},
  pages = {58--77},
  year = {2022},
  editor = {Knowles, David A. and Mostafavi, Sara and Lee, Su-In},
  volume = {165},
  series = {Proceedings of Machine Learning Research},
  month = {22--23 Nov},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v165/prakash22a/prakash22a.pdf},
  url = {https://proceedings.mlr.press/v165/prakash22a.html},
  abstract = {Deep neural networks and support vector machines have been shown to accurately predict genome-wide signals of regulatory activity from raw DNA sequences. These models are appealing in part because they can learn predictive DNA sequence features without prior assumptions. Several methods such as in-silico mutagenesis, GradCAM, DeepLIFT, Integrated Gradients and GkmExplain have been developed to reveal these learned features. However, the behavior of these methods on regulatory genomic data remains an area of active research. Although prior work has benchmarked these methods on simulated datasets with known ground-truth motifs, these simulations employed highly simplified regulatory logic that is not representative of the genome. In this work, we propose a novel pipeline for designing simulated data that comes closer to modeling the complexity of regulatory genomic DNA. We apply the pipeline to build simulated datasets based on publicly-available chromatin accessibility experiments and use these datasets to benchmark different interpretation methods based on their ability to identify ground-truth motifs. We find that a GradCAM-based method, which was reported to perform well on a more simplified dataset, does not do well on this dataset (particularly when using an architecture with shorter convolutional kernels in the first layer), and we theoretically show that this is expected based on the nature of regulatory genomic data. We also show that Integrated Gradients sometimes performs worse than gradient-times-input, likely owing to its linear interpolation path. We additionally explore the impact of user-defined settings on the interpretation methods, such as the choice of “reference”/“baseline”, and identify recommended settings for genomics. Our analysis suggests several promising directions for future research on these model interpretation methods. Code and links to data are available at https://github.com/kundajelab/interpret-benchmark.}
}
Endnote
%0 Conference Paper
%T Towards More Realistic Simulated Datasets for Benchmarking Deep Learning Models in Regulatory Genomics
%A Eva I. Prakash
%A Avanti Shrikumar
%A Anshul Kundaje
%B Proceedings of the 16th Machine Learning in Computational Biology meeting
%C Proceedings of Machine Learning Research
%D 2022
%E David A. Knowles
%E Sara Mostafavi
%E Su-In Lee
%F pmlr-v165-prakash22a
%I PMLR
%P 58--77
%U https://proceedings.mlr.press/v165/prakash22a.html
%V 165
%X Deep neural networks and support vector machines have been shown to accurately predict genome-wide signals of regulatory activity from raw DNA sequences. These models are appealing in part because they can learn predictive DNA sequence features without prior assumptions. Several methods such as in-silico mutagenesis, GradCAM, DeepLIFT, Integrated Gradients and GkmExplain have been developed to reveal these learned features. However, the behavior of these methods on regulatory genomic data remains an area of active research. Although prior work has benchmarked these methods on simulated datasets with known ground-truth motifs, these simulations employed highly simplified regulatory logic that is not representative of the genome. In this work, we propose a novel pipeline for designing simulated data that comes closer to modeling the complexity of regulatory genomic DNA. We apply the pipeline to build simulated datasets based on publicly-available chromatin accessibility experiments and use these datasets to benchmark different interpretation methods based on their ability to identify ground-truth motifs. We find that a GradCAM-based method, which was reported to perform well on a more simplified dataset, does not do well on this dataset (particularly when using an architecture with shorter convolutional kernels in the first layer), and we theoretically show that this is expected based on the nature of regulatory genomic data. We also show that Integrated Gradients sometimes performs worse than gradient-times-input, likely owing to its linear interpolation path. We additionally explore the impact of user-defined settings on the interpretation methods, such as the choice of “reference”/“baseline”, and identify recommended settings for genomics. Our analysis suggests several promising directions for future research on these model interpretation methods. Code and links to data are available at https://github.com/kundajelab/interpret-benchmark.
APA
Prakash, E.I., Shrikumar, A. & Kundaje, A. (2022). Towards More Realistic Simulated Datasets for Benchmarking Deep Learning Models in Regulatory Genomics. Proceedings of the 16th Machine Learning in Computational Biology meeting, in Proceedings of Machine Learning Research 165:58-77. Available from https://proceedings.mlr.press/v165/prakash22a.html.