Towards a Better Understanding of Reverse-Complement Equivariance for Deep Learning Models in Genomics

Hannah Zhou, Avanti Shrikumar, Anshul Kundaje
Proceedings of the 16th Machine Learning in Computational Biology meeting, PMLR 165:1-33, 2022.

Abstract

Predictive models mapping double-stranded DNA to signals of regulatory activity should, in principle, produce analogous (or “equivariant”) predictions whether the forward strand or its reverse complement (RC) is supplied as input. Unfortunately, standard neural networks can produce highly divergent predictions across strands, even when the training set is augmented with RC sequences. Two strategies have emerged to enforce equivariance: conjoined/“siamese” architectures, and RC parameter sharing or RCPS. However, the connections between the two remain unclear, comparisons to strong baselines are lacking, and neither has been adapted to base-resolution signal profile prediction. In this work, we extend conjoined & RCPS models to base-resolution signal prediction, and introduce a strong baseline: a standard model (trained with RC data augmentation) that is made conjoined only after training, which we call “post-hoc” conjoined. Through benchmarks on diverse tasks, we find post-hoc conjoined consistently performs best or second-best, surpassed only occasionally by RCPS, and never underperforms conjoined-during-training. We propose an overfitting-based hypothesis for the latter finding, and study it empirically. Despite its theoretical appeal, RCPS shows mediocre performance on several tasks, even though (as we prove) it can represent any solution learned by conjoined models. Our results suggest users interested in RC equivariance should default to post-hoc conjoined as a reliable baseline before exploring RCPS. Finally, we present a unified description of conjoined & RCPS architectures, revealing a broader class of models that gradually interpolate between RCPS and conjoined while maintaining equivariance. The code to replicate the experiments is available at https://github.com/hannahgz/BenchmarkRCStrategies. A 22-minute video explaining the paper is available at https://youtu.be/UY1Rmj036Wg
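
To make the two central constructions concrete: a "post-hoc conjoined" model wraps an already-trained network so that both the forward strand and its reverse complement are scored and the results combined. The sketch below is an illustrative reconstruction (not code from the paper's repository), assuming a trained Keras-style `model` that takes one-hot sequences with channels ordered A, C, G, T:

```python
import numpy as np

def reverse_complement(onehot_seq):
    # With channels ordered A, C, G, T, flipping the channel axis maps each
    # base to its complement (A<->T, C<->G); flipping the sequence axis
    # reverses the sequence. Together, these give the reverse complement.
    return onehot_seq[::-1, ::-1]

def posthoc_conjoined_predict(model, onehot_seq):
    # Score both strands and average. For scalar outputs this makes the
    # wrapped predictor exactly strand-invariant; for base-resolution
    # profile outputs, the RC-strand prediction would additionally need
    # to be reversed (and any strand-specific output channels swapped)
    # before averaging.
    fwd = model.predict(onehot_seq[None])[0]
    rev = model.predict(reverse_complement(onehot_seq)[None])[0]
    return 0.5 * (fwd + rev)
```

RC parameter sharing (RCPS), by contrast, builds equivariance into each layer: every learned convolutional filter is also applied in reverse-complemented form with tied weights. A minimal TensorFlow sketch of this idea (layer and variable names are ours, not the paper's):

```python
import tensorflow as tf

class RCPSConv1D(tf.keras.layers.Layer):
    """Conv1D whose filters are shared with their reverse complements,
    so the outputs on a sequence and on its RC are reversed,
    channel-swapped versions of one another (a sketch of RCPS)."""

    def __init__(self, filters, kernel_size, **kwargs):
        super().__init__(**kwargs)
        self.n_filters = filters
        self.kernel_size = kernel_size

    def build(self, input_shape):
        self.kernel = self.add_weight(
            name="kernel",
            shape=(self.kernel_size, input_shape[-1], self.n_filters),
            initializer="glorot_uniform")
        self.bias = self.add_weight(
            name="bias", shape=(self.n_filters,), initializer="zeros")

    def call(self, x):
        # RC of a filter: flip along the length and (ACGT) channel axes.
        rc_kernel = self.kernel[::-1, ::-1, :]
        kernel = tf.concat([self.kernel, rc_kernel], axis=-1)
        bias = tf.concat([self.bias, self.bias], axis=0)
        return tf.nn.conv1d(x, kernel, stride=1, padding="SAME") + bias
```

Stacking such layers (with RC-aware pooling and output heads) yields a network that is equivariant by construction; the paper's unified description situates both this and the conjoined approach within one family of equivariant models.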

Cite this Paper


BibTeX
@InProceedings{pmlr-v165-zhou22a,
  title     = {Towards a Better Understanding of Reverse-Complement Equivariance for Deep Learning Models in Genomics},
  author    = {Zhou, Hannah and Shrikumar, Avanti and Kundaje, Anshul},
  booktitle = {Proceedings of the 16th Machine Learning in Computational Biology meeting},
  pages     = {1--33},
  year      = {2022},
  editor    = {Knowles, David A. and Mostafavi, Sara and Lee, Su-In},
  volume    = {165},
  series    = {Proceedings of Machine Learning Research},
  month     = {22--23 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v165/zhou22a/zhou22a.pdf},
  url       = {https://proceedings.mlr.press/v165/zhou22a.html}
}
Endnote
%0 Conference Paper
%T Towards a Better Understanding of Reverse-Complement Equivariance for Deep Learning Models in Genomics
%A Hannah Zhou
%A Avanti Shrikumar
%A Anshul Kundaje
%B Proceedings of the 16th Machine Learning in Computational Biology meeting
%C Proceedings of Machine Learning Research
%D 2022
%E David A. Knowles
%E Sara Mostafavi
%E Su-In Lee
%F pmlr-v165-zhou22a
%I PMLR
%P 1--33
%U https://proceedings.mlr.press/v165/zhou22a.html
%V 165
APA
Zhou, H., Shrikumar, A. & Kundaje, A. (2022). Towards a Better Understanding of Reverse-Complement Equivariance for Deep Learning Models in Genomics. Proceedings of the 16th Machine Learning in Computational Biology meeting, in Proceedings of Machine Learning Research 165:1-33. Available from https://proceedings.mlr.press/v165/zhou22a.html.