Language-Informed Basecalling Architecture for Nanopore Direct RNA Sequencing

Alexandra Sneddon, Pablo Acera Mateos, Nikolay Shirokikh, Eduardo Eyras
Proceedings of the 17th Machine Learning in Computational Biology meeting, PMLR 200:150-165, 2022.

Abstract

Algorithms developed for basecalling Nanopore signals have primarily focused on DNA to date and utilise the raw signal as the only input. However, it is known that messenger RNA (mRNA), which dominates Nanopore direct RNA (dRNA) sequencing libraries, contains specific nucleotide patterns that are implicitly encoded in the Nanopore signals since RNA is always sequenced from the 3’ to 5’ direction. In this study we present an approach to exploit the sequence biases in mRNA as an additional input to dRNA basecalling. We developed a probabilistic model of mRNA language and propose a modified CTC beam search decoding algorithm to conditionally incorporate the language model during basecalling. Our findings demonstrate that inclusion of mRNA language is able to guide CTC beam search decoding towards the more probable nucleotide sequence. We also propose a time-efficient approach to decoding variable-length Nanopore signals. This work provides the first demonstration of the potential for biological language to inform Nanopore basecalling. Code is available at: https://github.com/comprna/radian.

Cite this Paper


BibTeX
@InProceedings{pmlr-v200-sneddon22a,
  title = {Language-Informed Basecalling Architecture for Nanopore Direct RNA Sequencing},
  author = {Sneddon, Alexandra and Acera Mateos, Pablo and Shirokikh, Nikolay and Eyras, Eduardo},
  booktitle = {Proceedings of the 17th Machine Learning in Computational Biology meeting},
  pages = {150--165},
  year = {2022},
  editor = {Knowles, David A and Mostafavi, Sara and Lee, Su-In},
  volume = {200},
  series = {Proceedings of Machine Learning Research},
  month = {21--22 Nov},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v200/sneddon22a/sneddon22a.pdf},
  url = {https://proceedings.mlr.press/v200/sneddon22a.html},
  abstract = {Algorithms developed for basecalling Nanopore signals have primarily focused on DNA to date and utilise the raw signal as the only input. However, it is known that messenger RNA (mRNA), which dominates Nanopore direct RNA (dRNA) sequencing libraries, contains specific nucleotide patterns that are implicitly encoded in the Nanopore signals since RNA is always sequenced from the 3’ to 5’ direction. In this study we present an approach to exploit the sequence biases in mRNA as an additional input to dRNA basecalling. We developed a probabilistic model of mRNA language and propose a modified CTC beam search decoding algorithm to conditionally incorporate the language model during basecalling. Our findings demonstrate that inclusion of mRNA language is able to guide CTC beam search decoding towards the more probable nucleotide sequence. We also propose a time efficient approach to decoding variable length nanopore signals. This work provides the first demonstration of the potential for biological language to inform Nanopore basecalling. Code is available at: https://github.com/comprna/radian.}
}
Endnote
%0 Conference Paper
%T Language-Informed Basecalling Architecture for Nanopore Direct RNA Sequencing
%A Alexandra Sneddon
%A Pablo Acera Mateos
%A Nikolay Shirokikh
%A Eduardo Eyras
%B Proceedings of the 17th Machine Learning in Computational Biology meeting
%C Proceedings of Machine Learning Research
%D 2022
%E David A Knowles
%E Sara Mostafavi
%E Su-In Lee
%F pmlr-v200-sneddon22a
%I PMLR
%P 150--165
%U https://proceedings.mlr.press/v200/sneddon22a.html
%V 200
%X Algorithms developed for basecalling Nanopore signals have primarily focused on DNA to date and utilise the raw signal as the only input. However, it is known that messenger RNA (mRNA), which dominates Nanopore direct RNA (dRNA) sequencing libraries, contains specific nucleotide patterns that are implicitly encoded in the Nanopore signals since RNA is always sequenced from the 3’ to 5’ direction. In this study we present an approach to exploit the sequence biases in mRNA as an additional input to dRNA basecalling. We developed a probabilistic model of mRNA language and propose a modified CTC beam search decoding algorithm to conditionally incorporate the language model during basecalling. Our findings demonstrate that inclusion of mRNA language is able to guide CTC beam search decoding towards the more probable nucleotide sequence. We also propose a time efficient approach to decoding variable length nanopore signals. This work provides the first demonstration of the potential for biological language to inform Nanopore basecalling. Code is available at: https://github.com/comprna/radian.
APA
Sneddon, A., Acera Mateos, P., Shirokikh, N., & Eyras, E. (2022). Language-Informed Basecalling Architecture for Nanopore Direct RNA Sequencing. Proceedings of the 17th Machine Learning in Computational Biology meeting, in Proceedings of Machine Learning Research 200:150-165. Available from https://proceedings.mlr.press/v200/sneddon22a.html.
