Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing

Ye Du, Chen Yang, Nanxi Yu, Wanyu Lin, Qian Zhao, Shujun Wang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:14669-14681, 2025.

Abstract

De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models encode the observed mass spectra into latent representations from which peptides are predicted auto-regressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called Latent Imputation before Prediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. Code is available at https://github.com/usr922/LIPNovo.
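
The abstract frames imputation as set prediction with optimal bipartite matching between query outputs and theoretical-peak representations. The sketch below illustrates only that matching step, under stated assumptions: the function name `match_targets_to_queries`, the Euclidean cost, and the toy 2-D embeddings are all hypothetical and are not taken from the paper or the LIPNovo repository. It uses brute-force enumeration for clarity; a practical implementation would more likely use a Hungarian-algorithm solver.

```python
from itertools import permutations
from math import dist


def match_targets_to_queries(query_embeddings, target_embeddings):
    """Brute-force optimal bipartite matching (illustrative only).

    Returns (assignment, total_cost), where assignment[t] is the index of
    the query matched to target t. Each query is used at most once.
    The Euclidean cost is an assumption, not the paper's stated cost.
    """
    if len(target_embeddings) > len(query_embeddings):
        raise ValueError("need at least as many queries as targets")

    best_assignment, best_cost = None, float("inf")
    # Enumerate every way to pick a distinct query for each target.
    for candidate in permutations(range(len(query_embeddings)),
                                  len(target_embeddings)):
        cost = sum(
            dist(query_embeddings[q], target_embeddings[t])
            for t, q in enumerate(candidate)
        )
        if cost < best_cost:
            best_assignment, best_cost = candidate, cost
    return best_assignment, best_cost


# Hypothetical 2-D embeddings: three query outputs, two theoretical-peak targets.
queries = [(0.9, 0.1), (0.0, 1.0), (0.5, 0.5)]
targets = [(0.0, 0.9), (1.0, 0.0)]

assignment, total_cost = match_targets_to_queries(queries, targets)
print(assignment)  # (1, 0): target 0 pairs with query 1, target 1 with query 0
```

In this toy case the third query is left unmatched, which is the usual situation in set prediction when there are more queries than targets. Enumeration scales factorially, so it is suitable only for illustrating the idea on very small sets.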

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-du25g,
  title = {Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing},
  author = {Du, Ye and Yang, Chen and Yu, Nanxi and Lin, Wanyu and Zhao, Qian and Wang, Shujun},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {14669--14681},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/du25g/du25g.pdf},
  url = {https://proceedings.mlr.press/v267/du25g.html},
  abstract = {De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models encode the observed mass spectra into latent representations from which peptides are predicted auto-regressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called $\underline{\textbf{L}}$atent $\underline{\textbf{I}}$mputation before $\underline{\textbf{P}}$rediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. Code is available at https://github.com/usr922/LIPNovo.}
}
Endnote
%0 Conference Paper
%T Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing
%A Ye Du
%A Chen Yang
%A Nanxi Yu
%A Wanyu Lin
%A Qian Zhao
%A Shujun Wang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-du25g
%I PMLR
%P 14669--14681
%U https://proceedings.mlr.press/v267/du25g.html
%V 267
%X De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models encode the observed mass spectra into latent representations from which peptides are predicted auto-regressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called $\underline{\textbf{L}}$atent $\underline{\textbf{I}}$mputation before $\underline{\textbf{P}}$rediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. Code is available at https://github.com/usr922/LIPNovo.
APA
Du, Y., Yang, C., Yu, N., Lin, W., Zhao, Q., & Wang, S. (2025). Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:14669-14681. Available from https://proceedings.mlr.press/v267/du25g.html.
