Beware of Data Leakage from Protein LLM Pretraining
Proceedings of the 19th Machine Learning in Computational Biology meeting, PMLR 261:106-116, 2024.
Abstract
Pretrained protein language models are becoming increasingly popular as a backbone for protein property inference tasks such as structure prediction or function annotation, accelerating biological research. However, related research often does not consider the effects of data leakage from pretraining on the actual downstream task, resulting in potentially unrealistic performance estimates. Reported generalization might therefore not be reproducible for proteins highly dissimilar from the pretraining set. In this work, we measure the effects of data leakage from protein language model pretraining in the domain of protein thermostability prediction. Specifically, we compare two dataset split strategies: a pretraining-aware split, designed to avoid similarity between the pretraining data and the held-out test sets, and a commonly used naive split, which relies on clustering only the downstream training data without taking the pretraining data into account. Our experiments suggest that data leakage from language model pretraining has consistent effects on melting point prediction across all experiments, distorting the measured performance by an average of 11.1% compared to the pretraining-aware split. The source code and our dataset splits are available at https://gitlab.com/dacs-hpi/pretraining-aware-hotprot.
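To make the contrast between the two split strategies concrete, below is a minimal, hedged Python sketch (not the authors' implementation). It uses a toy character-level similarity check in place of the sequence clustering and similarity search tools a real pipeline would use; the function names and the similarity threshold are illustrative assumptions only.

```python
# Illustrative sketch of a naive vs. pretraining-aware split.
# Toy similarity measure and hypothetical helper names; a real pipeline
# would use proper sequence clustering / similarity search instead.
from difflib import SequenceMatcher

def is_similar(seq_a: str, seq_b: str, threshold: float = 0.5) -> bool:
    """Toy similarity check: ratio of matching characters (assumption)."""
    return SequenceMatcher(None, seq_a, seq_b).ratio() >= threshold

def naive_split(downstream: list[str], test_fraction: float = 0.2):
    """Split the downstream data alone, ignoring the pretraining corpus."""
    n_test = int(len(downstream) * test_fraction)
    return downstream[n_test:], downstream[:n_test]  # train, test

def pretraining_aware_split(downstream: list[str],
                            pretraining: list[str],
                            test_fraction: float = 0.2):
    """Hold out only downstream proteins dissimilar to all pretraining sequences."""
    leak_free = [s for s in downstream
                 if not any(is_similar(s, p) for p in pretraining)]
    n_test = int(len(leak_free) * test_fraction)
    test = leak_free[:n_test]
    train = [s for s in downstream if s not in set(test)]
    return train, test
```

The key difference is that the pretraining-aware variant filters candidate test proteins against the pretraining corpus before holding them out, so measured test performance reflects proteins the language model has not effectively seen during pretraining.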