Learning from biased positive-unlabeled data via threshold calibration

Paweł Teisseyre, Timo Martens, Jessa Bekker, Jesse Davis
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:2314-2322, 2025.

Abstract

Learning from positive and unlabeled data (PU learning) aims to train a binary classification model when only positive and unlabeled examples are available. Typically, learners assume that there is a labeling mechanism that determines which positive labels are observed. A particularly challenging setting arises when the observed positive labels are a biased sample from the positive distribution. Current approaches either require estimating the propensity scores, which are the instance-specific probabilities that a positive example’s label will be observed, or make overly restrictive assumptions about the labeling mechanism. We make a novel assumption about the labeling mechanism which we show is more general than several commonly used existing ones. Moreover, the combination of our novel assumption and theoretical results from robust statistics can simplify the process of learning from biased PU data. Empirically, our approach offers superior predictive and runtime performance compared to the state-of-the-art methods.
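The biased labeling mechanism and propensity scores described in the abstract can be illustrated with a small synthetic sketch. This is a hypothetical data-generating example, not the paper's method; all variable names and parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D example: features x, true labels y (unobserved in practice).
n = 10_000
x = rng.normal(size=n)
y = (x + rng.normal(scale=0.5, size=n) > 0).astype(int)  # true class labels

# Instance-dependent propensity score e(x) = P(label observed | y = 1, x).
# Labeling here is biased toward large x: positives with small x are rarely
# labeled, so the labeled set is a biased sample of the positive distribution.
def propensity(x):
    return 1.0 / (1.0 + np.exp(-2.0 * x))

# Observed PU label: s = 1 only for labeled positives; everything else,
# including unlabeled positives, appears as "unlabeled" (s = 0).
s = (y == 1) & (rng.random(n) < propensity(x))

print("true positives:", y.sum(), "labeled positives:", s.sum())
# The labeled positives skew toward large x, showing the labeling bias:
print("mean x among positives:", x[y == 1].mean().round(2))
print("mean x among labeled:  ", x[s].mean().round(2))
```

A learner that sees only `x` and `s` and treats the labeled examples as an unbiased positive sample would be misled, which is the setting the paper addresses.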

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-teisseyre25a,
  title = {Learning from biased positive-unlabeled data via threshold calibration},
  author = {Teisseyre, Pawe{\l} and Martens, Timo and Bekker, Jessa and Davis, Jesse},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages = {2314--2322},
  year = {2025},
  editor = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume = {258},
  series = {Proceedings of Machine Learning Research},
  month = {03--05 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/teisseyre25a/teisseyre25a.pdf},
  url = {https://proceedings.mlr.press/v258/teisseyre25a.html},
  abstract = {Learning from positive and unlabeled data (PU learning) aims to train a binary classification model when only positive and unlabeled examples are available. Typically, learners assume that there is a labeling mechanism that determines which positive labels are observed. A particularly challenging setting arises when the observed positive labels are a biased sample from the positive distribution. Current approaches either require estimating the propensity scores, which are the instance-specific probabilities that a positive example’s label will be observed, or make overly restrictive assumptions about the labeling mechanism. We make a novel assumption about the labeling mechanism which we show is more general than several commonly used existing ones. Moreover, the combination of our novel assumption and theoretical results from robust statistics can simplify the process of learning from biased PU data. Empirically, our approach offers superior predictive and runtime performance compared to the state-of-the-art methods.}
}
Endnote
%0 Conference Paper
%T Learning from biased positive-unlabeled data via threshold calibration
%A Paweł Teisseyre
%A Timo Martens
%A Jessa Bekker
%A Jesse Davis
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-teisseyre25a
%I PMLR
%P 2314--2322
%U https://proceedings.mlr.press/v258/teisseyre25a.html
%V 258
%X Learning from positive and unlabeled data (PU learning) aims to train a binary classification model when only positive and unlabeled examples are available. Typically, learners assume that there is a labeling mechanism that determines which positive labels are observed. A particularly challenging setting arises when the observed positive labels are a biased sample from the positive distribution. Current approaches either require estimating the propensity scores, which are the instance-specific probabilities that a positive example’s label will be observed, or make overly restrictive assumptions about the labeling mechanism. We make a novel assumption about the labeling mechanism which we show is more general than several commonly used existing ones. Moreover, the combination of our novel assumption and theoretical results from robust statistics can simplify the process of learning from biased PU data. Empirically, our approach offers superior predictive and runtime performance compared to the state-of-the-art methods.
APA
Teisseyre, P., Martens, T., Bekker, J. & Davis, J. (2025). Learning from biased positive-unlabeled data via threshold calibration. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:2314-2322. Available from https://proceedings.mlr.press/v258/teisseyre25a.html.