Learning from positive and unlabeled examples by enforcing statistical significance

Pierre Geurts

Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR 15:305-314, 2011.

Abstract

Given a finite but large set of objects described by a vector of features, only a small subset of which have been labeled as ‘positive’ with respect to a class of interest, we consider the problem of characterizing the positive class. We formalize this as the problem of learning a feature-based score function that minimizes the p-value of a nonparametric statistical hypothesis test. For linear score functions over the original feature space or over one of its kernelized versions, we provide a solution to this problem computed by a one-class SVM applied to a surrogate dataset obtained by sampling subsets of the overall set of objects and representing them by their average feature vector shifted by the average feature vector of the original sample of positive examples. We carry out experiments with this method on the prediction of targets of transcription factors in two different organisms, E. coli and S. cerevisiae. Our method extends enrichment analysis commonly carried out in bioinformatics and its results outperform common solutions to this problem.
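
The surrogate-dataset construction described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`mean_vector`, `surrogate_dataset`, `empirical_p_value`), the choice to draw subsets of the same size as the positive set, the sign convention of the shift (positive mean minus subset mean), and the permutation-style p-value estimate are all assumptions made here for illustration. The one-class SVM step itself is not shown.

```python
import random


def mean_vector(rows):
    """Component-wise average of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def surrogate_dataset(X, positive_idx, n_subsets, rng):
    """Sample random subsets of all objects and represent each one by its
    average feature vector, shifted by the average of the positive examples.

    Assumptions (not taken from the paper): each subset has the same size as
    the positive set, and the shift is (positive mean - subset mean).
    """
    pos_mean = mean_vector([X[i] for i in positive_idx])
    k = len(positive_idx)
    surrogate = []
    for _ in range(n_subsets):
        subset = rng.sample(range(len(X)), k)  # sampling without replacement
        sub_mean = mean_vector([X[i] for i in subset])
        surrogate.append([p - s for p, s in zip(pos_mean, sub_mean)])
    return surrogate


def empirical_p_value(w, surrogate):
    """Permutation-style p-value estimate for a linear score function w.

    A surrogate point d with w . d <= 0 corresponds to a random subset whose
    average score is at least as high as that of the positives.
    """
    hits = sum(
        1 for d in surrogate
        if sum(wi * di for wi, di in zip(w, d)) <= 0
    )
    return (1 + hits) / (1 + len(surrogate))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: 200 objects, 2 features; the first 10 are the labeled positives
    # and are shifted along feature 0.
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    positives = list(range(10))
    for i in positives:
        X[i][0] += 3.0
    D = surrogate_dataset(X, positives, 500, rng)
    print(empirical_p_value([1.0, 0.0], D))   # score aligned with the positives
    print(empirical_p_value([-1.0, 0.0], D))  # score opposed to the positives
```

In this sketch, a weight vector that scores the positives highly leaves few surrogate points on the non-positive side of the hyperplane, which is the kind of separation a one-class SVM trained on the surrogate points would look for.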

Cite this Paper

BibTeX

@InProceedings{pmlr-v15-geurts11a,
  title = 	 {Learning from positive and unlabeled examples by enforcing statistical significance},
  author = 	 {Geurts, Pierre},
  booktitle = 	 {Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics},
  pages = 	 {305--314},
  year = 	 {2011},
  editor = 	 {Gordon, Geoffrey and Dunson, David and Dudík, Miroslav},
  volume = 	 {15},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Fort Lauderdale, FL, USA},
  month = 	 {11--13 Apr},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v15/geurts11a/geurts11a.pdf},
  url = 	 {https://proceedings.mlr.press/v15/geurts11a.html},
  abstract = 	 {Given a finite but large set of objects described by a vector of features, only a small subset of which have been labeled as ‘positive’ with respect to a class of interest, we consider the problem of characterizing the positive class. We formalize this as the problem of learning a feature-based score function that minimizes the p-value of a nonparametric statistical hypothesis test. For linear score functions over the original feature space or over one of its kernelized versions, we provide a solution to this problem computed by a one-class SVM applied to a surrogate dataset obtained by sampling subsets of the overall set of objects and representing them by their average feature vector shifted by the average feature vector of the original sample of positive examples. We carry out experiments with this method on the prediction of targets of transcription factors in two different organisms, E. coli and S. cerevisiae. Our method extends enrichment analysis commonly carried out in bioinformatics and its results outperform common solutions to this problem.}
}

Endnote

%0 Conference Paper
%T Learning from positive and unlabeled examples by enforcing statistical significance
%A Pierre Geurts
%B Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2011
%E Geoffrey Gordon
%E David Dunson
%E Miroslav Dudík
%F pmlr-v15-geurts11a
%I PMLR
%P 305--314
%U https://proceedings.mlr.press/v15/geurts11a.html
%V 15
%X Given a finite but large set of objects described by a vector of features, only a small subset of which have been labeled as ‘positive’ with respect to a class of interest, we consider the problem of characterizing the positive class. We formalize this as the problem of learning a feature-based score function that minimizes the p-value of a nonparametric statistical hypothesis test. For linear score functions over the original feature space or over one of its kernelized versions, we provide a solution to this problem computed by a one-class SVM applied to a surrogate dataset obtained by sampling subsets of the overall set of objects and representing them by their average feature vector shifted by the average feature vector of the original sample of positive examples. We carry out experiments with this method on the prediction of targets of transcription factors in two different organisms, E. coli and S. cerevisiae. Our method extends enrichment analysis commonly carried out in bioinformatics and its results outperform common solutions to this problem.

RIS

TY  - CPAPER
TI  - Learning from positive and unlabeled examples by enforcing statistical significance
AU  - Pierre Geurts
BT  - Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics
DA  - 2011/06/14
ED  - Geoffrey Gordon
ED  - David Dunson
ED  - Miroslav Dudík
ID  - pmlr-v15-geurts11a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 15
SP  - 305
EP  - 314
L1  - http://proceedings.mlr.press/v15/geurts11a/geurts11a.pdf
UR  - https://proceedings.mlr.press/v15/geurts11a.html
AB  - Given a finite but large set of objects described by a vector of features, only a small subset of which have been labeled as ‘positive’ with respect to a class of interest, we consider the problem of characterizing the positive class. We formalize this as the problem of learning a feature-based score function that minimizes the p-value of a nonparametric statistical hypothesis test. For linear score functions over the original feature space or over one of its kernelized versions, we provide a solution to this problem computed by a one-class SVM applied to a surrogate dataset obtained by sampling subsets of the overall set of objects and representing them by their average feature vector shifted by the average feature vector of the original sample of positive examples. We carry out experiments with this method on the prediction of targets of transcription factors in two different organisms, E. coli and S. cerevisiae. Our method extends enrichment analysis commonly carried out in bioinformatics and its results outperform common solutions to this problem.
ER  -

APA

Geurts, P. (2011). Learning from positive and unlabeled examples by enforcing statistical significance. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 15:305-314. Available from https://proceedings.mlr.press/v15/geurts11a.html.
