Using GNUsmail to Compare Data Stream Mining Methods for On-line Email Classification

Jose M. Carmona-Cejudo; Manuel Baena-Garcia; Jose Campo-Avila; Rafael Morales-Bueno; Joao Gama; Albert Bifet

Proceedings of the Second Workshop on Applications of Pattern Analysis, PMLR 17:12-18, 2011.

Abstract

Real-time classification of emails is a challenging task because of its online nature, and also because email streams are subject to concept drift. Identifying email spam, where only two different labels or classes are defined (spam or not spam), has received great attention in the literature. We are nevertheless interested in a more specific classification where multiple folders exist, which is an additional source of complexity: the class can have a very large number of different values. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluation in data stream contexts, which is why other metrics, like the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using recently proposed mechanisms such as fading factors. In this paper, we present GNUsmail, an open-source extensible framework for email classification, and we focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different data stream mining methods, using state-of-the-art online evaluation metrics. Besides describing the framework, characterized by two overlapping phases, we show how it can be used to compare different algorithms in order to find the most appropriate one. The GNUsmail source code includes a tool for launching replicable experiments.
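
The prequential error with a fading factor mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from GNUsmail: the names MajorityFolder, prequential_losses and faded_prequential_error are hypothetical, the 0/1 loss is an assumption, and the recurrence used (faded loss sum divided by faded count, both discounted by a factor alpha at every step) is the commonly cited fading-factor formulation rather than one confirmed by this page. Each email is first used to test the model and only then to train it; alpha = 1 recovers the plain prequential error, and smaller values forget old mistakes faster.

```python
from collections import Counter


class MajorityFolder:
    """Toy incremental learner (hypothetical): predicts the most frequent folder seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, features):
        # No prediction is possible before the first labelled email arrives.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, features, label):
        self.counts[label] += 1


def prequential_losses(stream, model):
    """Test-then-train: yield the 0/1 loss of each example before learning from it."""
    for features, label in stream:
        loss = 0.0 if model.predict(features) == label else 1.0
        model.learn(features, label)
        yield loss


def faded_prequential_error(losses, alpha=0.995):
    """Yield the faded error S_i / N_i after each loss, where
    S_i = loss_i + alpha * S_{i-1} and N_i = 1 + alpha * N_{i-1}."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    faded_loss = 0.0
    faded_count = 0.0
    for loss in losses:
        faded_loss = loss + alpha * faded_loss
        faded_count = 1.0 + alpha * faded_count
        yield faded_loss / faded_count


if __name__ == "__main__":
    # Hypothetical stream of (features, folder) pairs; features are unused by the toy learner.
    stream = [({}, "work"), ({}, "work"), ({}, "family"), ({}, "work")]
    losses = prequential_losses(stream, MajorityFolder())
    for step, error in enumerate(faded_prequential_error(losses, alpha=0.9), start=1):
        print(step, round(error, 3))
```

Any learner exposing predict and learn methods could replace the toy majority-class model here, which is the kind of algorithm comparison the abstract describes.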

Cite this Paper

BibTeX


@InProceedings{pmlr-v17-carmona11a,
  title = 	 {Using GNUsmail to Compare Data Stream Mining Methods for On-line Email Classification},
  author = 	 {Carmona-Cejudo, Jose M. and Baena-Garcia, Manuel and Campo-Avila, Jose and Morales-Bueno, Rafael and Gama, Joao and Bifet, Albert},
  booktitle = 	 {Proceedings of the Second Workshop on Applications of Pattern Analysis},
  pages = 	 {12--18},
  year = 	 {2011},
  editor = 	 {Diethe, Tom and Balcazar, Jose and Shawe-Taylor, John and Tirnauca, Cristina},
  volume = 	 {17},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {CIEM, Castro Urdiales, Spain},
  month = 	 {19--21 Oct},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v17/carmona11a/carmona11a.pdf},
  url = 	 {https://proceedings.mlr.press/v17/carmona11a.html},
  abstract = 	 {Real-time classification of emails is a challenging task because of its online nature, and also because email streams are subject to concept drift. Identifying email spam, where only two different labels or classes are defined (spam or not spam), has received great attention in the literature. We are nevertheless interested in a more specific classification where multiple folders exist, which is an additional source of complexity: the class can have a very large number of different values. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluation in data stream contexts, which is why other metrics, like the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using recently proposed mechanisms such as fading factors. In this paper, we present GNUsmail, an open-source extensible framework for email classification, and we focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different data stream mining methods, using state-of-the-art online evaluation metrics. Besides describing the framework, characterized by two overlapping phases, we show how it can be used to compare different algorithms in order to find the most appropriate one. The GNUsmail source code includes a tool for launching replicable experiments.}
}

Endnote

%0 Conference Paper
%T Using GNUsmail to Compare Data Stream Mining Methods for On-line Email Classification
%A Jose M. Carmona-Cejudo
%A Manuel Baena-Garcia
%A Jose Campo-Avila
%A Rafael Morales-Bueno
%A Joao Gama
%A Albert Bifet
%B Proceedings of the Second Workshop on Applications of Pattern Analysis
%C Proceedings of Machine Learning Research
%D 2011
%E Tom Diethe
%E Jose Balcazar
%E John Shawe-Taylor
%E Cristina Tirnauca
%F pmlr-v17-carmona11a
%I PMLR
%P 12--18
%U https://proceedings.mlr.press/v17/carmona11a.html
%V 17
%X Real-time classification of emails is a challenging task because of its online nature, and also because email streams are subject to concept drift. Identifying email spam, where only two different labels or classes are defined (spam or not spam), has received great attention in the literature. We are nevertheless interested in a more specific classification where multiple folders exist, which is an additional source of complexity: the class can have a very large number of different values. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluation in data stream contexts, which is why other metrics, like the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using recently proposed mechanisms such as fading factors. In this paper, we present GNUsmail, an open-source extensible framework for email classification, and we focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different data stream mining methods, using state-of-the-art online evaluation metrics. Besides describing the framework, characterized by two overlapping phases, we show how it can be used to compare different algorithms in order to find the most appropriate one. The GNUsmail source code includes a tool for launching replicable experiments.

RIS


TY  - CPAPER
TI  - Using GNUsmail to Compare Data Stream Mining Methods for On-line Email Classification
AU  - Jose M. Carmona-Cejudo
AU  - Manuel Baena-Garcia
AU  - Jose Campo-Avila
AU  - Rafael Morales-Bueno
AU  - Joao Gama
AU  - Albert Bifet
BT  - Proceedings of the Second Workshop on Applications of Pattern Analysis
DA  - 2011/10/21
ED  - Tom Diethe
ED  - Jose Balcazar
ED  - John Shawe-Taylor
ED  - Cristina Tirnauca
ID  - pmlr-v17-carmona11a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 17
SP  - 12
EP  - 18
L1  - http://proceedings.mlr.press/v17/carmona11a/carmona11a.pdf
UR  - https://proceedings.mlr.press/v17/carmona11a.html
AB  - Real-time classification of emails is a challenging task because of its online nature, and also because email streams are subject to concept drift. Identifying email spam, where only two different labels or classes are defined (spam or not spam), has received great attention in the literature. We are nevertheless interested in a more specific classification where multiple folders exist, which is an additional source of complexity: the class can have a very large number of different values. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluation in data stream contexts, which is why other metrics, like the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using recently proposed mechanisms such as fading factors. In this paper, we present GNUsmail, an open-source extensible framework for email classification, and we focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different data stream mining methods, using state-of-the-art online evaluation metrics. Besides describing the framework, characterized by two overlapping phases, we show how it can be used to compare different algorithms in order to find the most appropriate one. The GNUsmail source code includes a tool for launching replicable experiments.
ER  -

APA


Carmona-Cejudo, J.M., Baena-Garcia, M., Campo-Avila, J., Morales-Bueno, R., Gama, J. & Bifet, A. (2011). Using GNUsmail to Compare Data Stream Mining Methods for On-line Email Classification. Proceedings of the Second Workshop on Applications of Pattern Analysis, in Proceedings of Machine Learning Research 17:12-18. Available from https://proceedings.mlr.press/v17/carmona11a.html.
