On the Relationship Between Feature Selection and Classification Accuracy

Andreas Janecek; Wilfried Gansterer; Michael Demel; Gerhard Ecker

Proceedings of the Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery at ECML/PKDD 2008, PMLR 4:90-105, 2008.

Abstract

Dimensionality reduction and feature subset selection are two techniques for reducing the attribute space of a feature set, which is an important component of both supervised and unsupervised classification or regression problems. While in feature subset selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In this paper we investigate the relationship between several attribute space reduction techniques and the resulting classification accuracy for two very different application areas. On the one hand, we consider e-mail filtering, where the feature space contains various properties of e-mail messages, and on the other hand, we consider drug discovery problems, where quantitative representations of molecular structures are encoded in terms of information-preserving descriptor values. Subsets of the original attributes constructed by filter and wrapper techniques as well as subsets of linear combinations of the original attributes constructed by three different variants of principal component analysis (PCA) are compared in terms of the classification performance achieved with various machine learning algorithms as well as in terms of runtime performance. We successively reduce the size of the attribute sets and investigate the changes in the classification results. Moreover, we explore the relationship between the variance captured in the linear combinations within PCA and the resulting classification accuracy. The results show that the classification accuracy based on PCA is highly sensitive to the type of data and that the variance captured by the principal components is not necessarily a vital indicator for the classification performance.
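
The abstract's closing claim, that captured variance need not indicate classification performance, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the synthetic two-attribute data, the helper names, and the simple midpoint-threshold classifier are not taken from the paper. The data are built so that the two attributes are uncorrelated, which means the principal components coincide with the coordinate axes and no eigen-solver is needed.

```python
# Toy illustration only: synthetic 2-D data, not the paper's data sets.
# Attribute 0 spreads widely but is identical for both classes;
# attribute 1 has little variance but separates the classes perfectly.
xs = list(range(-10, 11))
points = [(float(x), -1.0) for x in xs] + [(float(x), 1.0) for x in xs]
labels = [0] * len(xs) + [1] * len(xs)


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Population covariance of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)


columns = [[p[0] for p in points], [p[1] for p in points]]

# Uncorrelated by construction, so the principal components are the axes.
assert abs(covariance(columns[0], columns[1])) < 1e-12

variances = [covariance(c, c) for c in columns]
ratios = [v / sum(variances) for v in variances]  # explained-variance ratios


def threshold_accuracy(values, labels):
    """Accuracy of a hypothetical one-dimensional classifier that cuts at the
    midpoint of the two class means (assumes class 1's mean is not lower)."""
    m0 = mean([v for v, y in zip(values, labels) if y == 0])
    m1 = mean([v for v, y in zip(values, labels) if y == 1])
    cut = (m0 + m1) / 2
    predictions = [1 if v > cut else 0 for v in values]
    return mean([1.0 if p == y else 0.0 for p, y in zip(predictions, labels)])


accuracies = [threshold_accuracy(c, labels) for c in columns]

for i, (r, a) in enumerate(zip(ratios, accuracies)):
    print(f"component {i}: variance ratio {r:.3f}, accuracy {a:.2f}")
```

In this constructed case the high-variance component carries no class information, while the low-variance component separates the classes. Real data are rarely this extreme, so the sketch only shows that the effect is possible, not how often it occurs.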

Cite this Paper

BibTeX


@InProceedings{pmlr-v4-janecek08a,
  title = 	 {On the Relationship Between Feature Selection and Classification Accuracy},
  author = 	 {Janecek, Andreas and Gansterer, Wilfried and Demel, Michael and Ecker, Gerhard},
  booktitle = 	 {Proceedings of the Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery at ECML/PKDD 2008},
  pages = 	 {90--105},
  year = 	 {2008},
  editor = 	 {Saeys, Yvan and Liu, Huan and Inza, Iñaki and Wehenkel, Louis and Van de Peer, Yves},
  volume = 	 {4},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Antwerp, Belgium},
  month = 	 {15 Sep},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v4/janecek08a/janecek08a.pdf},
  url = 	 {https://proceedings.mlr.press/v4/janecek08a.html},
  abstract = 	 {Dimensionality reduction and feature subset selection are two techniques for reducing the attribute space of a feature set, which is an important component of both supervised and unsupervised classification or regression problems. While in feature subset selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In this paper we investigate the relationship between several attribute space reduction techniques and the resulting classification accuracy for two very different application areas. On the one hand, we consider e-mail filtering, where the feature space contains various properties of e-mail messages, and on the other hand, we consider drug discovery problems, where quantitative representations of molecular structures are encoded in terms of information-preserving descriptor values. Subsets of the original attributes constructed by filter and wrapper techniques as well as subsets of linear combinations of the original attributes constructed by three different variants of principal component analysis (PCA) are compared in terms of the classification performance achieved with various machine learning algorithms as well as in terms of runtime performance. We successively reduce the size of the attribute sets and investigate the changes in the classification results. Moreover, we explore the relationship between the variance captured in the linear combinations within PCA and the resulting classification accuracy. The results show that the classification accuracy based on PCA is highly sensitive to the type of data and that the variance captured by the principal components is not necessarily a vital indicator for the classification performance.}
}

Endnote

%0 Conference Paper
%T On the Relationship Between Feature Selection and Classification Accuracy
%A Andreas Janecek
%A Wilfried Gansterer
%A Michael Demel
%A Gerhard Ecker
%B Proceedings of the Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery at ECML/PKDD 2008
%C Proceedings of Machine Learning Research
%D 2008
%E Yvan Saeys
%E Huan Liu
%E Iñaki Inza
%E Louis Wehenkel
%E Yves Van de Peer
%F pmlr-v4-janecek08a
%I PMLR
%P 90--105
%U https://proceedings.mlr.press/v4/janecek08a.html
%V 4
%X Dimensionality reduction and feature subset selection are two techniques for reducing the attribute space of a feature set, which is an important component of both supervised and unsupervised classification or regression problems. While in feature subset selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In this paper we investigate the relationship between several attribute space reduction techniques and the resulting classification accuracy for two very different application areas. On the one hand, we consider e-mail filtering, where the feature space contains various properties of e-mail messages, and on the other hand, we consider drug discovery problems, where quantitative representations of molecular structures are encoded in terms of information-preserving descriptor values. Subsets of the original attributes constructed by filter and wrapper techniques as well as subsets of linear combinations of the original attributes constructed by three different variants of principal component analysis (PCA) are compared in terms of the classification performance achieved with various machine learning algorithms as well as in terms of runtime performance. We successively reduce the size of the attribute sets and investigate the changes in the classification results. Moreover, we explore the relationship between the variance captured in the linear combinations within PCA and the resulting classification accuracy. The results show that the classification accuracy based on PCA is highly sensitive to the type of data and that the variance captured by the principal components is not necessarily a vital indicator for the classification performance.

RIS


TY  - CPAPER
TI  - On the Relationship Between Feature Selection and Classification Accuracy
AU  - Andreas Janecek
AU  - Wilfried Gansterer
AU  - Michael Demel
AU  - Gerhard Ecker
BT  - Proceedings of the Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery at ECML/PKDD 2008
DA  - 2008/09/11
ED  - Yvan Saeys
ED  - Huan Liu
ED  - Iñaki Inza
ED  - Louis Wehenkel
ED  - Yves Van de Peer
ID  - pmlr-v4-janecek08a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 4
SP  - 90
EP  - 105
L1  - http://proceedings.mlr.press/v4/janecek08a/janecek08a.pdf
UR  - https://proceedings.mlr.press/v4/janecek08a.html
AB  - Dimensionality reduction and feature subset selection are two techniques for reducing the attribute space of a feature set, which is an important component of both supervised and unsupervised classification or regression problems. While in feature subset selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In this paper we investigate the relationship between several attribute space reduction techniques and the resulting classification accuracy for two very different application areas. On the one hand, we consider e-mail filtering, where the feature space contains various properties of e-mail messages, and on the other hand, we consider drug discovery problems, where quantitative representations of molecular structures are encoded in terms of information-preserving descriptor values. Subsets of the original attributes constructed by filter and wrapper techniques as well as subsets of linear combinations of the original attributes constructed by three different variants of principal component analysis (PCA) are compared in terms of the classification performance achieved with various machine learning algorithms as well as in terms of runtime performance. We successively reduce the size of the attribute sets and investigate the changes in the classification results. Moreover, we explore the relationship between the variance captured in the linear combinations within PCA and the resulting classification accuracy. The results show that the classification accuracy based on PCA is highly sensitive to the type of data and that the variance captured by the principal components is not necessarily a vital indicator for the classification performance.
ER  -

APA


Janecek, A., Gansterer, W., Demel, M. & Ecker, G. (2008). On the Relationship Between Feature Selection and Classification Accuracy. Proceedings of the Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery at ECML/PKDD 2008, in Proceedings of Machine Learning Research 4:90-105. Available from https://proceedings.mlr.press/v4/janecek08a.html.
