Feature Selection for Text Classification Based on Gini Coefficient of Inequality

Ranbir Sanasam; Hema Murthy; Timothy Gonsalves

Proceedings of the Fourth International Workshop on Feature Selection in Data Mining, PMLR 10:76-85, 2010.

Abstract

A number of feature selection mechanisms have been explored in text categorization, among which mutual information, information gain and chi-square are considered most effective. In this paper, we study another method known as within class popularity to deal with feature selection based on the concept of the Gini coefficient of inequality (a commonly used measure of inequality of income). The proposed measure explores the relative distribution of a feature among different classes. From extensive experiments with four text classifiers over three datasets of different levels of heterogeneity, we observe that the proposed measure outperforms the mutual information, information gain and chi-square statistic with an average improvement of approximately 28.5%, 19% and 9.2% respectively.
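
As a rough illustration of the idea of measuring how unevenly a feature is distributed among classes, the sketch below computes the standard Gini coefficient of inequality over a term's per-class document rates. It is not the paper's within class popularity formula, which the abstract does not state; the function names, the class sizes and the counts are hypothetical values chosen for illustration only.

```python
def gini_coefficient(values):
    """Standard Gini coefficient of inequality for non-negative values.

    Returns 0.0 when all values are equal and approaches 1.0 as the
    total concentrates in a single entry.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form over ascending values x_1..x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def class_rates(doc_counts, class_sizes):
    """Fraction of documents in each class that contain the term."""
    return [count / size for count, size in zip(doc_counts, class_sizes)]


# Hypothetical toy corpus: three classes of 100 documents each.
class_sizes = [100, 100, 100]

# A term concentrated in one class versus a term spread evenly.
skewed_term = class_rates([90, 5, 5], class_sizes)
uniform_term = class_rates([100, 100, 100], class_sizes)

skewed_score = gini_coefficient(skewed_term)
uniform_score = gini_coefficient(uniform_term)
```

Under this sketch, a term that occurs mostly in one class receives a higher score than a term that occurs equally often in every class, so ranking terms by the score would favour class-specific vocabulary. Normalizing counts by class size before computing the coefficient is a design choice made here so that large classes do not dominate.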

Cite this Paper

BibTeX


@InProceedings{pmlr-v10-sanasam10a,
  title = 	 {Feature Selection for Text Classification Based on Gini Coefficient of Inequality},
  author = 	 {Sanasam, Ranbir and Murthy, Hema and Gonsalves, Timothy},
  booktitle = 	 {Proceedings of the Fourth International Workshop on Feature Selection in Data Mining},
  pages = 	 {76--85},
  year = 	 {2010},
  editor = 	 {Liu, Huan and Motoda, Hiroshi and Setiono, Rudy and Zhao, Zheng},
  volume = 	 {10},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Hyderabad, India},
  month = 	 {21 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v10/sanasam10a/sanasam10a.pdf},
  url = 	 {https://proceedings.mlr.press/v10/sanasam10a.html},
  abstract = 	 {A number of feature selection mechanisms have been explored in text categorization, among which mutual information, information gain and chi-square are considered most effective. In this paper, we study another method known as {\it within class popularity} to deal with feature selection based on the concept of the {\it Gini coefficient of inequality} (a commonly used measure of inequality of \textit{income}). The proposed measure explores the relative distribution of a feature among different classes. From extensive experiments with four text classifiers over three datasets of different levels of heterogeneity, we observe that the proposed measure outperforms the mutual information, information gain and chi-square statistic with an average improvement of approximately 28.5\%, 19\% and 9.2\% respectively.}
}

Endnote

%0 Conference Paper
%T Feature Selection for Text Classification Based on Gini Coefficient of Inequality
%A Ranbir Sanasam
%A Hema Murthy
%A Timothy Gonsalves
%B Proceedings of the Fourth International Workshop on Feature Selection in Data Mining
%C Proceedings of Machine Learning Research
%D 2010
%E Huan Liu
%E Hiroshi Motoda
%E Rudy Setiono
%E Zheng Zhao
%F pmlr-v10-sanasam10a
%I PMLR
%P 76--85
%U https://proceedings.mlr.press/v10/sanasam10a.html
%V 10
%X A number of feature selection mechanisms have been explored in text categorization, among which mutual information, information gain and chi-square are considered most effective. In this paper, we study another method known as within class popularity to deal with feature selection based on the concept of the Gini coefficient of inequality (a commonly used measure of inequality of income). The proposed measure explores the relative distribution of a feature among different classes. From extensive experiments with four text classifiers over three datasets of different levels of heterogeneity, we observe that the proposed measure outperforms the mutual information, information gain and chi-square statistic with an average improvement of approximately 28.5%, 19% and 9.2% respectively.

RIS


TY  - CPAPER
TI  - Feature Selection for Text Classification Based on Gini Coefficient of Inequality
AU  - Ranbir Sanasam
AU  - Hema Murthy
AU  - Timothy Gonsalves
BT  - Proceedings of the Fourth International Workshop on Feature Selection in Data Mining
DA  - 2010/05/26
ED  - Huan Liu
ED  - Hiroshi Motoda
ED  - Rudy Setiono
ED  - Zheng Zhao
ID  - pmlr-v10-sanasam10a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 10
SP  - 76
EP  - 85
L1  - http://proceedings.mlr.press/v10/sanasam10a/sanasam10a.pdf
UR  - https://proceedings.mlr.press/v10/sanasam10a.html
AB  - A number of feature selection mechanisms have been explored in text categorization, among which mutual information, information gain and chi-square are considered most effective. In this paper, we study another method known as within class popularity to deal with feature selection based on the concept of the Gini coefficient of inequality (a commonly used measure of inequality of income). The proposed measure explores the relative distribution of a feature among different classes. From extensive experiments with four text classifiers over three datasets of different levels of heterogeneity, we observe that the proposed measure outperforms the mutual information, information gain and chi-square statistic with an average improvement of approximately 28.5%, 19% and 9.2% respectively.
ER  -

APA


Sanasam, R., Murthy, H. & Gonsalves, T. (2010). Feature Selection for Text Classification Based on Gini Coefficient of Inequality. Proceedings of the Fourth International Workshop on Feature Selection in Data Mining, in Proceedings of Machine Learning Research 10:76-85. Available from https://proceedings.mlr.press/v10/sanasam10a.html.
