Learning Dissimilarities for Categorical Symbols

Jierui Xie; Boleslaw Szymanski; Mohammed Zaki

Proceedings of the Fourth International Workshop on Feature Selection in Data Mining, PMLR 10:97-106, 2010.

Abstract

In this paper we learn a dissimilarity measure for categorical data that supports effective classification of the data points. Each categorical feature (with values taken from a finite set of symbols) is mapped onto a continuous feature whose values are real numbers. Guided by the classification error of a nearest-neighbor-based technique, we repeatedly update the assignment of categorical symbols to real numbers to minimize this error. Intuitively, the algorithm pushes together points with the same class label, while enlarging the distances to points labeled differently. Our experiments show that 1) the learned dissimilarities improve classification accuracy by using the affinities of categorical symbols; 2) they outperform dissimilarities produced by previous data-driven methods; 3) our enhanced nearest neighbor classifier (called LD), based on the new space, is competitive with classifiers such as decision trees, RBF neural networks, Naive Bayes and support vector machines on a range of categorical datasets.
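The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' LD algorithm: the update rule, the step size `rate`, the random initialization, the L1 distance, and all function names (`init_mapping`, `distance`, `nearest`, `loo_error`, `learn`) are assumptions made for illustration. Each symbol gets a real value per feature; for every point, the values are nudged so that its nearest same-class neighbor moves closer and its nearest different-class neighbor moves away, and the mapping with the lowest leave-one-out 1-NN error seen so far is kept.

```python
import copy
import random


def init_mapping(data, seed=0):
    """Assign every symbol of every feature a random real value in [0, 1)."""
    rng = random.Random(seed)
    mapping = []
    for f in range(len(data[0])):
        symbols = sorted({row[f] for row in data})
        mapping.append({s: rng.random() for s in symbols})
    return mapping


def distance(a, b, mapping):
    """L1 distance between two points after mapping symbols to reals."""
    return sum(abs(m[x] - m[y]) for m, x, y in zip(mapping, a, b))


def nearest(i, data, mapping, candidates):
    """Index of the candidate closest to point i, or None if there is none."""
    return min(candidates,
               key=lambda j: distance(data[i], data[j], mapping),
               default=None)


def loo_error(data, labels, mapping):
    """Leave-one-out error rate of a 1-nearest-neighbor classifier."""
    n = len(data)
    wrong = 0
    for i in range(n):
        j = nearest(i, data, mapping, [k for k in range(n) if k != i])
        wrong += labels[j] != labels[i]
    return wrong / n


def learn(data, labels, steps=30, rate=0.1, seed=0):
    """Heuristic pull/push updates (an assumed rule, not the paper's)."""
    mapping = init_mapping(data, seed)
    best, best_err = copy.deepcopy(mapping), loo_error(data, labels, mapping)
    n = len(data)
    for _ in range(steps):
        for i in range(n):
            same = [j for j in range(n) if j != i and labels[j] == labels[i]]
            diff = [j for j in range(n) if labels[j] != labels[i]]
            hit = nearest(i, data, mapping, same)
            miss = nearest(i, data, mapping, diff)
            for f, m in enumerate(mapping):
                a = data[i][f]
                if hit is not None:
                    # Pull the two symbols' values toward each other.
                    b = data[hit][f]
                    gap = m[a] - m[b]
                    m[a] -= rate * gap
                    m[b] += rate * gap
                if miss is not None:
                    # Push the two symbols' values apart by a fixed step.
                    c = data[miss][f]
                    gap = m[a] - m[c]
                    sign = (gap > 0) - (gap < 0)
                    m[a] += rate * sign
                    m[c] -= rate * sign
        err = loo_error(data, labels, mapping)
        if err < best_err:  # keep the best mapping seen so far
            best, best_err = copy.deepcopy(mapping), err
    return best, best_err


# Toy data (hypothetical): the first feature determines the class,
# the second feature is uninformative.
data = [("red", "s"), ("red", "m"), ("blue", "s"),
        ("blue", "m"), ("green", "s"), ("green", "m")]
labels = [0, 0, 0, 0, 1, 1]
mapping, err = learn(data, labels)
```

Because `learn` only replaces the stored mapping when the leave-one-out error strictly decreases, the returned error can never exceed that of the initial random assignment; whether it reaches zero depends on the data and on the assumed update rule.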

Cite this Paper

BibTeX


@InProceedings{pmlr-v10-xie10a,
  title = 	 {Learning Dissimilarities for Categorical Symbols},
  author = 	 {Xie, Jierui and Szymanski, Boleslaw and Zaki, Mohammed},
  booktitle = 	 {Proceedings of the Fourth International Workshop on Feature Selection in Data Mining},
  pages = 	 {97--106},
  year = 	 {2010},
  editor = 	 {Liu, Huan and Motoda, Hiroshi and Setiono, Rudy and Zhao, Zheng},
  volume = 	 {10},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Hyderabad, India},
  month = 	 {21 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v10/xie10a/xie10a.pdf},
  url = 	 {https://proceedings.mlr.press/v10/xie10a.html},
  abstract = 	 {In this paper we learn a dissimilarity measure for categorical data, for effective classification of the data points. Each categorical feature (with values taken from a finite set of symbols) is mapped onto a continuous feature whose values are real numbers. Guided by the classification error based on a nearest neighbor based technique, we repeatedly update the assignment of categorical symbols to real numbers to minimize this error. Intuitively, the algorithm pushes together points with the same class label, while enlarging the distances to points labeled differently. Our experiments show that 1) the learned dissimilarities improve classification accuracy by using the affinities of categorical symbols; 2) they outperform dissimilarities produced by previous data-driven methods; 3) our enhanced nearest neighbor classifier (called LD) based on the new space is competitive compared with classifiers such as decision trees, RBF neural networks, Naive Bayes and support vector machines, on a range of categorical datasets.}
}

Endnote

%0 Conference Paper
%T Learning Dissimilarities for Categorical Symbols
%A Jierui Xie
%A Boleslaw Szymanski
%A Mohammed Zaki
%B Proceedings of the Fourth International Workshop on Feature Selection in Data Mining
%C Proceedings of Machine Learning Research
%D 2010
%E Huan Liu
%E Hiroshi Motoda
%E Rudy Setiono
%E Zheng Zhao
%F pmlr-v10-xie10a
%I PMLR
%P 97--106
%U https://proceedings.mlr.press/v10/xie10a.html
%V 10
%X In this paper we learn a dissimilarity measure for categorical data, for effective classification of the data points. Each categorical feature (with values taken from a finite set of symbols) is mapped onto a continuous feature whose values are real numbers. Guided by the classification error based on a nearest neighbor based technique, we repeatedly update the assignment of categorical symbols to real numbers to minimize this error. Intuitively, the algorithm pushes together points with the same class label, while enlarging the distances to points labeled differently. Our experiments show that 1) the learned dissimilarities improve classification accuracy by using the affinities of categorical symbols; 2) they outperform dissimilarities produced by previous data-driven methods; 3) our enhanced nearest neighbor classifier (called LD) based on the new space is competitive compared with classifiers such as decision trees, RBF neural networks, Naive Bayes and support vector machines, on a range of categorical datasets.

RIS


TY  - CPAPER
TI  - Learning Dissimilarities for Categorical Symbols
AU  - Jierui Xie
AU  - Boleslaw Szymanski
AU  - Mohammed Zaki
BT  - Proceedings of the Fourth International Workshop on Feature Selection in Data Mining
DA  - 2010/05/26
ED  - Huan Liu
ED  - Hiroshi Motoda
ED  - Rudy Setiono
ED  - Zheng Zhao
ID  - pmlr-v10-xie10a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 10
SP  - 97
EP  - 106
L1  - http://proceedings.mlr.press/v10/xie10a/xie10a.pdf
UR  - https://proceedings.mlr.press/v10/xie10a.html
AB  - In this paper we learn a dissimilarity measure for categorical data, for effective classification of the data points. Each categorical feature (with values taken from a finite set of symbols) is mapped onto a continuous feature whose values are real numbers. Guided by the classification error based on a nearest neighbor based technique, we repeatedly update the assignment of categorical symbols to real numbers to minimize this error. Intuitively, the algorithm pushes together points with the same class label, while enlarging the distances to points labeled differently. Our experiments show that 1) the learned dissimilarities improve classification accuracy by using the affinities of categorical symbols; 2) they outperform dissimilarities produced by previous data-driven methods; 3) our enhanced nearest neighbor classifier (called LD) based on the new space is competitive compared with classifiers such as decision trees, RBF neural networks, Naive Bayes and support vector machines, on a range of categorical datasets.
ER  -

APA


Xie, J., Szymanski, B. & Zaki, M. (2010). Learning Dissimilarities for Categorical Symbols. Proceedings of the Fourth International Workshop on Feature Selection in Data Mining, in Proceedings of Machine Learning Research 10:97-106. Available from https://proceedings.mlr.press/v10/xie10a.html.
