Learning Dissimilarities for Categorical Symbols

Jierui Xie, Boleslaw Szymanski, Mohammed Zaki
Proceedings of the Fourth International Workshop on Feature Selection in Data Mining, PMLR 10:97-106, 2010.

Abstract

In this paper we learn a dissimilarity measure for categorical data, for effective classification of the data points. Each categorical feature (with values taken from a finite set of symbols) is mapped onto a continuous feature whose values are real numbers. Guided by the classification error based on a nearest neighbor based technique, we repeatedly update the assignment of categorical symbols to real numbers to minimize this error. Intuitively, the algorithm pushes together points with the same class label, while enlarging the distances to points labeled differently. Our experiments show that 1) the learned dissimilarities improve classification accuracy by using the affinities of categorical symbols; 2) they outperform dissimilarities produced by previous data-driven methods; 3) our enhanced nearest neighbor classifier (called LD) based on the new space is competitive compared with classifiers such as decision trees, RBF neural networks, Naive Bayes and support vector machines, on a range of categorical datasets.
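
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the update rule, so this sketch swaps in a simple random hill climb that keeps a change only when the leave-one-out 1-nearest-neighbor error does not increase. All function names, parameters, and the toy data are hypothetical.

```python
import random

def init_maps(rows):
    """One dict per feature, mapping each symbol to an evenly spaced real number."""
    n_features = len(rows[0])
    return [
        {sym: float(k) for k, sym in enumerate(sorted({r[f] for r in rows}))}
        for f in range(n_features)
    ]

def dissimilarity(a, b, maps):
    """Sum over features of |value(a_f) - value(b_f)| under the current mapping."""
    return sum(abs(m[x] - m[y]) for m, x, y in zip(maps, a, b))

def loo_error(rows, labels, maps):
    """Count leave-one-out 1-nearest-neighbor misclassifications."""
    errors = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: dissimilarity(row, rows[j], maps),
        )
        errors += labels[nearest] != labels[i]
    return errors

def learn_maps(rows, labels, steps=200, scale=0.5, seed=0):
    """Assumed stand-in for the paper's update: random hill climbing.

    Perturb one symbol's real value at a time and keep the change only
    if the leave-one-out error does not get worse.
    """
    rng = random.Random(seed)
    maps = init_maps(rows)
    best = loo_error(rows, labels, maps)
    for _ in range(steps):
        f = rng.randrange(len(maps))
        sym = rng.choice(sorted(maps[f]))
        old = maps[f][sym]
        maps[f][sym] = old + rng.uniform(-scale, scale)
        err = loo_error(rows, labels, maps)
        if err <= best:
            best = err
        else:
            maps[f][sym] = old  # revert a harmful move
    return maps, best

# Toy categorical data (hypothetical): the class depends only on color.
rows = [("red", "small"), ("red", "large"), ("blue", "small"),
        ("blue", "large"), ("green", "small"), ("green", "large")]
labels = ["A", "A", "B", "B", "A", "A"]

initial_err = loo_error(rows, labels, init_maps(rows))
maps, final_err = learn_maps(rows, labels)
print(initial_err, final_err)
```

Because a move is accepted only when the error does not increase, the final error can never exceed the initial one; how close it gets to zero depends on the data and the number of steps.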

Cite this Paper


BibTeX
@InProceedings{pmlr-v10-xie10a,
  title = {Learning Dissimilarities for Categorical Symbols},
  author = {Jierui Xie and Boleslaw Szymanski and Mohammed Zaki},
  pages = {97--106},
  year = {2010},
  editor = {Huan Liu and Hiroshi Motoda and Rudy Setiono and Zheng Zhao},
  volume = {10},
  series = {Proceedings of Machine Learning Research},
  address = {Hyderabad, India},
  month = {21 Jun},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v10/xie10a/xie10a.pdf},
  url = {http://proceedings.mlr.press/v10/xie10a.html},
  abstract = {In this paper we learn a dissimilarity measure for categorical data, for effective classification of the data points. Each categorical feature (with values taken from a finite set of symbols) is mapped onto a continuous feature whose values are real numbers. Guided by the classification error based on a nearest neighbor based technique, we repeatedly update the assignment of categorical symbols to real numbers to minimize this error. Intuitively, the algorithm pushes together points with the same class label, while enlarging the distances to points labeled differently. Our experiments show that 1) the learned dissimilarities improve classification accuracy by using the affinities of categorical symbols; 2) they outperform dissimilarities produced by previous data-driven methods; 3) our enhanced nearest neighbor classifier (called LD) based on the new space is competitive compared with classifiers such as decision trees, RBF neural networks, Naive Bayes and support vector machines, on a range of categorical datasets.}
}
Endnote
%0 Conference Paper
%T Learning Dissimilarities for Categorical Symbols
%A Jierui Xie
%A Boleslaw Szymanski
%A Mohammed Zaki
%B Proceedings of the Fourth International Workshop on Feature Selection in Data Mining
%C Proceedings of Machine Learning Research
%D 2010
%E Huan Liu
%E Hiroshi Motoda
%E Rudy Setiono
%E Zheng Zhao
%F pmlr-v10-xie10a
%I PMLR
%J Proceedings of Machine Learning Research
%P 97--106
%U http://proceedings.mlr.press
%V 10
%W PMLR
%X In this paper we learn a dissimilarity measure for categorical data, for effective classification of the data points. Each categorical feature (with values taken from a finite set of symbols) is mapped onto a continuous feature whose values are real numbers. Guided by the classification error based on a nearest neighbor based technique, we repeatedly update the assignment of categorical symbols to real numbers to minimize this error. Intuitively, the algorithm pushes together points with the same class label, while enlarging the distances to points labeled differently. Our experiments show that 1) the learned dissimilarities improve classification accuracy by using the affinities of categorical symbols; 2) they outperform dissimilarities produced by previous data-driven methods; 3) our enhanced nearest neighbor classifier (called LD) based on the new space is competitive compared with classifiers such as decision trees, RBF neural networks, Naive Bayes and support vector machines, on a range of categorical datasets.
RIS
TY  - CPAPER
TI  - Learning Dissimilarities for Categorical Symbols
AU  - Jierui Xie
AU  - Boleslaw Szymanski
AU  - Mohammed Zaki
BT  - Proceedings of the Fourth International Workshop on Feature Selection in Data Mining
PY  - 2010/05/26
DA  - 2010/05/26
ED  - Huan Liu
ED  - Hiroshi Motoda
ED  - Rudy Setiono
ED  - Zheng Zhao
ID  - pmlr-v10-xie10a
PB  - PMLR
SP  - 97
DP  - PMLR
EP  - 106
L1  - http://proceedings.mlr.press/v10/xie10a/xie10a.pdf
UR  - http://proceedings.mlr.press/v10/xie10a.html
AB  - In this paper we learn a dissimilarity measure for categorical data, for effective classification of the data points. Each categorical feature (with values taken from a finite set of symbols) is mapped onto a continuous feature whose values are real numbers. Guided by the classification error based on a nearest neighbor based technique, we repeatedly update the assignment of categorical symbols to real numbers to minimize this error. Intuitively, the algorithm pushes together points with the same class label, while enlarging the distances to points labeled differently. Our experiments show that 1) the learned dissimilarities improve classification accuracy by using the affinities of categorical symbols; 2) they outperform dissimilarities produced by previous data-driven methods; 3) our enhanced nearest neighbor classifier (called LD) based on the new space is competitive compared with classifiers such as decision trees, RBF neural networks, Naive Bayes and support vector machines, on a range of categorical datasets.
ER  -
APA
Xie, J., Szymanski, B. & Zaki, M. (2010). Learning Dissimilarities for Categorical Symbols. Proceedings of the Fourth International Workshop on Feature Selection in Data Mining, in PMLR 10:97-106.
