The Coding Divergence for Measuring the Complexity of Separating Two Sets

Mahito Sugiyama; Akihiro Yamamoto

Proceedings of 2nd Asian Conference on Machine Learning, PMLR 13:127-143, 2010.

Abstract

In this paper we integrate two essential processes, discretization of continuous data and learning of a model that explains them, towards fully computational machine learning from continuous data. Discretization is fundamental for machine learning and data mining, since every continuous datum (e.g., a real-valued datum obtained by observation in the real world) must be discretized and converted from analog (continuous) to digital (discrete) form to be stored in databases. However, most machine learning methods pay no attention to this situation; i.e., they use digital data in actual applications on a computer, whereas they assume analog data (usually vectors of real numbers) in theory. To bridge the gap, we propose a novel measure of the difference between two sets of data, called the coding divergence, and computationally unify the two processes of discretization and learning. Discretization of continuous data is realized by a topological mapping (in the sense of mathematics) from the $d$-dimensional Euclidean space $\mathbb{R}^d$ into the Cantor space $\Sigma^\omega$, and the simplest model is learned in the Cantor space, which corresponds to the minimum open set separating the given two sets of data. Furthermore, we construct a classifier using the divergence and experimentally demonstrate its robust performance. Our contribution is not only introducing a new measure from the computational point of view, but also triggering more interaction between experimental science and machine learning.
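The idea of discretizing points into the Cantor space and separating two sets there can be illustrated with a small sketch. It is not the paper's definition of the coding divergence or its learning algorithm. It assumes one-dimensional data in [0, 1), uses the ordinary binary expansion as the embedding, and greedily picks, for each point of the first set, the shortest binary prefix that no point of the second set shares. The function names `binary_prefix` and `separating_prefixes` and the parameter `max_len` are hypothetical.

```python
from typing import Iterable, List, Set


def binary_prefix(x: float, length: int) -> str:
    """Return the first `length` bits of the binary expansion of x in [0, 1)."""
    if not 0.0 <= x < 1.0:
        raise ValueError("x must lie in [0, 1)")
    bits: List[str] = []
    for _ in range(length):
        x *= 2.0
        bit = int(x)  # integer part is the next bit, 0 or 1
        bits.append(str(bit))
        x -= bit
    return "".join(bits)


def separating_prefixes(
    xs: Iterable[float], ys: Iterable[float], max_len: int = 32
) -> Set[str]:
    """Illustrative assumption, not the paper's method: for each x, take the
    shortest prefix shared by no y. Each prefix stands for a basic open set
    (a cylinder) in the Cantor space, so the union covers xs and avoids ys."""
    ys = list(ys)
    chosen: Set[str] = set()
    for x in xs:
        for k in range(1, max_len + 1):
            p = binary_prefix(x, k)
            if all(binary_prefix(y, k) != p for y in ys):
                chosen.add(p)
                break
        else:
            raise ValueError(f"{x} is not separable within {max_len} bits")
    return chosen


if __name__ == "__main__":
    # Well-separated sets need a single one-bit prefix.
    print(separating_prefixes([0.1, 0.2], [0.6, 0.9]))
    # Interleaved sets need more and longer prefixes.
    print(separating_prefixes([0.1, 0.7], [0.3, 0.9]))
```

In this toy setting, the number and length of the chosen prefixes grow as the two sets become harder to separate, which is the intuition behind measuring separation complexity by code length.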

Cite this Paper

BibTeX


@InProceedings{pmlr-v13-sugiyama10b,
  title = 	 {The Coding Divergence for Measuring the Complexity of Separating Two Sets},
  author = 	 {Sugiyama, Mahito and Yamamoto, Akihiro},
  booktitle = 	 {Proceedings of 2nd Asian Conference on Machine Learning},
  pages = 	 {127--143},
  year = 	 {2010},
  editor = 	 {Sugiyama, Masashi and Yang, Qiang},
  volume = 	 {13},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Tokyo, Japan},
  month = 	 {08--10 Nov},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v13/sugiyama10b/sugiyama10b.pdf},
  url = 	 {https://proceedings.mlr.press/v13/sugiyama10b.html},
}

Endnote

%0 Conference Paper
%T The Coding Divergence for Measuring the Complexity of Separating Two Sets
%A Mahito Sugiyama
%A Akihiro Yamamoto
%B Proceedings of 2nd Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2010
%E Masashi Sugiyama
%E Qiang Yang
%F pmlr-v13-sugiyama10b
%I PMLR
%P 127--143
%U https://proceedings.mlr.press/v13/sugiyama10b.html
%V 13

RIS


TY  - CPAPER
TI  - The Coding Divergence for Measuring the Complexity of Separating Two Sets
AU  - Mahito Sugiyama
AU  - Akihiro Yamamoto
BT  - Proceedings of 2nd Asian Conference on Machine Learning
DA  - 2010/10/31
ED  - Masashi Sugiyama
ED  - Qiang Yang
ID  - pmlr-v13-sugiyama10b
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 13
SP  - 127
EP  - 143
L1  - http://proceedings.mlr.press/v13/sugiyama10b/sugiyama10b.pdf
UR  - https://proceedings.mlr.press/v13/sugiyama10b.html
ER  -

APA


Sugiyama, M. & Yamamoto, A. (2010). The Coding Divergence for Measuring the Complexity of Separating Two Sets. Proceedings of 2nd Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 13:127-143. Available from https://proceedings.mlr.press/v13/sugiyama10b.html.
