Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data

Hiroshi Sakamoto

Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data

Hiroshi Sakamoto

The 12th International Conference on Grammatical Inference, PMLR 34:3-20, 2014.

Abstract

A grammatical inference algorithm tries to find as a small grammar as possible representing a potentially infinite sequence of strings. Here, let us consider a simple restriction: the input is a finite sequence or it might be a singleton set. Then the restricted problem is called the \em grammar compression to find the smallest CFG generating just the input. In the last decade many researchers have tackled this problem because of its scalable applications, e.g., expansion of data storage capacity, speeding-up information retrieval, DNA sequencing, frequent pattern mining, and similarity search. We would review the history of grammar compression and its wide applications together with an important future work. The study of grammar compression has begun with the bad news: the smallest CFG problem is NP-hard. Hence, the first question is: Can we get a near-optimal solution in a polynomial time? (Is there a reasonable approximation algorithm?) And the next question is: Can we minimize the costs of time and space? (Does a linear time algorithm exist within an optimal working space?) The recent results produced by the research community answer affirmatively the questions. We introduce several important results and typical applications to a huge text collection. On the other hand, the shrinkage of the advantage of grammar compression is caused by the data explosion, since there is no working space for storing the whole data supplied from data stream. The last question is: How can we handle the stream data? For this question, we propose the framework of \em stream grammar compression for the next generation and its attractive application to fast data transmission.

Cite this Paper

BibTeX


@InProceedings{pmlr-v34-sakamoto14a,
  title = 	 {Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data },
  author = 	 {Sakamoto, Hiroshi},
  booktitle = 	 {The 12th International Conference on Grammatical Inference},
  pages = 	 {3--20},
  year = 	 {2014},
  editor = 	 {Clark, Alexander and Kanazawa, Makoto and Yoshinaka, Ryo},
  volume = 	 {34},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Kyoto, Japan},
  month = 	 {17--19 Sep},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v34/sakamoto14a.pdf},
  url = 	 {https://proceedings.mlr.press/v34/sakamoto14a.html},
  abstract = 	 {A grammatical inference algorithm tries to find as a small grammar as possible representing  a potentially infinite sequence of strings. Here, let us consider a simple restriction: the input is a finite sequence or it might be a singleton set. Then the restricted problem is called the \em grammar compression to find the smallest CFG generating just the input. In the last decade many researchers have tackled this problem because of its scalable applications, e.g., expansion of data storage capacity, speeding-up information retrieval, DNA sequencing, frequent pattern mining,  and similarity search. We would review the history of grammar compression and its wide applications together with an important future work. The study of grammar compression has begun with the bad news: the smallest CFG problem is NP-hard. Hence, the first question is: Can we get a near-optimal solution in a polynomial time?  (Is there a reasonable approximation algorithm?) And the next question is: Can we minimize the costs of time and space? (Does a linear time algorithm exist within an optimal working space?) The recent results produced by the research community answer affirmatively the questions. We introduce several important results and typical applications to a huge text collection. On the other hand, the shrinkage of the advantage of grammar compression is caused by the data explosion, since there is no working space for storing the whole data supplied from data stream. The last question is: How can we handle the stream data? For this question, we propose the framework of \em stream grammar compression for the next generation and its attractive application to fast data transmission.}
}

Endnote

%0 Conference Paper
%T Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data 
%A Hiroshi Sakamoto
%B The 12th International Conference on Grammatical Inference
%C Proceedings of Machine Learning Research
%D 2014
%E Alexander Clark
%E Makoto Kanazawa
%E Ryo Yoshinaka	
%F pmlr-v34-sakamoto14a
%I PMLR
%P 3--20
%U https://proceedings.mlr.press/v34/sakamoto14a.html
%V 34
%X A grammatical inference algorithm tries to find as a small grammar as possible representing  a potentially infinite sequence of strings. Here, let us consider a simple restriction: the input is a finite sequence or it might be a singleton set. Then the restricted problem is called the \em grammar compression to find the smallest CFG generating just the input. In the last decade many researchers have tackled this problem because of its scalable applications, e.g., expansion of data storage capacity, speeding-up information retrieval, DNA sequencing, frequent pattern mining,  and similarity search. We would review the history of grammar compression and its wide applications together with an important future work. The study of grammar compression has begun with the bad news: the smallest CFG problem is NP-hard. Hence, the first question is: Can we get a near-optimal solution in a polynomial time?  (Is there a reasonable approximation algorithm?) And the next question is: Can we minimize the costs of time and space? (Does a linear time algorithm exist within an optimal working space?) The recent results produced by the research community answer affirmatively the questions. We introduce several important results and typical applications to a huge text collection. On the other hand, the shrinkage of the advantage of grammar compression is caused by the data explosion, since there is no working space for storing the whole data supplied from data stream. The last question is: How can we handle the stream data? For this question, we propose the framework of \em stream grammar compression for the next generation and its attractive application to fast data transmission.

RIS


TY  - CPAPER
TI  - Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data 
AU  - Hiroshi Sakamoto
BT  - The 12th International Conference on Grammatical Inference
DA  - 2014/08/30
ED  - Alexander Clark
ED  - Makoto Kanazawa
ED  - Ryo Yoshinaka	
ID  - pmlr-v34-sakamoto14a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 34
SP  - 3
EP  - 20
L1  - http://proceedings.mlr.press/v34/sakamoto14a.pdf
UR  - https://proceedings.mlr.press/v34/sakamoto14a.html
AB  - A grammatical inference algorithm tries to find as a small grammar as possible representing  a potentially infinite sequence of strings. Here, let us consider a simple restriction: the input is a finite sequence or it might be a singleton set. Then the restricted problem is called the \em grammar compression to find the smallest CFG generating just the input. In the last decade many researchers have tackled this problem because of its scalable applications, e.g., expansion of data storage capacity, speeding-up information retrieval, DNA sequencing, frequent pattern mining,  and similarity search. We would review the history of grammar compression and its wide applications together with an important future work. The study of grammar compression has begun with the bad news: the smallest CFG problem is NP-hard. Hence, the first question is: Can we get a near-optimal solution in a polynomial time?  (Is there a reasonable approximation algorithm?) And the next question is: Can we minimize the costs of time and space? (Does a linear time algorithm exist within an optimal working space?) The recent results produced by the research community answer affirmatively the questions. We introduce several important results and typical applications to a huge text collection. On the other hand, the shrinkage of the advantage of grammar compression is caused by the data explosion, since there is no working space for storing the whole data supplied from data stream. The last question is: How can we handle the stream data? For this question, we propose the framework of \em stream grammar compression for the next generation and its attractive application to fast data transmission.
ER  -

APA


Sakamoto, H.. (2014). Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data . The 12th International Conference on Grammatical Inference, in Proceedings of Machine Learning Research 34:3-20 Available from https://proceedings.mlr.press/v34/sakamoto14a.html.

Related Material

Download PDF