Bootstrapping and Learning PDFA in Data Streams

Borja Balle, Jorge Castro, Ricard Gavaldà
Proceedings of the Eleventh International Conference on Grammatical Inference, PMLR 21:34-48, 2012.

Abstract

Markovian models with hidden state are widely-used formalisms for modeling sequential phenomena. Learnability of these models has been well studied when the sample is given in batch mode, and algorithms with PAC-like learning guarantees exist for specific classes of models such as Probabilistic Deterministic Finite Automata (PDFA). Here we focus on PDFA and give an algorithm for infering models in this class under the stringent \emphdata stream scenario: unlike existing methods, our algorithm works incrementally and in one pass, uses memory sublinear in the stream length, and processes input items in amortized constant time. We provide rigorous PAC-like bounds for all of the above, as well as an evaluation on synthetic data showing that the algorithm performs well in practice. Our algorithm makes a key usage of several old and new sketching techniques. In particular, we develop a new sketch for implementing bootstrapping in a streaming setting which may be of independent interest. In experiments we have observed that this sketch yields important reductions in the examples required for performing some crucial statistical tests in our algorithm.

Cite this Paper


BibTeX
@InProceedings{pmlr-v21-balle12a, title = {Bootstrapping and Learning PDFA in Data Streams}, author = {Balle, Borja and Castro, Jorge and Gavaldà, Ricard}, booktitle = {Proceedings of the Eleventh International Conference on Grammatical Inference}, pages = {34--48}, year = {2012}, editor = {Heinz, Jeffrey and Higuera, Colin and Oates, Tim}, volume = {21}, series = {Proceedings of Machine Learning Research}, address = {University of Maryland, College Park, MD, USA}, month = {05--08 Sep}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v21/balle12a/balle12a.pdf}, url = {https://proceedings.mlr.press/v21/balle12a.html}, abstract = {Markovian models with hidden state are widely-used formalisms for modeling sequential phenomena. Learnability of these models has been well studied when the sample is given in batch mode, and algorithms with PAC-like learning guarantees exist for specific classes of models such as Probabilistic Deterministic Finite Automata (PDFA). Here we focus on PDFA and give an algorithm for infering models in this class under the stringent \emphdata stream scenario: unlike existing methods, our algorithm works incrementally and in one pass, uses memory sublinear in the stream length, and processes input items in amortized constant time. We provide rigorous PAC-like bounds for all of the above, as well as an evaluation on synthetic data showing that the algorithm performs well in practice. Our algorithm makes a key usage of several old and new sketching techniques. In particular, we develop a new sketch for implementing bootstrapping in a streaming setting which may be of independent interest. In experiments we have observed that this sketch yields important reductions in the examples required for performing some crucial statistical tests in our algorithm.} }
Endnote
%0 Conference Paper %T Bootstrapping and Learning PDFA in Data Streams %A Borja Balle %A Jorge Castro %A Ricard Gavaldà %B Proceedings of the Eleventh International Conference on Grammatical Inference %C Proceedings of Machine Learning Research %D 2012 %E Jeffrey Heinz %E Colin Higuera %E Tim Oates %F pmlr-v21-balle12a %I PMLR %P 34--48 %U https://proceedings.mlr.press/v21/balle12a.html %V 21 %X Markovian models with hidden state are widely-used formalisms for modeling sequential phenomena. Learnability of these models has been well studied when the sample is given in batch mode, and algorithms with PAC-like learning guarantees exist for specific classes of models such as Probabilistic Deterministic Finite Automata (PDFA). Here we focus on PDFA and give an algorithm for infering models in this class under the stringent \emphdata stream scenario: unlike existing methods, our algorithm works incrementally and in one pass, uses memory sublinear in the stream length, and processes input items in amortized constant time. We provide rigorous PAC-like bounds for all of the above, as well as an evaluation on synthetic data showing that the algorithm performs well in practice. Our algorithm makes a key usage of several old and new sketching techniques. In particular, we develop a new sketch for implementing bootstrapping in a streaming setting which may be of independent interest. In experiments we have observed that this sketch yields important reductions in the examples required for performing some crucial statistical tests in our algorithm.
RIS
TY - CPAPER TI - Bootstrapping and Learning PDFA in Data Streams AU - Borja Balle AU - Jorge Castro AU - Ricard Gavaldà BT - Proceedings of the Eleventh International Conference on Grammatical Inference DA - 2012/08/16 ED - Jeffrey Heinz ED - Colin Higuera ED - Tim Oates ID - pmlr-v21-balle12a PB - PMLR DP - Proceedings of Machine Learning Research VL - 21 SP - 34 EP - 48 L1 - http://proceedings.mlr.press/v21/balle12a/balle12a.pdf UR - https://proceedings.mlr.press/v21/balle12a.html AB - Markovian models with hidden state are widely-used formalisms for modeling sequential phenomena. Learnability of these models has been well studied when the sample is given in batch mode, and algorithms with PAC-like learning guarantees exist for specific classes of models such as Probabilistic Deterministic Finite Automata (PDFA). Here we focus on PDFA and give an algorithm for infering models in this class under the stringent \emphdata stream scenario: unlike existing methods, our algorithm works incrementally and in one pass, uses memory sublinear in the stream length, and processes input items in amortized constant time. We provide rigorous PAC-like bounds for all of the above, as well as an evaluation on synthetic data showing that the algorithm performs well in practice. Our algorithm makes a key usage of several old and new sketching techniques. In particular, we develop a new sketch for implementing bootstrapping in a streaming setting which may be of independent interest. In experiments we have observed that this sketch yields important reductions in the examples required for performing some crucial statistical tests in our algorithm. ER -
APA
Balle, B., Castro, J. & Gavaldà, R.. (2012). Bootstrapping and Learning PDFA in Data Streams. Proceedings of the Eleventh International Conference on Grammatical Inference, in Proceedings of Machine Learning Research 21:34-48 Available from https://proceedings.mlr.press/v21/balle12a.html.

Related Material