fAST: regular expression inference from positive examples using Abstract Syntax Trees

Maxime Raynal, Marc-Olivier Buob, Georges Quénot
Proceedings of 16th edition of the International Conference on Grammatical Inference, PMLR 217:96-116, 2023.

Abstract

Our paper presents a new algorithm that infers a regular expression matching a given set of strings known as \emph{positive examples}\footnote{Positive (resp. negative) examples are strings that do (resp. do not) belong to the target language.}. This algorithm has practical applications in automating file parsing for files with an unknown template. In practice, prior works hardly apply because they require negative examples and hardly scale. By restricting to positive examples, the problem becomes especially challenging: many regular expressions can match the set of positive examples, but only a few are useful in practice. To assess the quality of a regular expression, we introduce two performance metrics, called accuracy and conciseness. The contributions of the paper are threefold. First, we introduce an algorithm that infers a regular expression from positive examples only while optimizing accuracy and conciseness. Second, we adapt this algorithm to generate a regular expression based on a set of predefined patterns. Third, we demonstrate the tractability and the usefulness of our solution by performing experiments on synthesized and real-world datasets.

Cite this Paper


BibTeX
@InProceedings{pmlr-v217-raynal23a, title = {fAST: regular expression inference from positive examples using Abstract Syntax Trees}, author = {Raynal, Maxime and Buob, Marc-Olivier and Qu\'enot, Georges}, booktitle = {Proceedings of 16th edition of the International Conference on Grammatical Inference}, pages = {96--116}, year = {2023}, editor = {Coste, François and Ouardi, Faissal and Rabusseau, Guillaume}, volume = {217}, series = {Proceedings of Machine Learning Research}, month = {10--13 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v217/raynal23a/raynal23a.pdf}, url = {https://proceedings.mlr.press/v217/raynal23a.html}, abstract = {Our paper presents a new algorithm that infers a regular expression matching a given set of strings known as \emph{positive examples}\footnote{Positive (resp. negative) examples are strings that do (resp. do not) belong to the target language.}. This algorithm has practical applications in automating file parsing for files with an unknown template. In practice, prior works hardly apply because they require negative examples and hardly scale. By restricting to positive examples, the problem becomes especially challenging: many regular expressions can match the set of positive examples, but only a few are useful in practice. To assess the quality of a regular expression, we introduce two performance metrics, called accuracy and conciseness. The contributions of the paper are threefold. First, we introduce an algorithm that infers a regular expression from positive examples only while optimizing accuracy and conciseness. Second, we adapt this algorithm to generate a regular expression based on a set of predefined patterns. Third, we demonstrate the tractability and the usefulness of our solution by performing experiments on synthesized and real-world datasets.} }
Endnote
%0 Conference Paper %T fAST: regular expression inference from positive examples using Abstract Syntax Trees %A Maxime Raynal %A Marc-Olivier Buob %A Georges Quénot %B Proceedings of 16th edition of the International Conference on Grammatical Inference %C Proceedings of Machine Learning Research %D 2023 %E François Coste %E Faissal Ouardi %E Guillaume Rabusseau %F pmlr-v217-raynal23a %I PMLR %P 96--116 %U https://proceedings.mlr.press/v217/raynal23a.html %V 217 %X Our paper presents a new algorithm that infers a regular expression matching a given set of strings known as \emph{positive examples}\footnote{Positive (resp. negative) examples are strings that do (resp. do not) belong to the target language.}. This algorithm has practical applications in automating file parsing for files with an unknown template. In practice, prior works hardly apply because they require negative examples and hardly scale. By restricting to positive examples, the problem becomes especially challenging: many regular expressions can match the set of positive examples, but only a few are useful in practice. To assess the quality of a regular expression, we introduce two performance metrics, called accuracy and conciseness. The contributions of the paper are threefold. First, we introduce an algorithm that infers a regular expression from positive examples only while optimizing accuracy and conciseness. Second, we adapt this algorithm to generate a regular expression based on a set of predefined patterns. Third, we demonstrate the tractability and the usefulness of our solution by performing experiments on synthesized and real-world datasets.
APA
Raynal, M., Buob, M. & Quénot, G.. (2023). fAST: regular expression inference from positive examples using Abstract Syntax Trees. Proceedings of 16th edition of the International Conference on Grammatical Inference, in Proceedings of Machine Learning Research 217:96-116 Available from https://proceedings.mlr.press/v217/raynal23a.html.

Related Material