[edit]
fAST: regular expression inference from positive examples using Abstract Syntax Trees
Proceedings of 16th edition of the International Conference on Grammatical Inference, PMLR 217:96-116, 2023.
Abstract
Our paper presents a new algorithm that infers a regular expression matching a given set of strings known as \emph{positive examples}\footnote{Positive (resp. negative) examples are strings that do (resp. do not) belong to the target language.}. This algorithm has practical applications in automating file parsing for files with an unknown template. In practice, prior works hardly apply because they require negative examples and hardly scale. By restricting to positive examples, the problem becomes especially challenging: many regular expressions can match the set of positive examples, but only a few are useful in practice. To assess the quality of a regular expression, we introduce two performance metrics, called accuracy and conciseness. The contributions of the paper are threefold. First, we introduce an algorithm that infers a regular expression from positive examples only while optimizing accuracy and conciseness. Second, we adapt this algorithm to generate a regular expression based on a set of predefined patterns. Third, we demonstrate the tractability and the usefulness of our solution by performing experiments on synthesized and real-world datasets.