Extracting Grammar from Programs: Evolutionary Approach
Matej Črepinšek1 , Marjan Mernik1 ,
Faizan Javed2, Barrett R. Bryant2 , and Alan Sprague2
1
University of Maribor,
Faculty of Electrical Engineering and Computer Science,
Smetanova 17, 2000 Maribor, Slovenia
{matej.crepinsek, marjan.mernik}@uni-mb.si
2
The University of Alabama at Birmingham,
Department of Computer and Information Sciences,
Birmingham, AL 35294-1170, U.S.A.
{javedf, bryant, sprague}@cis.uab.edu
Abstract. The paper discusses context-free grammar (CFG) inference using genetic programming,
with application to inducing grammars from programs written in simple domain-specific languages. Grammar-specific heuristic operators and non-random construction of the initial population are proposed to achieve this task. The suitability of the approach is shown by small examples
where the underlying CFGs are successfully inferred.
Keywords. Grammar induction, Grammar inference, Learning from positive and negative examples, Genetic programming
1 Introduction
In the accompanying paper [15] we discussed the search space of regular and context-free grammar
inference. The conclusion reached was that owing to the large search space, the exhaustive (brute-force)
approach to grammar induction could only be applied to small positive samples. Hence, a need for a
different and more efficient approach to explore the search space arose. Evolutionary computation [16]
is particularly suitable for such kinds of problems. In fact, genetic algorithms have already been applied
to the grammar inference problem, with varying results. In this paper another evolutionary approach,
Genetic Programming (GP), to CFG learning is presented. Genetic programming [3] is a successful
technique for getting computers to automatically solve problems. It has been successfully used in a
wide variety of application domains such as data mining, image classification and robotic control. In
general, genetic programming works well for problems where solutions can be expressed with a modestly
short program. For example, methods working on typical data structures such as stacks, queues and
lists have been successfully evolved using genetic programming in [4]. Specifications (BNF) for domain-specific languages are small enough that we can expect a successful solution to be found using
genetic programming. Our previous work [6] was successful in inferring small context-free grammars
from positive and negative samples. This paper elaborates on our recent research findings and builds
on our previous work.
2 Related Work
The impact of different representations of grammars was explored in [14] where experimental results
showed that an evolutionary algorithm using standard context-free grammars (BNF) outperforms those
using Greibach Normal Form (GNF), Chomsky Normal Form (CNF) or bit-string representations [5].
This performance differential was attributed to the larger grammar search space of the other representations, a consequence of their more complex grammar form. The experimental
assessment in [14] was very limited due to the large processing time (processing of one generation
took several hours; using our system, processing of one generation takes just a few seconds). This was
due to the use of a chart parser, which is commonly used in natural language parsing and can accept
ACM SIGPLAN Notices
39
Vol. 40(4), Apr 2005
ambiguous grammars as well. With this approach a grammar was successfully inferred for the language of correctly balanced and nested brackets. In [1] a genetic algorithm was used on the problem
of merging states in the prefix-tree automaton of regular grammar inference. It was shown that the
genetic algorithm performs as well as other regular grammar inference algorithms (e.g., RPNI).
Variable-length chromosomes with introns were used in stochastic context-free grammar induction in
[17]. A genetic algorithm was also used on the problem of labelling nonterminals (the partitioning problem of
nonterminals) in context-free grammar inference using completely structured [10] or partially structured
samples [11].
3 Genetically Generated Grammars
3.1 Previous work
In this section a short overview of our previous work is presented. For more details see [6]. To infer
context-free grammars for domain-specific languages, the genetic programming approach was adopted.
In genetic programming, a program is constructed from terminal set T and user-defined function set F.
The set T contains variables and constants and the set F contains functions that are a priori believed
to be useful for the problem domain. In our case, the set T consists of terminal symbols defined with
regular expressions and the set F consists of nonterminal symbols. From these two sets appropriate
grammars can be evolved, which can be seen as a domain-specific language for expressing the syntax. For
effective use of an evolutionary algorithm we have to choose a suitable representation of the problem,
suitable genetic operators and parameters, and the evaluation function to determine the fitness of
chromosomes. For the encoding of a grammar into a chromosome we used a direct encoding as a list of
BNF production rules as suggested in [14] since this encoding outperforms bit-string representations.
Our earlier GP system starts with a population of randomly generated grammars [6], where the
following additional control parameters that prevent grammars from becoming too large have been
introduced:
– max prod size: maximum number of productions of one grammar,
– max RHS size: maximum number of right-hand symbols of one production.
Furthermore, specific one-point crossover, mutation and heuristic operators have been proposed
as genetic operators. The one-point crossover is performed in the following manner: two grammars
are chosen randomly and are cut at the same random position; the second halves are then swapped
between the two grammars. To ensure that both offspring are legal grammars after crossover,
the breakpoint position cannot fall in the middle of a production rule. The breakpoint position is
chosen randomly from the smaller of the two grammars selected for crossover. An example of the crossover
operation is presented in Figure 1. After crossover, grammars undergo mutation, where a symbol in a
randomly chosen production is mutated. An example of the mutation operator is presented in Figure 2.
[Figure: two parent grammars are cut at the same production boundary (the crossover point) and their second halves are swapped, yielding two offspring grammars.]
Fig. 1. The crossover operator
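The one-point crossover and mutation just described can be sketched as follows. The representation (a grammar as a list of productions, each a left-hand nonterminal paired with a list of right-hand-side symbols) and the helper names are our own illustration, not the authors' implementation.

```python
import random

def one_point_crossover(g1, g2, rng=random):
    """Cut both grammars at the same production boundary and swap tails.

    The cut point falls only on production boundaries, so both offspring
    are again well-formed lists of productions; it is drawn from the
    smaller grammar so it is valid for both.  Assumes both grammars
    contain at least two productions.
    """
    point = rng.randint(1, min(len(g1), len(g2)) - 1)
    return g1[:point] + g2[point:], g2[:point] + g1[point:]

def mutate(grammar, symbols, rng=random):
    """Replace one random RHS symbol in one random production."""
    g = [(lhs, rhs[:]) for lhs, rhs in grammar]  # copy, leave parent intact
    lhs, rhs = g[rng.randrange(len(g))]
    if rhs:  # epsilon productions have nothing to mutate
        rhs[rng.randrange(len(rhs))] = rng.choice(symbols)
    return g
```

A grammar such as E → int T, T → operator E, T → ε would be encoded as `[("E", ["int", "T"]), ("T", ["operator", "E"]), ("T", [])]`.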
To enhance the search, the following heuristic operators have been proposed:
[Figure: mutation of a single symbol in a randomly chosen production; the grammar E → int T, T → E E becomes E → int T, T → operator E.]
Fig. 2. The mutation operator
– option operator,
– iteration* operator and
– iteration+ operator
which exploit knowledge of grammars, namely extended BNF (EBNF), where grammar symbols
often appear optionally or iteratively. The heuristic operators work in a similar manner to the mutation
operator: a symbol in a randomly chosen production is made to appear optionally or iteratively. An example
of the option operator is presented in Figure 3.
[Figure: applying the option operator to the symbol E in T → operator E transforms the grammar E → int T, T → operator E, T → ε into E → int T, T → operator F, T → ε, F → E, F → ε.]
Fig. 3. The option operator
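As a minimal sketch under the same hypothetical list-of-productions representation (the name `fresh_nt` is ours), the option operator replaces a randomly chosen right-hand-side symbol X with a fresh nonterminal F and adds the productions F → X and F → ε, making X optional:

```python
import random
from itertools import count

_fresh = count(1)

def fresh_nt():
    """Generate a nonterminal name not used so far (illustrative only)."""
    return f"F{next(_fresh)}"

def option_operator(grammar, rng=random):
    """Make one randomly chosen RHS symbol optional, EBNF-style [X]."""
    g = [(lhs, rhs[:]) for lhs, rhs in grammar]
    candidates = [i for i, (_, rhs) in enumerate(g) if rhs]
    if not candidates:          # only epsilon productions: nothing to do
        return g
    lhs, rhs = g[rng.choice(candidates)]
    j = rng.randrange(len(rhs))
    f = fresh_nt()
    old = rhs[j]
    rhs[j] = f                  # X is replaced by F ...
    g.append((f, [old]))        # ... with F -> X
    g.append((f, []))           # ... and F -> epsilon
    return g
```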
Similar transformations on grammars are performed by the iteration* and iteration+ operators.
To ensure that a chromosome represents a legal grammar after crossover, mutation and deletion, a
special procedure is performed in which unreachable or superfluous nonterminal symbols are detected
and eliminated.
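This cleanup step can be sketched as a standard reachability pass from the start symbol. The representation is again our hypothetical list of (lhs, rhs) productions, with the first production's left-hand side taken as the start symbol:

```python
def eliminate_unreachable(grammar):
    """Drop productions whose LHS is not reachable from the start symbol."""
    if not grammar:
        return grammar
    start = grammar[0][0]
    nonterminals = {lhs for lhs, _ in grammar}
    reachable = {start}
    changed = True
    while changed:              # fixed-point iteration over productions
        changed = False
        for lhs, rhs in grammar:
            if lhs in reachable:
                for sym in rhs:
                    if sym in nonterminals and sym not in reachable:
                        reachable.add(sym)
                        changed = True
    return [(lhs, rhs) for lhs, rhs in grammar if lhs in reachable]
```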
Chromosomes were evaluated at the end of each generation by testing each grammar on a set of
positive and negative samples. For each grammar in the population an LR(1) parser was automatically
generated using the compiler generator tool LISA [7]. The generated parser was then run on fitness
cases (Fig. 4). A grammar’s fitness value is proportional to the length of the correctly parsed positive
sample; thus it is desirable to have a grammar which accepts all the positive samples and rejects all
the negative samples.
Many grammars can be concocted which reject the negative samples. However, our search converges
to the desired grammar faster when we obtain grammars which accept the positive samples; hence it is
counterproductive to search only in the space of grammars which reject the negative samples. Negative
samples are therefore taken into account only when a grammar is capable of accepting all the positive samples.
Another reason is that negative samples are needed mainly to prevent overgeneralization of grammars
[2]. With these facts in view, the fitness value of each grammar is defined to be between 0 and 1,
where the interval 0 .. 0.5 denotes that the grammar did not recognize all positive samples and the interval 0.5
.. 1 denotes that the grammar recognized all positive samples but did not reject all negative samples. A
grammar with fitness value of 1 signifies that the generated LR(1) parser successfully parsed all positive
samples and rejected all negative samples. For a given grammar[i], its fitness f_j(grammar[i]) on the
j-th fitness case is defined as:

    f_j(grammar[i]) = s / (length(program_j) * 2),   where s = length(successfully parsed program_j)
The total fitness f(grammar[i]) is defined as:

    f(grammar[i]) = (Σ_{k=1..N} f_k(grammar[i])) / N

A grammar is tested on the negative sample set only if it successfully recognizes all positive samples.
Here, the portion of a successfully parsed negative sample is not important. Therefore, its fitness value
is defined as:

    f(grammar[i]) = 1.0 − m / (M * 2)

where m = number of recognized negative samples and M = number of all negative samples.
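Taken together, the fitness computation can be sketched as below. Here `parse_prefix_length` stands in for running the generated LR(1) parser; it is an assumed interface, not part of LISA:

```python
def fitness(parse_prefix_length, grammar, positives, negatives):
    """Fitness in [0, 1]: below 0.5 while positives fail, 0.5 and above after.

    parse_prefix_length(grammar, program) must return the length of the
    longest correctly parsed prefix of `program` (equal to len(program)
    when the whole program is accepted).  This interface is an assumption.
    """
    # f_j = length(successfully parsed program_j) / (length(program_j) * 2)
    per_case = [parse_prefix_length(grammar, p) / (len(p) * 2)
                for p in positives]
    f = sum(per_case) / len(per_case)
    if f < 0.5:                  # not all positive samples accepted yet
        return f
    # All positives accepted: penalize each fully accepted negative sample.
    m = sum(1 for n in negatives
            if parse_prefix_length(grammar, n) == len(n))
    return 1.0 - m / (len(negatives) * 2)
```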
[Figure: for each grammar in the population, the LISA compiler generator produces a parser; each generated parser is run on the fitness cases (positive and negative samples), and the success of parsing yields the fitness value that drives selection, crossover and mutation in the evolutionary process.]
Fig. 4. The evaluation of chromosomes
Experiments performed in [6] show that with the described approach context-free grammars can be
inferred. However, these context-free grammars have only a small number of productions (e.g., up to 5). For
example, we were able to infer the context-free grammar of a language for simple robot movement [6]:
NT1 -> #Begin NT2 #End
NT2 -> #Command NT2
NT2 -> epsilon
or of a language for nested begin end blocks:
NT1 -> #Begin NT2 #End
NT2 -> NT1 NT2
NT2 -> epsilon
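In our experiments the induced grammars were checked with automatically generated LR(1) parsers; purely as an illustration (our own code, not part of the system), the nested-blocks grammar above can be recognized by a small hand-written recursive-descent recognizer:

```python
def parse_nt1(tokens, i):
    """NT1 -> #Begin NT2 #End; return index after NT1, or None on failure."""
    if i < len(tokens) and tokens[i] == "#Begin":
        i = parse_nt2(tokens, i + 1)
        if i is not None and i < len(tokens) and tokens[i] == "#End":
            return i + 1
    return None

def parse_nt2(tokens, i):
    """NT2 -> NT1 NT2 | epsilon; consume nested blocks while possible."""
    while True:
        j = parse_nt1(tokens, i)
        if j is None:
            return i
        i = j

def accepts(tokens):
    """True iff the whole token list derives from NT1."""
    return parse_nt1(tokens, 0) == len(tokens)
```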
On examples where the underlying context-free grammar was more complex, this approach was not
successful.
3.2 New ideas and approaches
It was observed that the heuristic operators considerably improved the search process, resulting in fitter
induced grammars. Currently, we are using slightly modified versions of the aforementioned heuristic
operators. The sequence of right-hand symbols (not just a single symbol) can appear optionally or
iteratively. However, we were still unable to induce bigger grammars, despite the fact that the respective
sub-grammars had previously been induced in a separate process. For example, we were successful in finding
a context-free grammar for simple expressions and for simple assignment statements where the right-hand
expression can only be a numeric value. But when both sub-languages were combined, our earlier
approach failed to find a solution.
Example: The evolution of a grammar for assignment statements with simple arithmetic expressions on
the right side
G=50, pop_size=500, pc=0.4, pm=0.4, pheuristic=0.2, max_prod_size=8, max_RHS_size=5
pos num cases (N = 4)
a := 9 + 2
a := 10 + 2 + 12
abc := 22
i := 0
j := 1
abc := 22 + 3
i := 10 + 2 + 3
j := 1
neg num cases (M = 5)
22 := d
d := := 32
:= 2
a :=
+6
Upon closer analysis of our results, it became clear that the randomly generated initial population
was an impediment to the induction process. The search space of all possible grammars is vast;
to narrow it, the initial population should exploit knowledge from the
positive samples by generating a few valid derivation trees through simple composition of consecutive symbols.
For example, Fig. 5 presents one of the possible derivation trees, and the corresponding context-free grammar, for the
positive sample a := 9 + 2 of the aforementioned example. During this process a sequence
of non-terminal symbols which appear iteratively can be detected and an appropriate grammar
constructed (see Figs. 6 and 7). Apart from this change in the construction of the initial population,
the genetic programming system remained the same. Using this simple enhancement we were able to
induce a context-free grammar for this example (Fig. 8).
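The construction of one such derivation tree can be sketched by repeatedly pairing consecutive symbols under fresh nonterminals; this is our own minimal illustration of the idea, not the authors' exact procedure:

```python
def compose_tree(tokens):
    """Build a grammar bottom-up by pairing consecutive symbols.

    Each distinct terminal first gets its own nonterminal; then adjacent
    nonterminals are merged pairwise under fresh nonterminals until a
    single start symbol covers the whole positive sample.
    """
    productions = []
    counter = 0
    def fresh():
        nonlocal counter
        counter += 1
        return f"NT{counter}"
    # one nonterminal per distinct terminal symbol
    nt_of = {}
    layer = []
    for t in tokens:
        if t not in nt_of:
            nt_of[t] = fresh()
            productions.append((nt_of[t], [t]))
        layer.append(nt_of[t])
    # pairwise composition of consecutive symbols
    while len(layer) > 1:
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            nt = fresh()
            productions.append((nt, [layer[i], layer[i + 1]]))
            nxt.append(nt)
        if len(layer) % 2:       # an odd symbol is carried upwards
            nxt.append(layer[-1])
        layer = nxt
    return layer[0], productions
```

For the tokenized sample `["#id", ":=", "#int", "+", "#int"]` this yields an 8-production grammar rooted in a single start symbol, much like the tree in Fig. 5.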
[Figure: derivation tree for a := 9 + 2 together with the corresponding grammar:]
NT8 → NT7 NT1
NT7 → NT5 NT6
NT6 → NT1 NT3
NT5 → NT4 NT2
NT4 → #id
NT3 → +
NT2 → :=
NT1 → #int
Fig. 5. One possible derivation tree for positive sample a := 9 + 2
Yet, in some other cases inference of the underlying context-free grammar was still unsuccessful,
simply because composition of consecutive symbols is not always correct. What we need is to identify sub-languages and construct derivation trees for sub-programs first, but this is as hard as the original
problem. Since using completely structured [10] or partially structured samples [11] is impractical, we
use an approximation: frequent sequences. A string of symbols is called a frequent sequence if it
appears at least θ times, where θ is some preset threshold. Consider the above example of assignment
[Figure: partial derivation tree for a := 10 + 2 + 12 with grammar:]
NT6 → NT1 NT3
NT5 → NT4 NT2
NT4 → #id
NT3 → +
NT2 → :=
NT1 → #int
Fig. 6. Construction of derivation tree for positive sample a := 10 + 2 + 12 to the point where iteration of
non-terminal NT6 was detected
[Figure: derivation tree for a := 10 + 2 + 12, with the iteration of NT6 captured by the recursive nonterminal NT7:]
NT9 → NT8 NT1
NT8 → NT5 NT7
NT7 → NT6 NT7
NT7 → ε
NT6 → NT1 NT3
NT5 → NT4 NT2
NT4 → #id
NT3 → +
NT2 → :=
NT1 → #int
Fig. 7. One possible derivation tree for positive sample a := 10 + 2 + 12
[Figure: the inferred grammar:]
NT9 → NT8 NT7 NT9
NT9 → ε
NT8 → NT4 NT5
NT7 → NT6 NT7
NT7 → ε
NT6 → NT3 NT1
NT5 → NT2 NT1
NT4 → #id
NT3 → +
NT2 → :=
NT1 → #int
Fig. 8. Correct context-free grammar was found in generation 21
statements with simple arithmetic expressions on the right side. Some frequent sequences of length 2
in 4 positive cases are:
pair          occurrences
#id #oper=    8
#oper= #int   8
#int #oper+   6
#oper+ #int   6
Our basic idea is to construct an initial derivation tree in which frequent sequences are recognized
by a single nonterminal. For example, we might adjoin productions FR1 → #id #oper=, or FR2 →
#int #oper+ into the initial grammar and then construct a valid derivation tree by composition of
consecutive symbols.
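Counting frequent sequences of length 2 can be sketched as a simple sliding-window count over the tokenized positive samples (the function and parameter names here are ours):

```python
from collections import Counter

def frequent_pairs(samples, theta):
    """Return token pairs occurring at least `theta` times across samples.

    `samples` is a list of token lists; a pair is two consecutive tokens.
    """
    counts = Counter()
    for tokens in samples:
        for a, b in zip(tokens, tokens[1:]):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= theta}
```

For the assignment-statement samples above, pairs such as (#id, #oper=) and (#oper+, #int) cross a threshold of θ = 6 and become candidates for FR productions.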
Table 1. Inferred grammars for some DSLs.

DSL: video store (G = 28, pop size = 300, N = 8, M = 8)
  Inferred grammar:                 An example of positive sample:
  NT15 → NT11 NT7 NT15              jurassicpark child
  NT15 → ε                          roadtrip reg
  NT11 → NT10 NT6                   ring new
  NT10 → NT5 NT10                   andy 3 jurassicpark child
  NT10 → ε                          2 roadtrip reg
  NT7 → NT5 NT7                     ann 3 ring new
  NT7 → ε
  NT6 → #name #days
  NT5 → #title #type

DSL: stock and sales (G = 1, pop size = 300, N = 10, M = 3)
  Inferred grammar:                 An example of positive sample:
  NT6 → NT5 NT3                     stock description
  NT5 → NT4 #sales                  twix 0.70 10
  NT4 → #stock NT2                  mars 0.65 12
  NT2 → FR2 NT2                     bar 1.09 5
  NT2 → ε                           sales description
  FR2 → #item #price #qty           mars 0.65
  NT3 → FR0 NT3                     twix 0.70
  NT3 → ε                           mars 0.65
  FR0 → #item #price

DSL: DESK (G = 1, pop size = 300, N = 5, M = 3)
  Inferred grammar:                 An example of positive sample:
  NT7 → NT6 NT3                     print a + b
  NT6 → NT5 #where                  where a = 10 b = 20
  NT5 → NT4 NT2
  NT4 → #print #id
  NT3 → FR4 NT3
  NT3 → ε
  NT2 → FR3 NT2
  NT2 → ε
  FR4 → #id #assign #int
  FR3 → #plus #id

DSL: FDL (G = 127, pop size = 300, N = 8, M = 4)
  Inferred grammar:                 An example of positive sample:
  NT7 → NT2 NT7                     c : all(c1, more-of(f4, f5))
  NT7 → ε                           c1 : one-of(f1, c2)
  NT2 → NT1 FR8                     c2 : all(f4, f5)
  NT1 → #feature #:
  FR8 → #op #( NT11 #, NT11 #)
  NT11 → #feature
  NT11 → FR8

4 Experimental Results
Using an evolutionary approach enhanced by grammar-specific heuristic operators and by better construction of the initial population we were able to infer grammars for small domain-specific languages
[8] such as video store [13], stock and sales [13], the simple desk calculation language DESK [9], and a
simplified version of the feature description language (FDL) [12]. The inferred grammars, as well as some other
parameters (G – number of generations until a solution is found, pop size – population size, N – number
of positive samples, M – number of negative samples), are shown in Table 1. Although the results are
promising, more research work still needs to be done; many ideas remain to be implemented and verified
in practice.
5 Conclusions and Future Work
Previous attempts at learning context-free grammars met with limited success on real examples.
We extend those works by introducing grammar-specific heuristic operators and by facilitating better
construction of the initial population, where knowledge from the positive samples is exploited. Our
future work involves exploring the use of data mining techniques in grammar inference, augmenting
the brute-force approach with heuristics, and investigating the Minimum Description Length (MDL)
approach for context-free grammar inference.
References
1. P. Dupont. Regular Grammatical Inference from Positive and Negative Samples by Genetic Search: The
GIG method. Proceedings of the 2nd International Colloquium on Grammatical Inference and Applications,
ICGI’94, LNAI, Vol. 862, pp. 236-245, 1994.
2. E.M. Gold. Language Identification in the Limit. Information and Control, Vol. 10, pp. 447-474, 1967.
3. J.R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press,
1992.
4. W.B. Langdon. Genetic Programming and Data Structures: Genetic Programming + Data Structures =
Automatic Programming! Kluwer Academic Publishers, 1998.
5. S. Lucas. Structuring Chromosomes for Context-Free Grammar Evolution. 1st International Conference on
Evolutionary Computing, pp. 130-135, 1994.
6. M. Mernik, G. Gerlič, V. Žumer, B. Bryant. Can a Parser be Generated from Examples? Proceedings of the
ACM Symposium on Applied Computing, Melbourne, pp. 1063-1067, 2003.
7. M. Mernik, M. Lenič, E. Avdičaušević, V. Žumer. LISA: An Interactive Environment for Programming
Language Development. 11th International Conference on Compiler Construction, LNCS, Vol. 2304, pp. 1-4,
2002.
8. M. Mernik, J. Heering, T. Sloane. When and how to develop domain-specific languages. CWI Technical
Report, SEN-E0309, 2003.
9. J. Paakki. Attribute Grammar Paradigms - A High-Level Methodology in Language Implementation. ACM
Computing Surveys, Vol. 27, No. 2, pp. 196-255, 1995.
10. Y. Sakakibara. Efficient Learning of Context-Free Grammars from Positive Structural Examples. Information and Computation, Vol. 97, pp. 23-60, 1992.
11. Y. Sakakibara, H. Muramatsu. Learning Context-Free Grammars from Partially Structured Examples.
Proceedings of the 5th International Colloquium on Grammatical Inference and Applications, ICGI’00, LNAI,
Vol. 1891, pp. 229-240, 2000.
12. A. van Deursen, P. Klint. Domain-Specific Language Design Requires Feature Descriptions. Journal for
Computing and Information Technology, Special issue on Domain-Specific Languages, Eds: R. Lämmel and
M. Mernik, Vol. 9, No. 4, pp. 1-17, 2002.
13. M. Varanda Pereira, M. Mernik, T. Kosar, V. Žumer, P. Henriques. Object-Oriented Attribute Grammar
based Grammatical Approach to Problem Specification. Technical Report, University of Braga, Department
of Computer Science, 2002.
14. P. Wyard. Representational Issues for Context Free Grammar Induction Using Genetic Algorithm. Proceedings of the 2nd International Colloquium on Grammatical Inference and Applications, LNAI, Vol. 862,
pp. 222-235, 1994.
15. M. Črepinšek, M. Mernik, V. Žumer. Extracting Grammar from Programs: Brute Force Approach. Submitted to ACM SIGPLAN Notices, 2004.
16. T. Bäck, D. Fogel, Z. Michalewicz. Handbook of Evolutionary Computation. Oxford University Press,
1996.
17. T. Kammeyer, R. K. Belew. Stochastic Context-Free Grammar Induction with a Genetic Algorithm Using
Local Search. Foundations of Genetic Algorithms IV, 1996.