
Extracting Grammar from Programs: Evolutionary Approach

Matej Črepinšek (1), Marjan Mernik (1), Faizan Javed (2), Barrett R. Bryant (2), and Alan Sprague (2)

(1) University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia
{matej.crepinsek, marjan.mernik}@uni-mb.si

(2) The University of Alabama at Birmingham, Department of Computer and Information Sciences, Birmingham, AL 35294-1170, U.S.A.
{javedf, bryant, sprague}@cis.uab.edu

Abstract. The paper discusses context-free grammar (CFG) inference using genetic programming, with application to inducing grammars from programs written in simple domain-specific languages. Grammar-specific heuristic operators and non-random construction of the initial population are proposed to achieve this task. The suitability of the approach is shown on small examples where the underlying CFGs are successfully inferred.

Keywords. Grammar induction, grammar inference, learning from positive and negative examples, genetic programming

1 Introduction

In the accompanying paper [15] we discussed the search space of regular and context-free grammar inference. The conclusion reached was that, owing to the large search space, the exhaustive (brute-force) approach to grammar induction can only be applied to small positive samples. Hence a need arose for a different, more efficient way to explore the search space. Evolutionary computation [16] is particularly suitable for such problems; in fact, genetic algorithms have already been applied to the grammar inference problem, with varying results. In this paper another evolutionary approach to CFG learning, genetic programming (GP), is presented. Genetic programming [3] is a successful technique for getting computers to solve problems automatically. It has been used in a wide variety of application domains, such as data mining, image classification, and robotic control. In general, genetic programming works well for problems whose solutions can be expressed as a modestly short program. For example, methods working on typical data structures such as stacks, queues, and lists have been successfully evolved with genetic programming in [4]. Specifications (BNF) of domain-specific languages are small enough that we can expect genetic programming to find a successful solution. Our previous work [6] succeeded in inferring small context-free grammars from positive and negative samples. This paper elaborates on our recent research findings and builds on that work.

2 Related Work

The impact of different representations of grammars was explored in [14], where experimental results showed that an evolutionary algorithm using standard context-free grammars (BNF) outperforms those using Greibach Normal Form (GNF), Chomsky Normal Form (CNF), or bit-string representations [5]. This performance difference was attributed to the larger grammar search space of the other representations, a consequence of their more complex grammar forms. The experimental assessment in [14] was very limited due to the large processing time (processing one generation took several hours; using our system, processing one generation takes just a few seconds). This was due to the use of a chart parser, which is commonly used in natural language parsing and can also accept ambiguous grammars.
With this approach a grammar was successfully inferred for the language of correctly balanced and nested brackets. In [1] a genetic algorithm was applied to the problem of merging states in the prefix-tree automaton of regular grammar inference; it was shown that the genetic algorithm performs as well as other regular grammar inference algorithms (e.g., RPNI). Variable-length chromosomes with introns were used for stochastic context-free grammar induction in [17]. A genetic algorithm was also used on the problem of labelling nonterminals (the partitioning problem of nonterminals) in context-free grammar inference from completely structured [10] or partially structured samples [11].

3 Genetically Generated Grammars

3.1 Previous work

This section gives a short overview of our previous work; for more details see [6]. To infer context-free grammars for domain-specific languages, the genetic programming approach was adopted. In genetic programming, a program is constructed from a terminal set T and a user-defined function set F. The set T contains variables and constants, and the set F contains functions that are believed a priori to be useful for the problem domain. In our case, the set T consists of terminal symbols defined by regular expressions and the set F consists of nonterminal symbols. From these two sets appropriate grammars can be evolved; a grammar can itself be seen as a domain-specific language for expressing syntax. For effective use of an evolutionary algorithm we have to choose a suitable representation of the problem, suitable genetic operators and parameters, and an evaluation function that determines the fitness of chromosomes. For the encoding of a grammar into a chromosome we used a direct encoding as a list of BNF production rules, as suggested in [14], since this encoding outperforms bit-string representations. Our earlier GP system starts with a population of randomly generated grammars [6], where the following additional control parameters prevent grammars from becoming too large:

– max_prod_size: maximum number of productions in one grammar,
– max_RHS_size: maximum number of right-hand-side symbols in one production.

Furthermore, a specific one-point crossover, a mutation operator, and heuristic operators have been proposed as genetic operators. The one-point crossover is performed in the following manner: two grammars are chosen randomly and are cut at the same random position; the second halves are then swapped between the two grammars. To ensure that both offspring are legal grammars, the breakpoint cannot fall in the middle of a production rule, and its position is chosen randomly within the smaller of the two grammars selected for crossover. An example of the crossover operation is presented in Figure 1. After crossover, grammars undergo mutation, where a symbol in a randomly chosen production is mutated. An example of the mutation operator is presented in Figure 2; both operators are sketched in code below.

[Fig. 1. The crossover operator]

[Fig. 2. The mutation operator]
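The following minimal sketch (ours, illustrative, not the authors' implementation) shows this chromosome encoding and the two operators in Python. The toy grammars are in the spirit of Figure 1; all function and variable names are our own.

import random

# A grammar chromosome is a list of productions (lhs, rhs),
# e.g. ("E", ["#int", "T"]); an empty rhs encodes epsilon.

def one_point_crossover(g1, g2, rng=random):
    """Swap the tails of two grammars at a shared production boundary,
    so both offspring remain legal lists of productions."""
    # The breakpoint is drawn from the smaller grammar (assumed to have
    # at least two productions), so it is valid for both parents and
    # never falls inside a production rule.
    point = rng.randint(1, min(len(g1), len(g2)) - 1)
    return g1[:point] + g2[point:], g2[:point] + g1[point:]

def mutate(grammar, symbols, rng=random):
    """Replace one randomly chosen right-hand-side symbol by another
    terminal or nonterminal symbol."""
    g = [(lhs, rhs[:]) for lhs, rhs in grammar]   # copy the chromosome
    lhs, rhs = g[rng.randrange(len(g))]
    if rhs:                                       # epsilon rules: nothing to mutate
        rhs[rng.randrange(len(rhs))] = rng.choice(symbols)
    return g

# Toy example in the spirit of Figure 1:
g1 = [("E", ["#int", "T"]), ("T", ["#operator", "E"]), ("T", [])]
g2 = [("E", ["T", "E"]), ("T", ["#int"]), ("T", [])]
offspring1, offspring2 = one_point_crossover(g1, g2)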
To enhance the search, the following heuristic operators have been proposed:

– the option operator,
– the iteration* operator, and
– the iteration+ operator,

which exploit knowledge about grammars, namely extended BNF (EBNF), where grammar symbols often appear optionally or iteratively. The heuristic operators work in a similar manner to the mutation operator: a symbol in a randomly chosen production is made to appear optionally or iteratively (a sketch of all three transformations is given at the end of this subsection). An example of the option operator is presented in Figure 3; similar transformations are performed by the iteration* and iteration+ operators.

[Fig. 3. The option operator]

To ensure that after crossover, mutation, and deletion a chromosome still represents a legal grammar, a special procedure detects and eliminates non-reachable and superfluous nonterminal symbols. Chromosomes are evaluated at the end of each generation by testing each grammar on a set of positive and negative samples. For each grammar in the population an LR(1) parser is automatically generated using the compiler generator tool LISA [7]; the generated parser is then run on the fitness cases (Fig. 4). A grammar's fitness value is proportional to the length of the correctly parsed positive samples; the goal is a grammar which accepts all positive samples and rejects all negative samples. Many grammars can be concocted which reject the negative samples, but the search converges towards the desired grammar mainly through grammars which accept the positive samples; hence it would be misguided to search only in the space of grammars that reject the negative samples. Negative samples are therefore taken into account only once a grammar accepts all positive samples. Another reason is that negative samples are needed mainly to prevent overgeneralization of grammars [2]. Keeping these facts in view, the fitness value of each grammar is defined to lie between 0 and 1, where the interval 0..0.5 means that the grammar did not recognize all positive samples, and the interval 0.5..1 means that the grammar recognized all positive samples but did not reject all negative samples. A fitness value of 1 signifies that the generated LR(1) parser successfully parsed all positive samples and rejected all negative samples. For a given grammar[i], its fitness f_j(grammar[i]) on the j-th fitness case is defined as

    f_j(grammar[i]) = s / (2 * length(program_j)),
    where s = length(the successfully parsed part of program_j).

The total fitness over the N positive fitness cases is

    f(grammar[i]) = (1/N) * sum_{k=1..N} f_k(grammar[i]).

A grammar is tested on the negative sample set only if it successfully recognizes all positive samples. Here the portion of a negative sample that is successfully parsed is not important; the fitness value is then defined as

    f(grammar[i]) = 1.0 - m / (2 * M),
    where m = number of recognized (wrongly accepted) negative samples,
          M = number of all negative samples.

[Fig. 4. The evaluation of chromosomes: the LISA compiler generator produces a parser for each grammar in the population, the parser is run on each fitness case, and the parsing success yields the fitness value that drives selection, crossover, and mutation in the evolutionary process.]
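As a concrete reading of these definitions, here is a minimal Python sketch (ours, not the authors' code). Here `parse` stands in for the generated LR(1) parser and is assumed to return the length of the longest successfully parsed prefix of a tokenized program; the LISA-generated parser itself is not modelled.

def fitness(parse, positives, negatives):
    """Fitness in [0, 1]: 0..0.5 while some positive sample fails,
    0.5..1 once all positives parse, 1.0 when all negatives are rejected."""
    # f_j = s / (2 * length(program_j)): each positive case contributes
    # at most 0.5, reached only when the whole program parses.
    per_case = [parse(p) / (2 * len(p)) for p in positives]
    f = sum(per_case) / len(per_case)        # mean over the N positive cases
    if f < 0.5:                              # not all positive samples recognized
        return f
    # All positives recognized: only acceptance/rejection of negatives matters.
    m = sum(1 for n in negatives if parse(n) == len(n))  # wrongly accepted
    return 1.0 - m / (2 * len(negatives))    # 1.0 iff all negatives rejected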
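The option, iteration* and iteration+ operators described earlier in this subsection can be sketched in the same spirit. This is an illustrative reconstruction under one plausible reading of the operators (a chosen right-hand-side symbol X is routed through a fresh nonterminal F), not the authors' exact transformation:

import random
from itertools import count

_fresh = count(1)

def _rewrite_symbol(grammar, f_rules, rng=random):
    """Pick one right-hand-side symbol X, replace it by a fresh
    nonterminal F, and append the productions f_rules(F, X)."""
    g = [(lhs, rhs[:]) for lhs, rhs in grammar]
    spots = [(i, j) for i, (_, rhs) in enumerate(g) for j in range(len(rhs))]
    if not spots:
        return g                              # only epsilon rules: nothing to do
    i, j = rng.choice(spots)
    f, x = "F%d" % next(_fresh), g[i][1][j]
    g[i][1][j] = f
    return g + f_rules(f, x)

def option(grammar, rng=random):              # X becomes optional
    return _rewrite_symbol(grammar, lambda f, x: [(f, [x]), (f, [])], rng)

def iteration_star(grammar, rng=random):      # X repeats zero or more times
    return _rewrite_symbol(grammar, lambda f, x: [(f, [x, f]), (f, [])], rng)

def iteration_plus(grammar, rng=random):      # X repeats one or more times
    return _rewrite_symbol(grammar, lambda f, x: [(f, [x, f]), (f, [x])], rng)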
Experiments performed in [6] show that with the described approach context-free grammars can be inferred, but only grammars with a small number of productions (e.g., up to 5). For example, we were able to infer the context-free grammar of a language for simple robot movement [6]:

NT1 -> #Begin NT2 #End
NT2 -> #Command NT2
NT2 -> epsilon

or of a language for nested begin-end blocks:

NT1 -> #Begin NT2 #End
NT2 -> NT1 NT2
NT2 -> epsilon

On examples where the underlying context-free grammar was more complex, this approach was not successful.

3.2 New ideas and approaches

It was observed that the heuristic operators considerably improved the search process, resulting in fitter induced grammars. Currently we are using slightly modified versions of the aforementioned heuristic operators: a sequence of right-hand-side symbols (not just a single symbol) can be made to appear optionally or iteratively. However, we were still unable to induce bigger grammars, despite the fact that the respective sub-grammars had previously been induced in separate runs. For example, we were successful in finding a context-free grammar for simple expressions, and for simple assignment statements where the right-hand expression can only be a numeric value; but when both sub-languages were combined, our earlier approach failed to find a solution.

Example: the evolution of a grammar for assignment statements with simple arithmetic expressions on the right-hand side.

Control parameters:
G = 50, pop_size = 500, pc = 0.4, pm = 0.4, pheuristic = 0.2, max_prod_size = 8, max_RHS_size = 5

Positive cases (N = 4):
a := 9 + 2
a := 10 + 2 + 12
abc := 22
i := 0
j := 1
abc := 22 + 3
i := 10 + 2 + 3
j := 1

Negative cases (M = 5):
22 := d
d :=
:= 32
:= 2
a := +6

Upon closer analysis of our results, it became clear that the randomly generated initial population was an impediment to the induction process. The search space of all possible grammars is vast; to narrow it, the initial population should exploit knowledge from the positive samples by generating a few valid derivation trees through simple composition of consecutive symbols. For example, Fig. 5 presents one possible derivation tree, and the corresponding context-free grammar, for the positive sample a := 9 + 2 of the aforementioned example. During this process a sequence of nonterminal symbols which appears iteratively can be detected and an appropriate grammar constructed (see Figs. 6 and 7). Apart from this change in the construction of the initial population, the genetic programming system remained the same. Using this simple enhancement we were able to induce a context-free grammar for this example (Fig. 8).

[Fig. 5. One possible derivation tree for positive sample a := 9 + 2]

[Fig. 6. Construction of the derivation tree for positive sample a := 10 + 2 + 12 up to the point where iteration of nonterminal NT6 was detected]

Yet in some other cases inferring the underlying context-free grammar was still not successful, simply because composition of consecutive symbols is not always correct. What we need is to identify sub-languages and construct derivation trees for the sub-programs first, but this is as hard as the original problem. Since using completely structured [10] or partially structured samples [11] is impractical, we use an approximation: frequent sequences. A string of symbols is called a frequent sequence if it appears at least θ times, where θ is some preset threshold. Consider the above example of assignment statements with simple arithmetic expressions on the right-hand side. Some frequent sequences of length 2 in the 4 positive cases are:

pair            occurrences
#id #oper=      8
#oper= #int     8
#int #oper+     6
#oper+ #int     6

Our basic idea is to construct an initial derivation tree in which frequent sequences are recognized by a single nonterminal. For example, we might adjoin the productions FR1 → #id #oper= or FR2 → #int #oper+ to the initial grammar and then construct a valid derivation tree by composition of consecutive symbols (both steps are sketched below).
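A minimal sketch of this frequent-sequence detection for length-2 sequences, assuming the positive samples are already tokenized into lexical symbols such as #id, #oper= and #int (the code and names are ours):

from collections import Counter

def frequent_pairs(samples, theta):
    """Return the adjacent token pairs occurring at least theta times
    across all tokenized samples."""
    counts = Counter()
    for tokens in samples:
        counts.update(zip(tokens, tokens[1:]))   # all adjacent pairs
    return {pair: n for pair, n in counts.items() if n >= theta}

# Tokenized "a := 9 + 2" is ["#id", "#oper=", "#int", "#oper+", "#int"];
# with theta = 6, the result would include the four pairs tabulated above.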
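And a minimal sketch of the non-random initial-population construction itself, under our illustrative reading of Figures 5-8: frequent pairs are first folded into FR nonterminals, and the remaining sequence is then composed pairwise into fresh NT nonterminals until a single start symbol derives the whole sample. Iteration detection (Figs. 6-7) is omitted, and this builds only one of the possible derivation trees.

from itertools import count

def initial_grammar(tokens, frequent, ids=None):
    """Build one valid derivation, as a grammar, for a tokenized sample."""
    ids = ids or count(1)
    grammar, seq = [], []
    i = 0
    while i < len(tokens):                   # fold frequent pairs into FR rules
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
            fr = "FR%d" % next(ids)
            grammar.append((fr, [tokens[i], tokens[i + 1]]))
            seq.append(fr)
            i += 2
        else:
            seq.append(tokens[i])
            i += 1
    while len(seq) > 1:                      # compose consecutive symbols
        nt = "NT%d" % next(ids)
        grammar.append((nt, [seq[0], seq[1]]))
        seq[:2] = [nt]
    return seq[0], grammar                   # start symbol and its productions

# For ["#id", "#oper=", "#int", "#oper+", "#int"] and the two frequent pairs
# named above, this yields FR1 -> #id #oper=, FR2 -> #int #oper+,
# NT3 -> FR1 FR2 and NT4 -> NT3 #int, with start symbol NT4.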
[Fig. 7. One possible derivation tree for positive sample a := 10 + 2 + 12]

[Fig. 8. The correct context-free grammar, found in generation 21]

4 Experimental Results

Using an evolutionary approach enhanced by grammar-specific heuristic operators and by better construction of the initial population, we were able to infer grammars for small domain-specific languages [8] such as video store [13], stock and sales [13], the simple desk calculator language DESK [9], and a simplified version of the feature description language (FDL) [12]. The inferred grammars, together with the run parameters (G – number of generations until a solution is found, pop_size – population size, N – number of positive samples, M – number of negative samples), are shown in Table 1. Although the results are promising, more research work still needs to be done, and many ideas remain to be implemented and verified in practice.

Table 1. Inferred grammars for some DSLs.

video store (G = 28, pop_size = 300, N = 8, M = 2)
Inferred grammar:
NT15 → NT11 NT7 NT15
NT15 → ε
NT11 → NT10 NT6
NT10 → NT5 NT10
NT10 → ε
NT7 → NT5 NT7
NT7 → ε
NT6 → #name #days
NT5 → #title #type
An example of a positive sample:
jurassicpark child
roadtrip reg
ring new
andy 3
jurassicpark child
roadtrip reg
ann 3
ring new

stock and sales (G = 1, pop_size = 300, N = 10, M = 3)
Inferred grammar:
NT6 → NT5 NT3
NT5 → NT4 #sales
NT4 → #stock NT2
NT2 → FR2 NT2
NT2 → ε
FR2 → #item #price #qty
NT3 → FR0 NT3
NT3 → ε
FR0 → #item #price
An example of a positive sample:
stock description
twix 0.70 10
mars 0.65 12
bar 1.09 5
sales description
mars 0.65
twix 0.70
mars 0.65

DESK (G = 1, pop_size = 300, N = 5, M = 3)
Inferred grammar:
NT7 → NT6 NT3
NT6 → NT5 #where
NT5 → NT4 NT2
NT4 → #print #id
NT3 → FR4 NT3
NT3 → ε
NT2 → FR3 NT2
NT2 → ε
FR4 → #id #assign #int
FR3 → #plus #id
An example of a positive sample:
print a + b
where a = 10 b = 20

FDL (G = 127, pop_size = 300, N = 8, M = 4)
Inferred grammar:
NT7 → NT2 NT7
NT7 → ε
NT2 → NT1 FR8
NT1 → #feature #:
FR8 → #op #( NT11 #, NT11 #)
NT11 → #feature
NT11 → FR8
An example of a positive sample:
c : all(c1, more-of(f4, f5))
c1 : one-of(f1, c2)
c2 : all(f4, f5)

5 Conclusions and Future Work

Previous attempts at learning context-free grammars achieved limited success on real examples. We extend those works by introducing grammar-specific heuristic operators and by better construction of the initial population, in which knowledge from the positive samples is exploited.
Our future work involves exploring the use of data mining techniques in grammar inference, augmenting the brute-force approach with heuristics, and investigating the Minimum Description Length (MDL) approach to context-free grammar inference.

References

1. P. Dupont. Regular Grammatical Inference from Positive and Negative Samples by Genetic Search: The GIG Method. Proceedings of the 2nd International Colloquium on Grammatical Inference and Applications, ICGI'94, LNAI, Vol. 862, pp. 236-245, 1994.
2. E.M. Gold. Language Identification in the Limit. Information and Control, Vol. 10, pp. 447-474, 1967.
3. J.R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
4. W.B. Langdon. Genetic Programming and Data Structures: Genetic Programming + Data Structures = Automatic Programming! Kluwer Academic Publishers, 1998.
5. S. Lucas. Structuring Chromosomes for Context-Free Grammar Evolution. 1st IEEE Conference on Evolutionary Computation, pp. 130-135, 1994.
6. M. Mernik, G. Gerlič, V. Žumer, B. Bryant. Can a Parser be Generated from Examples? Proceedings of the ACM Symposium on Applied Computing, Melbourne, pp. 1063-1067, 2003.
7. M. Mernik, M. Lenič, E. Avdičaušević, V. Žumer. LISA: An Interactive Environment for Programming Language Development. 11th International Conference on Compiler Construction, LNCS, Vol. 2304, pp. 1-4, 2002.
8. M. Mernik, J. Heering, T. Sloane. When and How to Develop Domain-Specific Languages. CWI Technical Report SEN-E0309, 2003.
9. J. Paakki. Attribute Grammar Paradigms - A High-Level Methodology in Language Implementation. ACM Computing Surveys, Vol. 27, No. 2, pp. 196-255, 1995.
10. Y. Sakakibara. Efficient Learning of Context-Free Grammars from Positive Structural Examples. Information and Computation, Vol. 97, pp. 23-60, 1992.
11. Y. Sakakibara, H. Muramatsu. Learning Context-Free Grammars from Partially Structured Examples. Proceedings of the 5th International Colloquium on Grammatical Inference and Applications, ICGI'00, LNAI, Vol. 1891, pp. 229-240, 2000.
12. A. van Deursen, P. Klint. Domain-Specific Language Design Requires Feature Descriptions. Journal of Computing and Information Technology, Special Issue on Domain-Specific Languages, Eds.: R. Lämmel and M. Mernik, Vol. 9, No. 4, pp. 1-17, 2002.
13. M. Varanda Pereira, M. Mernik, T. Kosar, V. Žumer, P. Henriques. Object-Oriented Attribute Grammar Based Grammatical Approach to Problem Specification. Technical Report, University of Braga, Department of Computer Science, 2002.
14. P. Wyard. Representational Issues for Context-Free Grammar Induction Using Genetic Algorithms. Proceedings of the 2nd International Colloquium on Grammatical Inference and Applications, ICGI'94, LNAI, Vol. 862, pp. 222-235, 1994.
15. M. Črepinšek, M. Mernik, V. Žumer. Extracting Grammar from Programs: Brute Force Approach. Submitted to ACM SIGPLAN Notices, 2004.
16. T. Bäck, D. Fogel, Z. Michalewicz (Eds.). Handbook of Evolutionary Computation. Oxford University Press, 1996.
17. T. Kammeyer, R.K. Belew. Stochastic Context-Free Grammar Induction with a Genetic Algorithm Using Local Search. Foundations of Genetic Algorithms IV, 1996.