Disclosure of Invention
Aiming at the technical problems existing in the prior art, the invention provides the token-level automatic generation method for the program input grammar, which has the advantages of simple implementation method, high grammar generation effectiveness and efficiency and complete grammar, and can remarkably improve the automatic test capability of receiving complex input software and improve the effectiveness and efficiency of an automatic test tool.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
A token-level program input grammar automatic generation method comprises the following steps:
s01, obtaining a character string conforming to a grammar of an analysis program as initial input and performing symbol execution, generating a token sequence set corresponding to the analysis program, and collecting the upper and lower boundary ranges of each token;
s02, carrying out regular expression generalization on the token sequence set to generate a regular expression;
S03, combining specified non-terminal symbols in the regular expression obtained through generalization, and converting the non-terminal symbols into a context-free grammar;
s04, using the upper and lower boundary ranges of the token as constraints, performing character generalization on a character level generation formula of each token in the context-free grammar to obtain final grammar output, wherein the character level generation formula is used for representing a value of a character level corresponding to each token.
Further, in the step S01, a test case automatic generation tool GADSE (Grammar-Agnostic Dynamic Symbolic Execution) is used to automatically generate a token sequence set that can be received by the parsing program and to collect the upper and lower bound ranges corresponding to each token.
Further, the regular expression generalization in the step S02 uses repeated generalization and alternate generalization, where the repeated generalization is to repeat a given substring in a grammar, and the alternate generalization is to decompose the substring in the grammar and form alternations.
Further, performing regular expression generalization in step S02 also uses exchange generalization by dividing the current expression L into three parts and exchanging the first and third parts therein.
Further, each generalization in the step S02 further comprises a step of testing each possible token regular expression, wherein the step comprises the steps of generating a representative token sequence, and then converting the generated token sequence into character-level input by using the upper and lower bound ranges of the token as constraints so as to test the validity of the current token sequence.
Further, the step S03 includes traversing all non-terminal symbols in the generalized regular expression, judging whether the non-terminal symbols are equivalent, if so, merging the equivalent non-terminal symbols into a non-terminal symbol, and the merged regular expression is the context-free grammar obtained by conversion.
Further, in the step S04, the upper limit and the lower limit of each token character are calculated by using SMT, and each token character is generalized to a value between the upper limit and the lower limit, so as to implement character generalization.
Further, after the character is generalized in the step S04, the method further includes placing the generalized character value into a sequence in which the token character appears and checking the validity of the character value to check the generalized character value.
A token-level program input grammar automatic generation device, comprising:
the symbol execution module is used for acquiring a character string conforming to the grammar of the analysis program as initial input and executing symbols, generating a token sequence set corresponding to the analysis program and collecting the upper and lower boundary ranges of each token;
The first generalization module is used for generalizing the regular expression of the token sequence set to generate a regular expression;
The second generalization module is used for merging specified non-terminal symbols in the regular expression obtained by generalization and converting the non-terminal symbols into context-free grammar;
And the third generalization module is used for generalizing characters of a character level generating formula of each token in the context-free grammar by using the upper and lower boundary ranges of the token as constraints to obtain final grammar output, wherein the character level generating formula is used for representing a value of a character level corresponding to each token.
A computer readable storage medium storing a computer program which when executed performs a method as described above.
Compared with the prior art, the invention has the advantages that:
1. The invention generates a token sequence for a program with a complex input format by using a program symbol irrelevant to a token, simultaneously collects the upper and lower boundary ranges of each token, synthesizes the input token by using the generated token sequence, firstly generalizes the token into a regular expression, then carries out merging operation on non-terminal symbols in the regular expression, converts the non-terminal symbols into a context-free grammar, then carries out character generalization on a character level value corresponding to each token in the context-free grammar by combining the upper and lower boundary ranges of the token values to further expand the grammar range, so as to obtain final grammar output, realize token-level grammar synthesis, and effectively improve the effectiveness and efficiency of grammar synthesis and the integrity of grammar generation based on token-level synthesis compared with the traditional character-level input learning mode.
2. The invention further provides a large amount of initial inputs with good distribution for grammar learning by combining the test case automatic generation technology and using GADSE, further improves the automation degree and effect of grammar synthesis, realizes grammar synthesis based on token level, and can save a large amount of partition time of GADSE by representing a character sequence in the analysis process and realize faster and higher-level grammar synthesis.
3. The invention further considers the generation rule of the non-terminal symbol, introduces new exchange generalization operation on the basis of repeated and alternate operation, and realizes the exchange generalization by dividing the expression into three parts and exchanging the first part and the third part in the expression, thereby further improving the precision of grammar generation.
4. The invention uses the symbols to execute the collected token constraint and converts the token constraint into the SMT optimization problem, so that the SMT optimization can be used for generalizing the possible character-level values of the token, and the grammar accuracy can be further improved.
Detailed Description
The invention is further described below in connection with the drawings and the specific preferred embodiments, but the scope of protection of the invention is not limited thereby.
As shown in fig. 1 and 2, the method for automatically generating a token-level program input grammar according to the present embodiment includes the steps of:
s01, obtaining a character string conforming to a grammar of an analysis program as initial input and performing symbol execution, generating a token sequence set corresponding to the analysis program, and collecting the upper and lower boundary ranges of each token;
S02, carrying out regular expression generalization on the token sequence set to generate a regular expression;
S03, combining specified non-terminal symbols in the generalization obtained regular expression, and converting the non-terminal symbols into a context-free grammar;
S04, using the upper and lower boundary ranges of the token as constraints, performing character generalization on the character level generation formula of each token in the context-free grammar to obtain final grammar output, wherein the character level generation formula is used for representing the value of the character level corresponding to each token.
In lexical analysis, the input is first marked as a token sequence, then the token sequence is checked according to the grammar rules in the grammar analysis, and the token sequence is closer to the generation formula in the grammar, so that for grammar learning, if the grammar can be learned at the token level, i.e. the grammar learning is performed based on the token sequence, the learning process is more efficient, and the result is more accurate and complete.
In the embodiment, a character string conforming to a grammar of an analysis program is used as initial input, symbol execution is performed on the initial input, so that a legal token sequence set corresponding to the analysis program is automatically generated, namely, a token sequence is generated for a program with a complex input format by using a program symbol execution independent of the grammar, the upper and lower bound ranges of each token are collected, then the input grammar is synthesized by using the generated token sequence, the token sequence is subjected to generalization operation firstly, non-terminal characters in the generalization obtained regular expression are subjected to merging operation again, the recursion attribute of the grammar is obtained, the context-free grammar is converted, and then the character-level value corresponding to each token in the context-free grammar is subjected to character generalization by combining the upper and lower bound ranges of the token value, so that the final grammar output is obtained, and the grammar synthesis of the token level can be realized based on symbol execution.
As a white box learning method, compared with the traditional character-level input learning method, the token-level grammar learning method based on symbol execution can effectively improve the effectiveness and efficiency of grammar synthesis, realize more effective grammar synthesis, improve the integrity of grammar generation based on token-level synthesis, and be particularly suitable for automatic input grammar learning generation of analysis programs with good grammar distribution and unavailable input samples.
In step S01 of this embodiment, the test case automatic generation tool GADSE is used to automatically generate a set of token sequences that can be received by the analysis program, and collect the upper and lower bound ranges corresponding to each token, and the upper and lower bound ranges corresponding to each token are used as constraint values corresponding to each token. By combining the automatic test case generation technology, GADSE can provide a large number of well-distributed initial inputs for grammar learning, and further improve the automation degree and effect of grammar synthesis.
Taking the example grammar as in fig. 3, where the upper part is the example grammar, the lower part is the parsing program of the grammar, which in effect implements a context dependent grammar, where the grammar in the example program (.+ -.) +, (..+ -.) is a regular expression symbol, < T > represents a single character, < ID > represents a capitalized identification symbol of length greater than 1, < Number > must be a single digit Number, and < Sep > represents a special split symbol. The grammar requires that the tags should have the same name, e.g., < a > AA < a > be valid, but < a > AA < b > be invalid. This may be accomplished using a recursive descent parser and using a stack to check if the tags have the same name.
In a specific application embodiment, an initial input "< a >; < a >" is first given, and then symbol execution is performed on the initial input using GADSE, so as to generate a legal token sequence set corresponding to the parsing program and constraint values corresponding to each token. There are 6 token values in the example program. GADSE first generate constraints for the token in the parser, where the token is 1 in length, as shown in FIG. 4. Then GADSE generates a path constraint at the token level and solves to obtain a token sequence. By generating a new token sequence to search the token-level path space GADSE, the path space of the parser can be explored more efficiently. As shown in FIG. 5, the upper portion represents the initially entered token-level path constraints and the lower portion represents GADSE exploration of the two valid token sequences generated by parsing the program path space.
In this embodiment, repeated (repetition) generalization and alternate (alternation) generalization are specifically used when regular expression generalization is performed, where the repeated generalization is to repeat a given substring (for example, aa may be generalized to a (a)) in a grammar, and the alternate generalization is to decompose the substring in the grammar and form alternates (for example, (ab) may be generalized to a (a|b)). For the token sequence generated in step S01, a regularization operation is applied to each token sequence with the token sequence as an initial input, and candidate regularization grammars are iteratively generated. The candidate grammars are then checked for correctness by generating new inputs from the candidate grammars and checking whether the new inputs are accepted by the parser. And continuing to apply generalization operation to the candidate grammar passing the inspection until generalization cannot be performed, obtaining a final regular grammar, and then proceeding to step S03.
In step S02 of this embodiment, a black box grammar synthesizer Glade is specifically used to perform regular expression generalization, where Glade starts from a set of valid inputs, and for each input i, glade gradually extends i to a regular expression e, where the generalization process is a series of regular expressions L0. For each 0.ltoreq.i.ltoreq.n-1,Each generalization attempts to expand Li, but at the same time avoids excessive generalization. Glade provides two generalization operations, repetition (iteration) and alternation (alternation), wherein Li is first divided into three parts when repeating generalization, i.e., li=α1α2α3, and then the second part and the third part are repeated as follows:
α1([α2]alt)*[α3]rep (1)
wherein the expressions in [ ] rep, [ ] alt respectively denote that repetition, alternation, is to be used to generalize the expressions respectively.
Since there may be multiple partitioning scenarios, it is further possible to check new regular expressions by generating several strings for the new expression and checking whether these strings are acceptable to the program. If all strings are valid, the new expression is taken as a candidate generalization expression. For example, for the example program, assuming i is < a > AA;0<a > and using repetition to generalize i, the following four candidate rules can be derived and checked:
<a>([AA;]alt)*[0<a>]rep (2)
<a>([A]alt)*[A;0<a>]rep
<a>A([A]alt)*[;0<a>]rep
<a>AA([;0]alt)*[<a>]rep
Glade further selects the repeat of greatest length, i.e., the first expression (AA; length 3), as the generalization expression.
The alternating operation is to divide Li into two parts, i.e. li=a1α2, and recursively apply repetition and alternation to the first part and the second part, respectively, i.e. the rule is:
[α1]rep+[α2]alt (3)
As with the repeated operation, the present embodiment also examines different candidates at the time of the alternate operation, and selects one of the candidates as the candidate generalization expression li+1. For example, for the expression in formula (2), if AA cannot be separated, only the expression needs to be repeated to obtain the expression:
<a>([AA;]rep)*[0<a>]rep (4)
for each generalization step, using Glade can ensure that Li is a subset of li+1, and can also be made as less extensive as possible by examining the possible generalization expressions. Under the default condition, the initial input i is firstly generated repeatedly by AA, and the generalization process of O is as follows:
The above process has a total of 9 steps, based on which Glade generates the following for the final regular expression:
S:=<a>(S1)*0<a> S1:=(A)+; (5)
and Glade can also generalize a single character by trying all possible characters, e.g., the context free grammar described above can be generated to the following grammar:
S:=<a>(S1)*(0|...|9)<a> S1:=(A|...|Z)+(:|;) (6)
The final context-free grammar described above approximates the grammar in fig. 2. Glade can generalize one grammar for each input for multiple initial valid inputs and unify the generated grammars into one as a result. Glade, however, needs to assume that the initial input is valid, the distribution of valid inputs also determines the integrity of the integrated grammar, and if no grammar is entered, it is difficult to obtain well-distributed valid inputs, such as inputs that cover all production rules and grammar terminators. The present embodiment can obtain a well-distributed effective input by using Dynamic Symbol Execution (DSE) as an input for program generation. Since Glade is synthesized on a character level, in each generalization step, there are many possible partitions at the character level, so Glade requires much time to check the validity of each partition, which also limits the generalization capability of Glade, in the embodiment, grammar synthesis is realized based on a token level, one token represents a character sequence in the parsing process, and Glade does not need to partition the character string of the token, so that a large amount of partition time can be saved, and faster and higher-level grammar synthesis can be realized.
The production rules considering non-terminal < Content > are:
<Cpmtent>:=<Value>(<Sep><Content>)* (7)
there are two cases of < value >, namely, a sign or a number.
If the current expression is AA, 0, the current generalization operation of Glade limits the generalization capability because only sub-expressions (e.g., AA; or 0) can be repeated, however if generalization is desired to the following expression:
<Value>(<Sep><Content>)* (8)
It is desirable to have both 0 and AA (or AA; and 0), which is obtained by exchanging the first part and the third part. Thus, the generalization range can be extended by exchanging the first and third parts, generalizing AA, regular expressions of 0 to < content >.
The present embodiment uses the above feature to perform regular expression generalization in step S02, and also uses exchange ([ ] exh) generalization, that is, introduces an exchange generalization operation to perform generalization, where the exchange generalization is represented by dividing the current expression L into three parts and exchanging the first part and the third part thereof:
((([α1]rep+[α3]rep)[α2]rep)*([α1]rep+[α3]rep)) (9)
For example, using a swap operation, a token sequence may be generalized to the following three candidate canonical token expressions:
LTR([IS]alt)*[NLTR]rep (10)
LTRI([SN]alt)*[LTR]rep
LTR([ISN]exh)[LTR]rep
For simplicity, in the above expression we use the first letter of the second part of each token value to represent a token, e.g., S stands for T_Sep.
The embodiment further includes examining each possible token regular expression in each generalization step described above, specifically including first generating a representative token sequence and then converting the token sequence into character-level input by token value constraints (upper and lower bound ranges) and character-level constants (collected in GADSE) to examine its validity. For example, LTRNSILTR (used to test the last expression) and LTRISISNLTR are converted to < a >0, AA < a > and < a > AA, AA;0<a >, respectively, which are all accepted by the parser, i.e., are all valid.
In this embodiment, after checking the token validity, the generalization process of preferentially selecting the exchange generalization to continue generalization, LTRISNLTR may be specifically expressed as:
Wherein each letter represents a token corresponding to a first letter of the second portion of each token value.
The following context-free grammar can be obtained according to the steps:
S:=(LTR S1 LTR)S1:=(S2 S)*S2 S2:=S3|S4 S3:=I S4:=N
In this embodiment, step S03 further merges non-terminal symbols in the regular expression in step S02 to convert the non-terminal symbols into context-free related grammar, and the steps include traversing all non-terminal symbols in the regular expression obtained by generalization, judging whether the non-terminal symbols are equivalent, if so, merging the equivalent non-terminal symbols into one non-terminal symbol, and the merged regular expression is the context-free grammar obtained by conversion.
As in the context-free grammar described above, S1, S2, S3, and S4 are combinable and can be replaced with S1. In addition, for the first generating rule, S and S1 are combinable, but S and S2 are not combinable, so the present embodiment combines S and S1 only in the first rule, and the following syntax can be obtained after the combination:
S:=(LTR S LTR)|S1 S1:=S1SS1|I|N。
In the embodiment, step S04 further expands the grammar range by performing character generalization on each token non-terminal symbol in the grammar result in step S3 in combination with the upper and lower limits of the token value, so as to obtain the final grammar. For each token character t, using a grammar combination to generalize possible character values, and after character generalization, further comprising placing the generalization character values into sequences in which token characters appear and checking the validity of the character values to check the generalization character values.
In step S04 of this embodiment, the upper limit and the lower limit of each token character are calculated specifically using SMT, and each token character is generalized to a value between the upper limit and the lower limit, so as to implement character generalization. By performing the collected token constraints using symbols and translating to SMT optimization problems, the SMT optimization can be utilized to generalize the possible character-level values of the token, which can further improve the accuracy of the grammar.
For example, token I has a character value of "AA", which can be generalized to regular expression (A) +, and further, the character level value of each token is generalized by the token constraint collected in the first stage Gadse, the upper and lower limits of each character are calculated using SMT optimization theory, and the token is generalized to a value between the upper and lower limits. For example, for a token value L, it is constrained to be s [0] = '<'; therefore, its value can only be '<'. The constraint for the token value N is s [0 ]. Gtoreq.0 '. Gtoreq.s [0 ]. Gtoreq.9', so that the value of N is between '0' and '9'. Finally, based on the above steps, a context-free grammar of the following formula (12) can be synthesized, which is equivalent to the grammar in fig. 4.
In a specific application embodiment, taking the last token sequence in fig. 3 as an example for synthesis, the detailed steps for synthesizing the token sequence include:
Step 1, repeating, replacing and exchanging the last token sequence, namely LTRILTR, based on the synthesis of the token;
Step 2, testing each possible token regular expression in each generalization step, namely firstly generating a representative token sequence, and then converting the token sequence into character-level input through token value constraint and constant to test validity.
And step 3, carrying out merging operation on non-terminal symbols in the regular expression obtained after regularization to obtain recursive attributes of the grammar, and converting the recursive attributes into context-free grammar.
And 4, generalizing the character level value of each token value by using the token constraint (upper and lower bound range) collected in the step S01 through GADSE, wherein the upper bound and the lower bound of each character in the token are calculated by using an SMT optimization theory, the token is generalized into a value between the lower bound and the upper bound, and the final context-free grammar output can be synthesized based on the token sequence and the token value constraint.
In a specific application embodiment, the above grammar generation is implemented on a JPF-based grammar-independent concolic execution engine and Glade, which is applied to a certain Java parser, and the obtained grammar result is shown in fig. 6. From fig. 6, it can be seen that the grammar finally generated by the method is consistent with the target grammar, and the method can greatly improve the efficiency of grammar synthesis and the integrity of the grammar compared with the conventional grammar constraint based on the character level. Specifically, compared with state-of-the-arts (i.e. Glade and Arvada) equipped with character-level symbol execution, the F1 fraction of the method is respectively improved by 5.24 times and 3.54 times, and compared with Arvada, the method can realize 141.93 times time acceleration in grammar synthesis, i.e. the method can effectively improve the precision and recall rate of grammar generation.
The token-level program input grammar automatic generation device of the embodiment comprises:
the symbol execution module is used for acquiring a character string conforming to the grammar of the analysis program as initial input and executing symbols, generating a token sequence set corresponding to the analysis program and collecting the upper and lower boundary ranges of each token;
the first generalization module is used for carrying out regular expression generalization on the token sequence set to generate a regular expression;
the second generalization module is used for merging specified non-terminal symbols in the regular expression obtained by generalization and converting the non-terminal symbols into context-free grammar;
And the third generalization module is used for generalizing characters of a character level generating formula of each token in the context-free grammar by using the upper and lower boundary ranges of the token as constraints to obtain final grammar output, wherein the character level generating formula is used for representing a value of a character level corresponding to each token.
The automatic generating device of the token-level program input grammar and the automatic generating method of the token-level program input grammar are in one-to-one correspondence, and are not described in detail herein.
The present embodiment also provides a computer-readable storage medium storing a computer program which, when executed, implements the above method.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. While the invention has been described with reference to preferred embodiments, it is not intended to be limiting. Therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention shall fall within the scope of the technical solution of the present invention.