[go: up one dir, main page]

CN114706760B - A token-level method for automatically generating program input grammar - Google Patents

A token-level method for automatically generating program input grammar Download PDF

Info

Publication number
CN114706760B
CN114706760B CN202210259489.9A CN202210259489A CN114706760B CN 114706760 B CN114706760 B CN 114706760B CN 202210259489 A CN202210259489 A CN 202210259489A CN 114706760 B CN114706760 B CN 114706760B
Authority
CN
China
Prior art keywords
token
grammar
generalization
character
regular expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210259489.9A
Other languages
Chinese (zh)
Other versions
CN114706760A (en
Inventor
陈振邦
王戟
罗运来
潘威宇
毛晓光
董威
李姗姗
文艳军
陈立前
刘万伟
尹良泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210259489.9A priority Critical patent/CN114706760B/en
Publication of CN114706760A publication Critical patent/CN114706760A/en
Application granted granted Critical
Publication of CN114706760B publication Critical patent/CN114706760B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Prevention of errors by analysis, debugging or testing of software
    • G06F11/3668Testing of software
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/43Checking; Contextual analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

本发明公开一种token级的程序输入文法自动生成方法,步骤包括:S01.获取一个符合解析程序文法的字符串作为初始输入并进行符号执行,生成对应解析程序的token序列集合并收集每个token的上下界范围;S02.对token序列集合进行正则表达式泛化,生成正则表达式;S03.将泛化得到的正则表达式中指定非终结符进行合并,转化为上下文无关文法;S04.使用所述token的上下界范围作为约束,对所述上下文无关文法中每个token的字符级产生式进行字符泛化,得到最终的文法输出。本发明具有实现方法简单、生成有效性与效率高且文法完整等优点。

The present invention discloses a token-level program input grammar automatic generation method, the steps include: S01. obtaining a string that conforms to the grammar of the parsing program as the initial input and performing symbolic execution, generating a token sequence set corresponding to the parsing program and collecting the upper and lower bounds of each token; S02. performing regular expression generalization on the token sequence set to generate a regular expression; S03. merging the designated non-terminal symbols in the generalized regular expression to convert it into a context-free grammar; S04. using the upper and lower bounds of the token as constraints, performing character generalization on the character-level production of each token in the context-free grammar to obtain the final grammar output. The present invention has the advantages of simple implementation method, high generation effectiveness and efficiency, and complete grammar.

Description

Token-level program input grammar automatic generation method
Technical Field
The invention relates to the technical field of automatic program testing, in particular to a token-level automatic generation method for a program input grammar.
Background
During the software automatic test process, the software typically accepts only certain inputs in a particular format, and if one input does not meet the format requirements, it will be rejected by the software and an input error report may be generated. For complex structured input formats, there will typically be some input grammar that can be specified by regular expressions or context-free grammars, such as programming language grammars. However, the input grammar may not be available, and even if there are some descriptions of the input grammar, such as natural language descriptions, the grammar may be incomplete and rarely in machine-readable format. When input grammars for software with complex input formats are not available, automatic testing of the software becomes very challenging. Such as during symbol execution and fuzzy testing, a large number of inputs may be generated that do not meet the input format requirements and are rejected by the software, rendering the testing process ineffective or inefficient, e.g., resulting in low code coverage or undetectable program errors, particularly functional errors. If the input is rejected during the format check phase, the software will not execute the test function code. Therefore, the automatic testing capability of receiving complex input software can be improved through grammar learning, the effectiveness and efficiency of an automatic testing tool can be remarkably improved through grammar learning, and for example, a learned grammar can be used for carrying out gray box fuzzy test based on the grammar or symbol execution facing the grammar.
It is challenging to automatically integrate input grammars for complex parsing programs, and traditional grammar integration methods have different feasibility, effectiveness, and efficiency, and different methods behave differently in terms of availability of software source code (e.g., white or black boxes), expressiveness of learned grammars, type of predictor oracle available, and so forth. The current state-of-the-art grammar learning methods generally assume that there is a set of initial valid inputs, given an initial set of inputs α, by expanding the grammar range continuously to obtain the target grammar, where the inputs are typically given at the character level, such as a string or byte level file. However, the valid input of the program is sometimes not available, and if a good distribution of valid inputs is required, such input may not be obtained, and the above-described grammar learning method cannot be used in the case where the input is not obtained. Generating an efficient character-level input for grammar learning using symbol execution may solve the above-described problems, but learning processes based on character-level input are not efficient and may even limit the effectiveness of the learned grammar.
Disclosure of Invention
Aiming at the technical problems existing in the prior art, the invention provides the token-level automatic generation method for the program input grammar, which has the advantages of simple implementation method, high grammar generation effectiveness and efficiency and complete grammar, and can remarkably improve the automatic test capability of receiving complex input software and improve the effectiveness and efficiency of an automatic test tool.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
A token-level program input grammar automatic generation method comprises the following steps:
s01, obtaining a character string conforming to a grammar of an analysis program as initial input and performing symbol execution, generating a token sequence set corresponding to the analysis program, and collecting the upper and lower boundary ranges of each token;
s02, carrying out regular expression generalization on the token sequence set to generate a regular expression;
S03, combining specified non-terminal symbols in the regular expression obtained through generalization, and converting the non-terminal symbols into a context-free grammar;
s04, using the upper and lower boundary ranges of the token as constraints, performing character generalization on a character level generation formula of each token in the context-free grammar to obtain final grammar output, wherein the character level generation formula is used for representing a value of a character level corresponding to each token.
Further, in the step S01, a test case automatic generation tool GADSE (Grammar-Agnostic Dynamic Symbolic Execution) is used to automatically generate a token sequence set that can be received by the parsing program and to collect the upper and lower bound ranges corresponding to each token.
Further, the regular expression generalization in the step S02 uses repeated generalization and alternate generalization, where the repeated generalization is to repeat a given substring in a grammar, and the alternate generalization is to decompose the substring in the grammar and form alternations.
Further, performing regular expression generalization in step S02 also uses exchange generalization by dividing the current expression L into three parts and exchanging the first and third parts therein.
Further, each generalization in the step S02 further comprises a step of testing each possible token regular expression, wherein the step comprises the steps of generating a representative token sequence, and then converting the generated token sequence into character-level input by using the upper and lower bound ranges of the token as constraints so as to test the validity of the current token sequence.
Further, the step S03 includes traversing all non-terminal symbols in the generalized regular expression, judging whether the non-terminal symbols are equivalent, if so, merging the equivalent non-terminal symbols into a non-terminal symbol, and the merged regular expression is the context-free grammar obtained by conversion.
Further, in the step S04, the upper limit and the lower limit of each token character are calculated by using SMT, and each token character is generalized to a value between the upper limit and the lower limit, so as to implement character generalization.
Further, after the character is generalized in the step S04, the method further includes placing the generalized character value into a sequence in which the token character appears and checking the validity of the character value to check the generalized character value.
A token-level program input grammar automatic generation device, comprising:
the symbol execution module is used for acquiring a character string conforming to the grammar of the analysis program as initial input and executing symbols, generating a token sequence set corresponding to the analysis program and collecting the upper and lower boundary ranges of each token;
The first generalization module is used for generalizing the regular expression of the token sequence set to generate a regular expression;
The second generalization module is used for merging specified non-terminal symbols in the regular expression obtained by generalization and converting the non-terminal symbols into context-free grammar;
And the third generalization module is used for generalizing characters of a character level generating formula of each token in the context-free grammar by using the upper and lower boundary ranges of the token as constraints to obtain final grammar output, wherein the character level generating formula is used for representing a value of a character level corresponding to each token.
A computer readable storage medium storing a computer program which when executed performs a method as described above.
Compared with the prior art, the invention has the advantages that:
1. The invention generates a token sequence for a program with a complex input format by using a program symbol irrelevant to a token, simultaneously collects the upper and lower boundary ranges of each token, synthesizes the input token by using the generated token sequence, firstly generalizes the token into a regular expression, then carries out merging operation on non-terminal symbols in the regular expression, converts the non-terminal symbols into a context-free grammar, then carries out character generalization on a character level value corresponding to each token in the context-free grammar by combining the upper and lower boundary ranges of the token values to further expand the grammar range, so as to obtain final grammar output, realize token-level grammar synthesis, and effectively improve the effectiveness and efficiency of grammar synthesis and the integrity of grammar generation based on token-level synthesis compared with the traditional character-level input learning mode.
2. The invention further provides a large amount of initial inputs with good distribution for grammar learning by combining the test case automatic generation technology and using GADSE, further improves the automation degree and effect of grammar synthesis, realizes grammar synthesis based on token level, and can save a large amount of partition time of GADSE by representing a character sequence in the analysis process and realize faster and higher-level grammar synthesis.
3. The invention further considers the generation rule of the non-terminal symbol, introduces new exchange generalization operation on the basis of repeated and alternate operation, and realizes the exchange generalization by dividing the expression into three parts and exchanging the first part and the third part in the expression, thereby further improving the precision of grammar generation.
4. The invention uses the symbols to execute the collected token constraint and converts the token constraint into the SMT optimization problem, so that the SMT optimization can be used for generalizing the possible character-level values of the token, and the grammar accuracy can be further improved.
Drawings
Fig. 1 is a schematic flow chart of an implementation of the token-level automatic generation method of the program input grammar according to the present embodiment.
FIG. 2 is a flowchart of the token-level automatic generation method for a program input grammar according to the present embodiment.
Fig. 3 is a schematic diagram of an exemplary procedure used in a specific application embodiment of the present invention.
FIG. 4 is a schematic representation of token constraints obtained using the GADSE tool obtained in an embodiment of a specific application of the present invention.
FIG. 5 is a schematic representation of a token sequence obtained using the GADSE tool obtained in an example of a specific application of the present invention.
FIG. 6 is a schematic diagram of the grammatical combination of results obtained in an embodiment of the invention.
Detailed Description
The invention is further described below in connection with the drawings and the specific preferred embodiments, but the scope of protection of the invention is not limited thereby.
As shown in fig. 1 and 2, the method for automatically generating a token-level program input grammar according to the present embodiment includes the steps of:
s01, obtaining a character string conforming to a grammar of an analysis program as initial input and performing symbol execution, generating a token sequence set corresponding to the analysis program, and collecting the upper and lower boundary ranges of each token;
S02, carrying out regular expression generalization on the token sequence set to generate a regular expression;
S03, combining specified non-terminal symbols in the generalization obtained regular expression, and converting the non-terminal symbols into a context-free grammar;
S04, using the upper and lower boundary ranges of the token as constraints, performing character generalization on the character level generation formula of each token in the context-free grammar to obtain final grammar output, wherein the character level generation formula is used for representing the value of the character level corresponding to each token.
In lexical analysis, the input is first marked as a token sequence, then the token sequence is checked according to the grammar rules in the grammar analysis, and the token sequence is closer to the generation formula in the grammar, so that for grammar learning, if the grammar can be learned at the token level, i.e. the grammar learning is performed based on the token sequence, the learning process is more efficient, and the result is more accurate and complete.
In the embodiment, a character string conforming to a grammar of an analysis program is used as initial input, symbol execution is performed on the initial input, so that a legal token sequence set corresponding to the analysis program is automatically generated, namely, a token sequence is generated for a program with a complex input format by using a program symbol execution independent of the grammar, the upper and lower bound ranges of each token are collected, then the input grammar is synthesized by using the generated token sequence, the token sequence is subjected to generalization operation firstly, non-terminal characters in the generalization obtained regular expression are subjected to merging operation again, the recursion attribute of the grammar is obtained, the context-free grammar is converted, and then the character-level value corresponding to each token in the context-free grammar is subjected to character generalization by combining the upper and lower bound ranges of the token value, so that the final grammar output is obtained, and the grammar synthesis of the token level can be realized based on symbol execution.
As a white box learning method, compared with the traditional character-level input learning method, the token-level grammar learning method based on symbol execution can effectively improve the effectiveness and efficiency of grammar synthesis, realize more effective grammar synthesis, improve the integrity of grammar generation based on token-level synthesis, and be particularly suitable for automatic input grammar learning generation of analysis programs with good grammar distribution and unavailable input samples.
In step S01 of this embodiment, the test case automatic generation tool GADSE is used to automatically generate a set of token sequences that can be received by the analysis program, and collect the upper and lower bound ranges corresponding to each token, and the upper and lower bound ranges corresponding to each token are used as constraint values corresponding to each token. By combining the automatic test case generation technology, GADSE can provide a large number of well-distributed initial inputs for grammar learning, and further improve the automation degree and effect of grammar synthesis.
Taking the example grammar as in fig. 3, where the upper part is the example grammar, the lower part is the parsing program of the grammar, which in effect implements a context dependent grammar, where the grammar in the example program (.+ -.) +, (..+ -.) is a regular expression symbol, < T > represents a single character, < ID > represents a capitalized identification symbol of length greater than 1, < Number > must be a single digit Number, and < Sep > represents a special split symbol. The grammar requires that the tags should have the same name, e.g., < a > AA < a > be valid, but < a > AA < b > be invalid. This may be accomplished using a recursive descent parser and using a stack to check if the tags have the same name.
In a specific application embodiment, an initial input "< a >; < a >" is first given, and then symbol execution is performed on the initial input using GADSE, so as to generate a legal token sequence set corresponding to the parsing program and constraint values corresponding to each token. There are 6 token values in the example program. GADSE first generate constraints for the token in the parser, where the token is 1 in length, as shown in FIG. 4. Then GADSE generates a path constraint at the token level and solves to obtain a token sequence. By generating a new token sequence to search the token-level path space GADSE, the path space of the parser can be explored more efficiently. As shown in FIG. 5, the upper portion represents the initially entered token-level path constraints and the lower portion represents GADSE exploration of the two valid token sequences generated by parsing the program path space.
In this embodiment, repeated (repetition) generalization and alternate (alternation) generalization are specifically used when regular expression generalization is performed, where the repeated generalization is to repeat a given substring (for example, aa may be generalized to a (a)) in a grammar, and the alternate generalization is to decompose the substring in the grammar and form alternates (for example, (ab) may be generalized to a (a|b)). For the token sequence generated in step S01, a regularization operation is applied to each token sequence with the token sequence as an initial input, and candidate regularization grammars are iteratively generated. The candidate grammars are then checked for correctness by generating new inputs from the candidate grammars and checking whether the new inputs are accepted by the parser. And continuing to apply generalization operation to the candidate grammar passing the inspection until generalization cannot be performed, obtaining a final regular grammar, and then proceeding to step S03.
In step S02 of this embodiment, a black box grammar synthesizer Glade is specifically used to perform regular expression generalization, where Glade starts from a set of valid inputs, and for each input i, glade gradually extends i to a regular expression e, where the generalization process is a series of regular expressions L0. For each 0.ltoreq.i.ltoreq.n-1,Each generalization attempts to expand Li, but at the same time avoids excessive generalization. Glade provides two generalization operations, repetition (iteration) and alternation (alternation), wherein Li is first divided into three parts when repeating generalization, i.e., li=α1α2α3, and then the second part and the third part are repeated as follows:
α1([α2]alt)*[α3]rep (1)
wherein the expressions in [ ] rep, [ ] alt respectively denote that repetition, alternation, is to be used to generalize the expressions respectively.
Since there may be multiple partitioning scenarios, it is further possible to check new regular expressions by generating several strings for the new expression and checking whether these strings are acceptable to the program. If all strings are valid, the new expression is taken as a candidate generalization expression. For example, for the example program, assuming i is < a > AA;0<a > and using repetition to generalize i, the following four candidate rules can be derived and checked:
<a>([AA;]alt)*[0<a>]rep (2)
<a>([A]alt)*[A;0<a>]rep
<a>A([A]alt)*[;0<a>]rep
<a>AA([;0]alt)*[<a>]rep
Glade further selects the repeat of greatest length, i.e., the first expression (AA; length 3), as the generalization expression.
The alternating operation is to divide Li into two parts, i.e. li=a1α2, and recursively apply repetition and alternation to the first part and the second part, respectively, i.e. the rule is:
1]rep+[α2]alt (3)
As with the repeated operation, the present embodiment also examines different candidates at the time of the alternate operation, and selects one of the candidates as the candidate generalization expression li+1. For example, for the expression in formula (2), if AA cannot be separated, only the expression needs to be repeated to obtain the expression:
<a>([AA;]rep)*[0<a>]rep (4)
for each generalization step, using Glade can ensure that Li is a subset of li+1, and can also be made as less extensive as possible by examining the possible generalization expressions. Under the default condition, the initial input i is firstly generated repeatedly by AA, and the generalization process of O is as follows:
The above process has a total of 9 steps, based on which Glade generates the following for the final regular expression:
S:=<a>(S1)*0<a> S1:=(A)+; (5)
and Glade can also generalize a single character by trying all possible characters, e.g., the context free grammar described above can be generated to the following grammar:
S:=<a>(S1)*(0|...|9)<a> S1:=(A|...|Z)+(:|;) (6)
The final context-free grammar described above approximates the grammar in fig. 2. Glade can generalize one grammar for each input for multiple initial valid inputs and unify the generated grammars into one as a result. Glade, however, needs to assume that the initial input is valid, the distribution of valid inputs also determines the integrity of the integrated grammar, and if no grammar is entered, it is difficult to obtain well-distributed valid inputs, such as inputs that cover all production rules and grammar terminators. The present embodiment can obtain a well-distributed effective input by using Dynamic Symbol Execution (DSE) as an input for program generation. Since Glade is synthesized on a character level, in each generalization step, there are many possible partitions at the character level, so Glade requires much time to check the validity of each partition, which also limits the generalization capability of Glade, in the embodiment, grammar synthesis is realized based on a token level, one token represents a character sequence in the parsing process, and Glade does not need to partition the character string of the token, so that a large amount of partition time can be saved, and faster and higher-level grammar synthesis can be realized.
The production rules considering non-terminal < Content > are:
<Cpmtent>:=<Value>(<Sep><Content>)* (7)
there are two cases of < value >, namely, a sign or a number.
If the current expression is AA, 0, the current generalization operation of Glade limits the generalization capability because only sub-expressions (e.g., AA; or 0) can be repeated, however if generalization is desired to the following expression:
<Value>(<Sep><Content>)* (8)
It is desirable to have both 0 and AA (or AA; and 0), which is obtained by exchanging the first part and the third part. Thus, the generalization range can be extended by exchanging the first and third parts, generalizing AA, regular expressions of 0 to < content >.
The present embodiment uses the above feature to perform regular expression generalization in step S02, and also uses exchange ([ ] exh) generalization, that is, introduces an exchange generalization operation to perform generalization, where the exchange generalization is represented by dividing the current expression L into three parts and exchanging the first part and the third part thereof:
((([α1]rep+[α3]rep)[α2]rep)*([α1]rep+[α3]rep)) (9)
For example, using a swap operation, a token sequence may be generalized to the following three candidate canonical token expressions:
LTR([IS]alt)*[NLTR]rep (10)
LTRI([SN]alt)*[LTR]rep
LTR([ISN]exh)[LTR]rep
For simplicity, in the above expression we use the first letter of the second part of each token value to represent a token, e.g., S stands for T_Sep.
The embodiment further includes examining each possible token regular expression in each generalization step described above, specifically including first generating a representative token sequence and then converting the token sequence into character-level input by token value constraints (upper and lower bound ranges) and character-level constants (collected in GADSE) to examine its validity. For example, LTRNSILTR (used to test the last expression) and LTRISISNLTR are converted to < a >0, AA < a > and < a > AA, AA;0<a >, respectively, which are all accepted by the parser, i.e., are all valid.
In this embodiment, after checking the token validity, the generalization process of preferentially selecting the exchange generalization to continue generalization, LTRISNLTR may be specifically expressed as:
Wherein each letter represents a token corresponding to a first letter of the second portion of each token value.
The following context-free grammar can be obtained according to the steps:
S:=(LTR S1 LTR)S1:=(S2 S)*S2 S2:=S3|S4 S3:=I S4:=N
In this embodiment, step S03 further merges non-terminal symbols in the regular expression in step S02 to convert the non-terminal symbols into context-free related grammar, and the steps include traversing all non-terminal symbols in the regular expression obtained by generalization, judging whether the non-terminal symbols are equivalent, if so, merging the equivalent non-terminal symbols into one non-terminal symbol, and the merged regular expression is the context-free grammar obtained by conversion.
As in the context-free grammar described above, S1, S2, S3, and S4 are combinable and can be replaced with S1. In addition, for the first generating rule, S and S1 are combinable, but S and S2 are not combinable, so the present embodiment combines S and S1 only in the first rule, and the following syntax can be obtained after the combination:
S:=(LTR S LTR)|S1 S1:=S1SS1|I|N。
In the embodiment, step S04 further expands the grammar range by performing character generalization on each token non-terminal symbol in the grammar result in step S3 in combination with the upper and lower limits of the token value, so as to obtain the final grammar. For each token character t, using a grammar combination to generalize possible character values, and after character generalization, further comprising placing the generalization character values into sequences in which token characters appear and checking the validity of the character values to check the generalization character values.
In step S04 of this embodiment, the upper limit and the lower limit of each token character are calculated specifically using SMT, and each token character is generalized to a value between the upper limit and the lower limit, so as to implement character generalization. By performing the collected token constraints using symbols and translating to SMT optimization problems, the SMT optimization can be utilized to generalize the possible character-level values of the token, which can further improve the accuracy of the grammar.
For example, token I has a character value of "AA", which can be generalized to regular expression (A) +, and further, the character level value of each token is generalized by the token constraint collected in the first stage Gadse, the upper and lower limits of each character are calculated using SMT optimization theory, and the token is generalized to a value between the upper and lower limits. For example, for a token value L, it is constrained to be s [0] = '<'; therefore, its value can only be '<'. The constraint for the token value N is s [0 ]. Gtoreq.0 '. Gtoreq.s [0 ]. Gtoreq.9', so that the value of N is between '0' and '9'. Finally, based on the above steps, a context-free grammar of the following formula (12) can be synthesized, which is equivalent to the grammar in fig. 4.
In a specific application embodiment, taking the last token sequence in fig. 3 as an example for synthesis, the detailed steps for synthesizing the token sequence include:
Step 1, repeating, replacing and exchanging the last token sequence, namely LTRILTR, based on the synthesis of the token;
Step 2, testing each possible token regular expression in each generalization step, namely firstly generating a representative token sequence, and then converting the token sequence into character-level input through token value constraint and constant to test validity.
And step 3, carrying out merging operation on non-terminal symbols in the regular expression obtained after regularization to obtain recursive attributes of the grammar, and converting the recursive attributes into context-free grammar.
And 4, generalizing the character level value of each token value by using the token constraint (upper and lower bound range) collected in the step S01 through GADSE, wherein the upper bound and the lower bound of each character in the token are calculated by using an SMT optimization theory, the token is generalized into a value between the lower bound and the upper bound, and the final context-free grammar output can be synthesized based on the token sequence and the token value constraint.
In a specific application embodiment, the above grammar generation is implemented on a JPF-based grammar-independent concolic execution engine and Glade, which is applied to a certain Java parser, and the obtained grammar result is shown in fig. 6. From fig. 6, it can be seen that the grammar finally generated by the method is consistent with the target grammar, and the method can greatly improve the efficiency of grammar synthesis and the integrity of the grammar compared with the conventional grammar constraint based on the character level. Specifically, compared with state-of-the-arts (i.e. Glade and Arvada) equipped with character-level symbol execution, the F1 fraction of the method is respectively improved by 5.24 times and 3.54 times, and compared with Arvada, the method can realize 141.93 times time acceleration in grammar synthesis, i.e. the method can effectively improve the precision and recall rate of grammar generation.
The token-level program input grammar automatic generation device of the embodiment comprises:
the symbol execution module is used for acquiring a character string conforming to the grammar of the analysis program as initial input and executing symbols, generating a token sequence set corresponding to the analysis program and collecting the upper and lower boundary ranges of each token;
the first generalization module is used for carrying out regular expression generalization on the token sequence set to generate a regular expression;
the second generalization module is used for merging specified non-terminal symbols in the regular expression obtained by generalization and converting the non-terminal symbols into context-free grammar;
And the third generalization module is used for generalizing characters of a character level generating formula of each token in the context-free grammar by using the upper and lower boundary ranges of the token as constraints to obtain final grammar output, wherein the character level generating formula is used for representing a value of a character level corresponding to each token.
The automatic generating device of the token-level program input grammar and the automatic generating method of the token-level program input grammar are in one-to-one correspondence, and are not described in detail herein.
The present embodiment also provides a computer-readable storage medium storing a computer program which, when executed, implements the above method.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. While the invention has been described with reference to preferred embodiments, it is not intended to be limiting. Therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention shall fall within the scope of the technical solution of the present invention.

Claims (8)

1. The token-level automatic generation method for the program input grammar is characterized by comprising the following steps:
s01, obtaining a character string conforming to a grammar of an analysis program as initial input and performing symbol execution, generating a token sequence set corresponding to the analysis program, and collecting the upper and lower boundary ranges of each token;
s02, carrying out regular expression generalization on the token sequence set to generate a regular expression;
S03, combining specified non-terminal symbols in the regular expression obtained through generalization, and converting the non-terminal symbols into a context-free grammar;
S04, using the upper and lower boundary ranges of the token as constraints, performing character generalization on a character level generation formula of each token in the context-free grammar to obtain final grammar output, wherein the character level generation formula is used for representing a value of a character level corresponding to each token;
Performing regular expression generalization in the step S02, wherein repeated generalization is used for repeated generalization, the repeated generalization is repeated to form sub-strings given in a grammar, and the repeated generalization is performed to decompose the sub-strings in the grammar and form alternation;
In the step S04, the upper limit and the lower limit of each token character are calculated by using SMT, and each token character is generalized to a value between the upper limit and the lower limit, so as to realize character generalization.
2. The method according to claim 1, wherein in the step S01, the test case automatic generation tool GADSE is used to automatically generate a token sequence set that can be received by the parsing program and to collect the upper and lower bounds corresponding to each token.
3. The method according to claim 1, wherein the regular expression generalizing in step S02 further uses a swap generalizing by dividing the current expression L into three parts and swapping the first and third parts therein.
4. The method according to claim 1, wherein each generalization in the step S02 further comprises a step of testing each possible token regular expression, including generating a representative token sequence, and then converting the generated token sequence into character-level input to test validity of a current token sequence using upper and lower bound ranges of the token as constraints.
5. The method for automatically generating a token-level program input grammar according to any one of claims 1 to 4, wherein the step S03 includes traversing all non-terminators in the generalized regular expression, determining whether there are non-terminators equivalent thereto, if so, merging the equivalent non-terminators into one non-terminators, and the merged regular expression is the context-free grammar obtained by the transformation.
6. The method according to claim 5, wherein after the character is generalized in the step S04, the method further comprises putting the generalized character value into a sequence in which the token character appears and checking validity of the character value to check the generalized character value.
7. A token-level program input grammar automatic generation device, comprising:
the symbol execution module is used for acquiring a character string conforming to the grammar of the analysis program as initial input and executing symbols, generating a token sequence set corresponding to the analysis program and collecting the upper and lower boundary ranges of each token;
The first generalization module is used for generalizing the regular expression of the token sequence set to generate a regular expression;
The second generalization module is used for merging specified non-terminal symbols in the regular expression obtained by generalization and converting the non-terminal symbols into context-free grammar;
the third generalization module is used for generalizing characters of a character level generation formula of each token in the context-free grammar by using the upper and lower boundary ranges of the token as constraints to obtain final grammar output, wherein the character level generation formula is used for representing a value of a character level corresponding to each token;
performing regular expression generalization in the first generalization module, wherein repeated generalization is used for alternate generalization, the repeated generalization is given by a sub-character string in a repeated grammar, and the alternate generalization is to decompose the sub-character string in the grammar and form alternation;
in the third generalization module, an upper limit and a lower limit of each token character are calculated by using SMT, and each token character is generalized to a value between the upper limit and the lower limit, so that character generalization is realized.
8. A computer readable storage medium storing a computer program, wherein the computer program when executed implements the method of any one of claims 1-6.
CN202210259489.9A 2022-03-16 2022-03-16 A token-level method for automatically generating program input grammar Active CN114706760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210259489.9A CN114706760B (en) 2022-03-16 2022-03-16 A token-level method for automatically generating program input grammar

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210259489.9A CN114706760B (en) 2022-03-16 2022-03-16 A token-level method for automatically generating program input grammar

Publications (2)

Publication Number Publication Date
CN114706760A CN114706760A (en) 2022-07-05
CN114706760B true CN114706760B (en) 2024-12-06

Family

ID=82168338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210259489.9A Active CN114706760B (en) 2022-03-16 2022-03-16 A token-level method for automatically generating program input grammar

Country Status (1)

Country Link
CN (1) CN114706760B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102141959B (en) * 2011-03-15 2013-10-30 中国科学院研究生院 Test case generation method restrained by context-free grammar
WO2017030538A1 (en) * 2015-08-14 2017-02-23 Hewlett Packard Enterprise Development Lp Dynamic lexer object construction
US11442932B2 (en) * 2019-07-16 2022-09-13 Thoughtspot, Inc. Mapping natural language to queries using a query grammar
CN110543421B (en) * 2019-08-31 2022-03-29 华南理工大学 Unit test automatic execution method based on test case automatic generation algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
面向解析程序的符号执行研究;潘威宇;硕士电子期刊出版;20240715(第07期);全文 *

Also Published As

Publication number Publication date
CN114706760A (en) 2022-07-05

Similar Documents

Publication Publication Date Title
CN113051285B (en) SQL sentence conversion method, system, equipment and storage medium
EP0179334A2 (en) System for generating a translator program
CN113504900B (en) Programming language conversion method and device
Salomon et al. Scannerless NSLR (1) parsing of programming languages
US11599447B2 (en) Detection of runtime errors using machine learning
Chattopadhyay Compiler design
CN115576603B (en) Method and device for acquiring variable values in code segment
CN114706760B (en) A token-level method for automatically generating program input grammar
CN108563561A (en) A kind of program recessiveness constraint extracting method and system
CN117971236B (en) Operator parsing method, device, equipment and medium based on lexical and grammatical analysis
CN119179520A (en) Conversion method, device and storage medium based on RISC-V architecture built-in function
CN112651197B (en) Circuit partitioning preprocessing method and gate-level circuit parallel simulation method
KR101229677B1 (en) Method and apparatus for the determination of a repetitive bit value pattern
CN114090017B (en) Method and device for analyzing programming language and nonvolatile storage medium
US8095912B2 (en) Testing a context-free language compiler
CN119026556B (en) Integrated circuit IC modeling method based on graph theory
CN117010019B (en) Data desensitization method and system based on NLP language model
EP4421621B1 (en) Method and system for matching source code and binary code
Khan et al. GNN-Based Transfer Learning and Tuning for Detecting Code Vulnerabilities across Different Programming Languages
CN119088822B (en) Method, system and equipment for converting domain model language and general decision table
Kukluk Inference of node and edge replacement graph grammars
Spasova et al. Development of a Java Syntax Analyzer for C/C++ Code Recognition
JPH11259482A (en) Machine translation of compound nouns
CN118796712A (en) A domain-sensitive program slicing method based on high-order access paths
Madhavan et al. Towards automating grammar equivalence checking

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant