Disclosure of Invention
Therefore, the invention provides a software fuzzy test method and device based on path record truncation, which can improve the test efficiency and test coverage rate of deep codes and have strong engineering application prospect.
According to the design scheme provided by the invention, the software fuzzing test method based on path record truncation comprises the following contents:
A) constructing a project data set and extracting a condition code structure, performing model input data processing on the extracted condition code structure, and performing model training as the input of a low-frequency path transfer condition code structure classification model, wherein the classification model adopts an LSTM network model structure;
B) adding a path truncation mark and a mark checking instruction in a pile inserting code of the fuzzy tester;
C) and extracting a condition code structure and carrying out model input data processing aiming at the program to be tested, inputting the processed data as a trained classification model, identifying a low-frequency path transfer condition code structure, carrying out source code level instrumentation at a corresponding position in a source file, carrying out path truncation according to a path truncation mark and finishing source code fuzzy test.
In the above, in a) extracting the condition code structure, firstly, defining a symbolic description facing a source code data set and extracting all condition code structures in the data set, then, performing code analysis on the extracted condition code structures, and performing labeling processing on an analysis result to obtain a code token sequence; and carrying out vector conversion on the token sequence to obtain input data of the classification model.
Preferably, in a), all condition code structures in the data set are extracted, and the following contents are included: firstly, preprocessing a source code data set and extracting effective codes; and then, extracting a condition structure set from the effective codes, constructing a stack structure, recording the position corresponding relation between the code segments and the source code data set, and performing iterative processing on the nested condition structure set according to the stack structure to obtain a minimized code segment.
Preferably, in a), the code parsing process includes the following steps: and extracting a code sequence by adopting an abstract syntax tree, performing symbolization processing, grouping according to the meaning of the entries in the code sequence, acquiring a synonym set, and expanding the code sequence.
Preferably, in a), vector conversion is performed on the token sequence, and the method includes the following steps: performing text vectorization conversion by using word2vec, and obtaining a word vector model by setting feature vector dimensions and word frequency parameter output; and acquiring a dictionary index dictionary and a word vector dictionary according to the word vector model, and acquiring classification model input data according to the dictionary index dictionary and the word vector dictionary.
In the above, in the process of adding the path truncation flag and the flag check instruction, the path truncation flag is inserted into the source code data set, and the path truncation flag is defined in the bss section of the code; and a path truncation marking check is performed at the original stake entry.
Preferably, in step C), for the test program, the error handling code segments identified by the classification model in the training process are instrumented by combining the condition code structure and the position index in the source code data set in an internal and external instrumentation manner.
Further, in C), for the case that the path truncation flag is after the instrumentation code, the original instrumentation of the current condition is cancelled, the instrumentation is performed with a null instruction, and an annotation flag is added at the instruction annotation.
Further, in C), the instrumentation is set to cancel instrumentation at the marked branch for the compilation optimization problem of the same conditional statement in different source code datasets.
A software fuzzing test device based on path record truncation comprises: a training module, a labeling module, and a testing module, wherein,
the training module is used for constructing a project data set, extracting a condition code structure, performing model input data processing on the extracted condition code structure, and performing model training as the input of a low-frequency path transfer condition code structure classification model, wherein the classification model adopts an LSTM network model structure;
the marking module is used for adding a path truncation mark and a mark checking instruction in the instrumentation code of the fuzzy tester;
and the test module is used for extracting a condition code structure and carrying out model input data processing aiming at the program to be tested, inputting the processed data as a trained classification model, identifying a low-frequency path transfer condition code structure, carrying out source code level instrumentation at a corresponding position in a source file, carrying out path truncation according to a path truncation mark and finishing source code fuzzy test.
The invention has the beneficial effects that:
1. aiming at the problem that the high-frequency path influences the fuzzy testing efficiency, the low-frequency path transfer condition code recognition is carried out by adopting a deep learning neural network, source code level instrumentation is carried out on the transfer condition code structure obtained by the training model recognition, and the path record is cut off according to the mark code, so that the probability of low-frequency sample variation is increased, and the fuzzy testing efficiency is finally improved.
2. According to the method, the low-frequency path transfer condition codes are recognized before the program is specifically executed, the high-frequency path sample test is cancelled by adopting the path truncation strategy, the blank of the high-frequency path sample in the aspect of influencing the analysis is filled, the complicated dynamic analysis technology is not relied on, the overhead problem is not caused, the method can be effectively combined with other ash box test technologies, the coverage rate is further improved on the basis of the original test tool, and the method has an important guiding significance on the development of the software test technology.
The specific implementation mode is as follows:
in order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the present invention is further described in detail below with reference to the accompanying drawings and technical solutions.
In view of the situations of limiting the low-frequency path test probability and limiting the overall test efficiency in the existing software fuzz test, in the embodiment of the present invention, referring to fig. 1, a software fuzz test method based on path record truncation is provided, which includes the following contents:
s101, constructing a project data set and extracting a condition code structure, performing model input data processing on the extracted condition code structure, and performing model training as the input of a low-frequency path transfer condition code structure classification model, wherein the classification model adopts an LSTM network model structure;
s102, adding a path truncation mark and a mark checking instruction in a pile inserting code of the fuzzy tester;
s103, extracting a condition code structure and performing model input data processing aiming at the program to be tested, inputting the processed data as a trained classification model, identifying a low-frequency path transfer condition code structure, performing source code level instrumentation at a corresponding position in a source file, performing path truncation according to a path truncation mark, and completing source code fuzzy test.
Referring to fig. 2, by constructing a data set, extracting all condition code structures in the data set, and performing code analysis on the code structures; preprocessing the analysis result to extract a code token sequence and performing vector conversion on the token sequence; generating a label by the code, and taking the generated vector as the input of an LSTM neural network model to train a low-frequency path transfer condition code structure classification model; adding a path truncation mark and a mark checking instruction in a pile inserting code of the fuzzy tester; and carrying out condition code structure extraction on the test program to obtain a low-frequency path transfer condition code structure, carrying out light-weight source code instrumentation on the corresponding position of the identified structure in the source file, and starting to test the program to be tested.
Most input parsing programs usually have a large number of format checks, and there is a corresponding error handling structure for failure of the check, and such structure usually results in a coverage rate that is difficult to increase. Therefore, the error handling code belongs to a most representative type of low frequency path branch condition code, and refers to a code segment executed when a program input causes a program to be incorrectly exited due to various different reasons. In the extraction of the conditional code structure, in another embodiment of the invention, firstly, the symbolic description facing to the source code data set is defined, all the conditional code structures in the data set are extracted, then, the extracted conditional code structures are subjected to code analysis, and the analysis result is subjected to labeling processing to obtain a code token sequence; and carrying out vector conversion on the token sequence to obtain input data of the classification model. Preferably, all condition code structures in the extracted data set include the following: firstly, preprocessing a source code data set and extracting effective codes; and then, extracting a condition structure set from the effective codes, constructing a stack structure, recording the position corresponding relation between the code segments and the source code data set, and performing iterative processing on the nested condition structure set according to the stack structure to obtain a minimized code segment. Specifically, the descriptors are defined as shown in table 1.
TABLE 1 symbolic description
Firstly to S
oPreprocessing is performed to remove some unnecessary information, such as code comments, line breaks, and the like. Here, R is used for efficient code extraction to obtain S
n. Then to the processed S
nExtraction of I
eThere is proposed a method based on bracket stack balancing, i.e.
C
l0, so that B can be identified by constructing a stack-like structure
lThen C is
s+1 when B is identified
r,C
s1, when C is
sWhen the ratio is 0, mixing
tIs added to I
e. And simultaneously recording the corresponding relation between the code segment and the source file position so as to facilitate the subsequent instrumentation processing. However, after extraction in this way I appears
n,I
nCan lead to false positives because of assumptions
And is
If it is considered to be I
nE is E, then
Creating a contradiction. Therefore, also need to be on
nAn iterative process is performed to ensure that the extracted code fragments are minimized. By comparing the first extracted I
eExtracting in the same manner until each structure is I
r。
In another embodiment of the present invention, the code parsing process includes the following steps: and extracting a code sequence by adopting an abstract syntax tree, performing symbolization processing, grouping according to the meaning of the entries in the code sequence, acquiring a synonym set, and expanding the code sequence.
The extracted code segments are parsed into word sequences, and all segments are parsed into sequences of equal length for input into the LSTM model. The code section parsing stage extracts a code sequence using an Abstract Syntax Tree (AST). And simultaneously performing symbolization processing, such as expressing an integer as num and expressing a character string as str. However, in this classification, the content of the character string has a certain influence. Most error handling code fragments contain a characteristic that if a code fragment contains a character string, the character string usually contains words with similar meanings such as error, fail and the like. Therefore, two ways are used to perform string symbolization, which is divided into whether special keywords are included, specifically denoted as errstr and str. Although a part of the misexpression vocabulary has been extracted by the previous analysis, the number is limited. In the embodiment of the invention, WordNet can be used, an English dictionary established and maintained by Princeton university is grouped by vocabulary entry meanings, each vocabulary entry with the same meaning forms a synonym set, and the vocabulary entry group in the synonym set can be used for expanding wrongly expressed vocabulary. Because of adopting the supervised learning method, each sample needs to be labeled, if the code segment belongs to the error processing code segment, the label is marked as 1, otherwise, the label is marked as 0. Here, a heuristic-based approach is used for labeling, and the following 5 heuristics are summarized through a large number of source file analyses: 1) comparisons are typically included in if (…); 2) if the character string is contained, the character string contains an error expression vocabulary; 3) may contain return or jump keywords such as return, goto, etc.; 4) if the function is contained, the error expression vocabulary is usually contained in the function name; 5) it may contain system error macro definitions such as 'EPERM', 'enont', etc.
In another embodiment of the present invention, vector conversion is performed on token sequences, which includes the following steps: performing text vectorization conversion by using word2vec, and obtaining a word vector model by setting feature vector dimensions and word frequency parameter output; and acquiring a dictionary index dictionary and a word vector dictionary according to the word vector model, and acquiring classification model input data according to the dictionary index dictionary and the word vector dictionary.
Vectorization is carried out by taking the obtained token sequence as input, a tool word2vec widely applied to text vectorization is used for conversion, a word vector model is obtained by setting feature vector dimensions and word frequency parameter output, and a dictionary index dictionary and a word vector dictionary are established according to the obtained model and are used as the input of a subsequent LSTM model. Referring to fig. 3, since different code segments contain different numbers of tokens, but the LSTM can only accept inputs of the same length, a padding and trimming process is required. After vectorization and code segment labeling of the code segments are obtained, training of the LSTM network can be started, and a Dropout layer is added in the model except essential matrixes such as an Embedding layer and an LSTM unit to prevent data overfitting results.
The detection stage is used for detecting the type of a given unknown code segment and outputting a file to which the code segment belongs and the position of the code segment in a source file if the code segment belongs to an error processing code. Given an unknown item, the specific detection process is as follows: 1. extracting an error processing code segment structure in each source file in the project and recording the file name and the position of the error processing code segment structure; 2. analyzing the code segments to obtain respective code sequences; 3. vectorizing the code sequence obtained in the last step according to rules by using a word2vec model obtained by early training; 4. and inputting the obtained vector into a trained LSTM network for judgment.
Further, in another embodiment of the present invention, in the process of adding the path truncation flag and the flag checking instruction, the path truncation flag is inserted into the source code data set, and the path truncation flag is defined in the bss section of the code; and a path truncation marking check is performed at the original stake entry.
The assembly code is embedded in the source code. And a mark continue _ log is set by inserting a mark into the source code to realize the function of canceling the record of the subsequent instrumentation. The flag is defined in the bss segment, and since the bss segment data is uninitialized data and the memory is cleared before each operation, the flag may be set to 1 to indicate that the subsequent basic block is not recorded any more. And (4) canceling the record of the basic block at the time by performing continueLog mark query at the entrance of the original instrumentation and jumping to a return point if the index is 1. The subsequent program always marks 1 in operation, so the subsequent basic block is no longer recorded, i.e. if the original path is 1 → 2 → 3 → …, if the mark is contained at 3, this time recording is 1 → 2. And when the next round of test starts, the mark is cleared, and the execution path can be recorded normally.
And (3) performing instrumentation on the error processing code segment by means of the error processing code segment identified by the error code classification model obtained by utilizing LSTM network training in the early stage and combining an index established by the structure and the position of the if-else in the source file. The method is realized by adopting an internal and external instrumentation mode, because if statements are compiled into conditional jump instructions, and instrumentation of the fuzzy tester is judged by the conditional jump instructions, before the if statements, namely, source code instrumentation outside an if structure can influence whether subsequent basic blocks are instrumented or not. Before the first statement of the if structure, that is, performing source code instrumentation in the if structure, the recording mode of the subsequent instrumented basic block may be determined, because the first statement belongs to the beginning of the jump basic block, and a conditional jump instruction may still exist subsequently, the first statement will determine to traverse the recording results of all the subsequent basic blocks on the path of the basic block. The following three cases are mainly distinguished:
the first case is where the error handling code is located in the if structure, in which case it is only necessary to do instrumentation before the if statement and before the first statement in the if structure.
The second case is where the error handling code is located in the else structure, in which case it needs to be instrumented before the first statement in the else structure and the adjacent if statement before the else structure. If the preamble structure is an else if structure, the else if structure needs to be converted into an if structure, and then instrumentation is performed before if.
In the third case, the error processing code is located in the else if structure, and the instrumentation is performed before the if statement and before the first statement in the else if structure after the structure conversion processing is also performed. When the else if structure is instrumented, the original code structure is damaged, so that the repair is needed to ensure the code to be compiled and run correctly. The source code structure repair algorithm in the embodiment of the invention can be designed as follows:
the algorithm can realize accurate compiling and running of the program.
Aiming at a test program, the error processing code segments identified by a classification model in a training process are inserted by combining a condition code structure and a position index in a source code data set in an internal and external insertion mode. Preferably, for the situation that the path truncation mark is behind the instrumentation code, the original instrumentation of the current condition is cancelled, the instrumentation is performed by adopting a null instruction, and an annotation mark is added at the instruction annotation position. Preferably, for the compiling optimization problem of the same conditional statement in different source code data sets, the marking branch is set to cancel the instrumentation.
If the continue _ log mark causes the basic block to be recorded after the instrumentation code, although the subsequent code block cannot be recorded continuously because the flag bit is already set to 1, the basic block record is still considered to generate a new path, so that the test sample is retained and the purpose of the targeted test is not achieved. It is therefore necessary in this case to cancel the original instrumentation of the current conditional jump. Null instructions (nop) are instrumented and tagged at instruction annotations, allowing for normal execution flow that does not affect the program. Since the assembly code comments are not cleared after the source code is compiled into assembly code, the decision can be made using the comment tag. And when meeting the annotation mark, assigning the instrumentation mark, then judging the instrumentation according to the instrumentation mark when meeting the conditional jump instruction, then clearing the instrumentation mark, and circulating the process until all code instrumentation is finished. Therefore, the subsequent code block skipping instrumentation process is realized. Due to the problem of compilation optimization, that is, when the same if (i | ═ 0) statement is assembled in different source files, cases such as jz and jnz may exist, and the original instrumentation only performs instrumentation at a negative jump. Therefore, the above situation is dealt with by adopting a mode of canceling the insertion at all the marked branches, and although one effective basic block record is canceled, the information of the effective path is not influenced.
Based on the software fuzzing test method, an embodiment of the present invention further provides a software fuzzing test apparatus based on path record truncation, as shown in fig. 4, including: a training module 101, a labeling module 102, and a testing module 103, wherein,
the training module 101 is used for constructing a project data set, extracting a condition code structure, performing model input data processing on the extracted condition code structure, and performing model training as the input of a low-frequency path transfer condition code structure classification model, wherein the classification model adopts an LSTM network model structure;
the marking module 102 is used for adding a path truncation mark and a mark checking instruction in the instrumentation code of the fuzzy tester;
the test module 103 is configured to extract a condition code structure and perform model input data processing for a program to be tested, input the processed data as a trained classification model, identify a low-frequency path transfer condition code structure, perform source code level instrumentation at a corresponding position in a source file, perform path truncation according to a path truncation flag, and complete a source code fuzzy test.
Unless specifically stated otherwise, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present invention.
Based on the foregoing method, an embodiment of the present invention further provides a server, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
Based on the above method, the embodiment of the present invention further provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the above method.
The device provided by the embodiment of the present invention has the same implementation principle and technical effect as the method embodiments, and for the sake of brief description, reference may be made to the corresponding contents in the method embodiments without reference to the device embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.