CN109885479B

CN109885479B - Software fuzzy test method and device based on path record truncation

Info

Publication number: CN109885479B
Application number: CN201910012433.1A
Authority: CN
Inventors: 宋晓斌; 柳晓龙; 王允超; 武泽慧; 魏强; 曹琰
Original assignee: PLA Information Engineering University
Current assignee: PLA Information Engineering University
Priority date: 2019-01-07
Filing date: 2019-01-07
Publication date: 2022-02-01
Anticipated expiration: 2039-01-07
Also published as: CN109885479A

Abstract

The invention belongs to the technical field of software testing, and in particular relates to a software fuzzy testing method and device based on path record truncation. The method comprises: constructing a project data set, extracting a conditional code structure, and obtaining a model input of a low-frequency path transition conditional code structure classification model Data, carry out model training, among which, the classification model adopts the LSTM network model structure; add path truncation marks and mark check instructions in the fuzzer instrumentation code; for the program to be tested, extract the conditional code structure and obtain the model input data, and transmit it to The trained classification model identifies the low-frequency path transition condition code structure, performs source code-level instrumentation at the corresponding position in the source file, truncates the path according to the path truncation mark, and completes the fuzzing test. The invention identifies the low-frequency path transition condition code before program execution and adopts the path truncation strategy to cancel the high-frequency path sample test, improves the efficiency and coverage of the fuzzy test, and has a strong engineering application prospect.

Description

Software fuzzy test method and device based on path record truncation

Technical Field

The invention belongs to the technical field of software testing, and particularly relates to a software fuzzy testing method and device based on path record truncation.

Background

Fuzz testing (Fuzzing) is an automated software testing technique that provides semi-valid data as input to a test program and monitors the program for anomalies. Due to the simplicity and high efficiency, the method is widely applied to various large software manufacturers and development and test of open source software, and a large number of bugs are found in various types of software. However, with the wide application of software security testing tools and the development awareness of code security, vulnerabilities often occur at locations where code structures are more complex. The existing fuzzing test has remarkable effect on mining code bugs with relatively simple code structures, but is often difficult to capture exceptions when the existing fuzzing test faces codes with complex structures. The reason for this problem is that most test samples perform the same high frequency path, whereas it is difficult to explore the low frequency path.

In order to solve the problems, researchers combine other vulnerability analysis technologies with the fuzzy test technology and successively put forward different fuzzy test methods. The method is mainly divided into a fuzzy test method based on symbolic execution, a fuzzy test method based on taint analysis and a fuzzy test method based on static analysis. The fuzz testing method based on symbolic execution is a fuzz testing technology combined with symbolic execution, and fuzz testing and selective concolic execution are utilized in a balanced mode to find out deeper errors. Selective concolic execution is used to test paths that the fuzz tester judges to be more "valuable" but obstructed. By combining the advantages of lightweight fuzz testing and concopic execution, the disadvantages of path explosion and fuzz imperfection inherent in symbol execution are avoided. A fuzzy test technology based on taint analysis is adopted, and a dynamic taint analysis technology is adopted to analyze which bytes in a test sample are mutated to more easily trigger the exploration of unknown codes, so that more targeted mutation is performed, and finally a better input sample is generated to realize the detection of deep codes. The fuzzy test method based on the static analysis adjusts the attention degree of different seeds by combining the program static analysis technology, optimizes the seed sequencing and selection strategy by utilizing the edge coverage rate information, thereby improving the test probability of low-frequency paths and improving the code coverage rate. Although the three fuzzy testing methods adopt different technologies to improve the low-frequency path testing probability, the testing of a high-frequency path sample still exists, so that the improvement of the low-frequency path testing probability is limited, and the improvement of the overall testing efficiency is not obvious.

Disclosure of Invention

Therefore, the invention provides a software fuzzy test method and device based on path record truncation, which can improve the test efficiency and test coverage rate of deep codes and have strong engineering application prospect.

According to the design scheme provided by the invention, the software fuzzing test method based on path record truncation comprises the following contents:

A) constructing a project data set and extracting a condition code structure, performing model input data processing on the extracted condition code structure, and performing model training as the input of a low-frequency path transfer condition code structure classification model, wherein the classification model adopts an LSTM network model structure;

B) adding a path truncation mark and a mark checking instruction in a pile inserting code of the fuzzy tester;

C) and extracting a condition code structure and carrying out model input data processing aiming at the program to be tested, inputting the processed data as a trained classification model, identifying a low-frequency path transfer condition code structure, carrying out source code level instrumentation at a corresponding position in a source file, carrying out path truncation according to a path truncation mark and finishing source code fuzzy test.

In the above, in a) extracting the condition code structure, firstly, defining a symbolic description facing a source code data set and extracting all condition code structures in the data set, then, performing code analysis on the extracted condition code structures, and performing labeling processing on an analysis result to obtain a code token sequence; and carrying out vector conversion on the token sequence to obtain input data of the classification model.

Preferably, in a), all condition code structures in the data set are extracted, and the following contents are included: firstly, preprocessing a source code data set and extracting effective codes; and then, extracting a condition structure set from the effective codes, constructing a stack structure, recording the position corresponding relation between the code segments and the source code data set, and performing iterative processing on the nested condition structure set according to the stack structure to obtain a minimized code segment.

Preferably, in a), the code parsing process includes the following steps: and extracting a code sequence by adopting an abstract syntax tree, performing symbolization processing, grouping according to the meaning of the entries in the code sequence, acquiring a synonym set, and expanding the code sequence.

Preferably, in a), vector conversion is performed on the token sequence, and the method includes the following steps: performing text vectorization conversion by using word2vec, and obtaining a word vector model by setting feature vector dimensions and word frequency parameter output; and acquiring a dictionary index dictionary and a word vector dictionary according to the word vector model, and acquiring classification model input data according to the dictionary index dictionary and the word vector dictionary.

In the above, in the process of adding the path truncation flag and the flag check instruction, the path truncation flag is inserted into the source code data set, and the path truncation flag is defined in the bss section of the code; and a path truncation marking check is performed at the original stake entry.

Preferably, in step C), for the test program, the error handling code segments identified by the classification model in the training process are instrumented by combining the condition code structure and the position index in the source code data set in an internal and external instrumentation manner.

Further, in C), for the case that the path truncation flag is after the instrumentation code, the original instrumentation of the current condition is cancelled, the instrumentation is performed with a null instruction, and an annotation flag is added at the instruction annotation.

Further, in C), the instrumentation is set to cancel instrumentation at the marked branch for the compilation optimization problem of the same conditional statement in different source code datasets.

A software fuzzing test device based on path record truncation comprises: a training module, a labeling module, and a testing module, wherein,

the training module is used for constructing a project data set, extracting a condition code structure, performing model input data processing on the extracted condition code structure, and performing model training as the input of a low-frequency path transfer condition code structure classification model, wherein the classification model adopts an LSTM network model structure;

the marking module is used for adding a path truncation mark and a mark checking instruction in the instrumentation code of the fuzzy tester;

and the test module is used for extracting a condition code structure and carrying out model input data processing aiming at the program to be tested, inputting the processed data as a trained classification model, identifying a low-frequency path transfer condition code structure, carrying out source code level instrumentation at a corresponding position in a source file, carrying out path truncation according to a path truncation mark and finishing source code fuzzy test.

The invention has the beneficial effects that:

1. aiming at the problem that the high-frequency path influences the fuzzy testing efficiency, the low-frequency path transfer condition code recognition is carried out by adopting a deep learning neural network, source code level instrumentation is carried out on the transfer condition code structure obtained by the training model recognition, and the path record is cut off according to the mark code, so that the probability of low-frequency sample variation is increased, and the fuzzy testing efficiency is finally improved.

2. According to the method, the low-frequency path transfer condition codes are recognized before the program is specifically executed, the high-frequency path sample test is cancelled by adopting the path truncation strategy, the blank of the high-frequency path sample in the aspect of influencing the analysis is filled, the complicated dynamic analysis technology is not relied on, the overhead problem is not caused, the method can be effectively combined with other ash box test technologies, the coverage rate is further improved on the basis of the original test tool, and the method has an important guiding significance on the development of the software test technology.

Description of the drawings:

FIG. 1 is a flow chart of a software fuzz testing method in an embodiment;

FIG. 2 is a schematic diagram of software fuzz testing in an embodiment;

FIG. 3 is a schematic diagram of a classification model constructed by combining word2vec and LSTM neural network models in the embodiment;

FIG. 4 is a schematic diagram of an embodiment of a software fuzz testing apparatus.

The specific implementation mode is as follows:

in order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the present invention is further described in detail below with reference to the accompanying drawings and technical solutions.

In view of the situations of limiting the low-frequency path test probability and limiting the overall test efficiency in the existing software fuzz test, in the embodiment of the present invention, referring to fig. 1, a software fuzz test method based on path record truncation is provided, which includes the following contents:

s101, constructing a project data set and extracting a condition code structure, performing model input data processing on the extracted condition code structure, and performing model training as the input of a low-frequency path transfer condition code structure classification model, wherein the classification model adopts an LSTM network model structure;

s102, adding a path truncation mark and a mark checking instruction in a pile inserting code of the fuzzy tester;

s103, extracting a condition code structure and performing model input data processing aiming at the program to be tested, inputting the processed data as a trained classification model, identifying a low-frequency path transfer condition code structure, performing source code level instrumentation at a corresponding position in a source file, performing path truncation according to a path truncation mark, and completing source code fuzzy test.

Referring to fig. 2, by constructing a data set, extracting all condition code structures in the data set, and performing code analysis on the code structures; preprocessing the analysis result to extract a code token sequence and performing vector conversion on the token sequence; generating a label by the code, and taking the generated vector as the input of an LSTM neural network model to train a low-frequency path transfer condition code structure classification model; adding a path truncation mark and a mark checking instruction in a pile inserting code of the fuzzy tester; and carrying out condition code structure extraction on the test program to obtain a low-frequency path transfer condition code structure, carrying out light-weight source code instrumentation on the corresponding position of the identified structure in the source file, and starting to test the program to be tested.

Most input parsing programs usually have a large number of format checks, and there is a corresponding error handling structure for failure of the check, and such structure usually results in a coverage rate that is difficult to increase. Therefore, the error handling code belongs to a most representative type of low frequency path branch condition code, and refers to a code segment executed when a program input causes a program to be incorrectly exited due to various different reasons. In the extraction of the conditional code structure, in another embodiment of the invention, firstly, the symbolic description facing to the source code data set is defined, all the conditional code structures in the data set are extracted, then, the extracted conditional code structures are subjected to code analysis, and the analysis result is subjected to labeling processing to obtain a code token sequence; and carrying out vector conversion on the token sequence to obtain input data of the classification model. Preferably, all condition code structures in the extracted data set include the following: firstly, preprocessing a source code data set and extracting effective codes; and then, extracting a condition structure set from the effective codes, constructing a stack structure, recording the position corresponding relation between the code segments and the source code data set, and performing iterative processing on the nested condition structure set according to the stack structure to obtain a minimized code segment. Specifically, the descriptors are defined as shown in table 1.

TABLE 1 symbolic description

Firstly to S_oPreprocessing is performed to remove some unnecessary information, such as code comments, line breaks, and the like. Here, R is used for efficient code extraction to obtain S_n. Then to the processed S_nExtraction of I_eThere is proposed a method based on bracket stack balancing, i.e.

C_l0, so that B can be identified by constructing a stack-like structure_lThen C is_s+1 when B is identified_r，C_s1, when C is_sWhen the ratio is 0, mixing_tIs added to I_e. And simultaneously recording the corresponding relation between the code segment and the source file position so as to facilitate the subsequent instrumentation processing. However, after extraction in this way I appears_n，I_nCan lead to false positives because of assumptions

And is

If it is considered to be I_nE is E, then

Creating a contradiction. Therefore, also need to be on_nAn iterative process is performed to ensure that the extracted code fragments are minimized. By comparing the first extracted I_eExtracting in the same manner until each structure is I_r。

In another embodiment of the present invention, the code parsing process includes the following steps: and extracting a code sequence by adopting an abstract syntax tree, performing symbolization processing, grouping according to the meaning of the entries in the code sequence, acquiring a synonym set, and expanding the code sequence.

The extracted code segments are parsed into word sequences, and all segments are parsed into sequences of equal length for input into the LSTM model. The code section parsing stage extracts a code sequence using an Abstract Syntax Tree (AST). And simultaneously performing symbolization processing, such as expressing an integer as num and expressing a character string as str. However, in this classification, the content of the character string has a certain influence. Most error handling code fragments contain a characteristic that if a code fragment contains a character string, the character string usually contains words with similar meanings such as error, fail and the like. Therefore, two ways are used to perform string symbolization, which is divided into whether special keywords are included, specifically denoted as errstr and str. Although a part of the misexpression vocabulary has been extracted by the previous analysis, the number is limited. In the embodiment of the invention, WordNet can be used, an English dictionary established and maintained by Princeton university is grouped by vocabulary entry meanings, each vocabulary entry with the same meaning forms a synonym set, and the vocabulary entry group in the synonym set can be used for expanding wrongly expressed vocabulary. Because of adopting the supervised learning method, each sample needs to be labeled, if the code segment belongs to the error processing code segment, the label is marked as 1, otherwise, the label is marked as 0. Here, a heuristic-based approach is used for labeling, and the following 5 heuristics are summarized through a large number of source file analyses: 1) comparisons are typically included in if (…); 2) if the character string is contained, the character string contains an error expression vocabulary; 3) may contain return or jump keywords such as return, goto, etc.; 4) if the function is contained, the error expression vocabulary is usually contained in the function name; 5) it may contain system error macro definitions such as 'EPERM', 'enont', etc.

In another embodiment of the present invention, vector conversion is performed on token sequences, which includes the following steps: performing text vectorization conversion by using word2vec, and obtaining a word vector model by setting feature vector dimensions and word frequency parameter output; and acquiring a dictionary index dictionary and a word vector dictionary according to the word vector model, and acquiring classification model input data according to the dictionary index dictionary and the word vector dictionary.

Vectorization is carried out by taking the obtained token sequence as input, a tool word2vec widely applied to text vectorization is used for conversion, a word vector model is obtained by setting feature vector dimensions and word frequency parameter output, and a dictionary index dictionary and a word vector dictionary are established according to the obtained model and are used as the input of a subsequent LSTM model. Referring to fig. 3, since different code segments contain different numbers of tokens, but the LSTM can only accept inputs of the same length, a padding and trimming process is required. After vectorization and code segment labeling of the code segments are obtained, training of the LSTM network can be started, and a Dropout layer is added in the model except essential matrixes such as an Embedding layer and an LSTM unit to prevent data overfitting results.

The detection stage is used for detecting the type of a given unknown code segment and outputting a file to which the code segment belongs and the position of the code segment in a source file if the code segment belongs to an error processing code. Given an unknown item, the specific detection process is as follows: 1. extracting an error processing code segment structure in each source file in the project and recording the file name and the position of the error processing code segment structure; 2. analyzing the code segments to obtain respective code sequences; 3. vectorizing the code sequence obtained in the last step according to rules by using a word2vec model obtained by early training; 4. and inputting the obtained vector into a trained LSTM network for judgment.

Further, in another embodiment of the present invention, in the process of adding the path truncation flag and the flag checking instruction, the path truncation flag is inserted into the source code data set, and the path truncation flag is defined in the bss section of the code; and a path truncation marking check is performed at the original stake entry.

The assembly code is embedded in the source code. And a mark continue _ log is set by inserting a mark into the source code to realize the function of canceling the record of the subsequent instrumentation. The flag is defined in the bss segment, and since the bss segment data is uninitialized data and the memory is cleared before each operation, the flag may be set to 1 to indicate that the subsequent basic block is not recorded any more. And (4) canceling the record of the basic block at the time by performing continueLog mark query at the entrance of the original instrumentation and jumping to a return point if the index is 1. The subsequent program always marks 1 in operation, so the subsequent basic block is no longer recorded, i.e. if the original path is 1 → 2 → 3 → …, if the mark is contained at 3, this time recording is 1 → 2. And when the next round of test starts, the mark is cleared, and the execution path can be recorded normally.

And (3) performing instrumentation on the error processing code segment by means of the error processing code segment identified by the error code classification model obtained by utilizing LSTM network training in the early stage and combining an index established by the structure and the position of the if-else in the source file. The method is realized by adopting an internal and external instrumentation mode, because if statements are compiled into conditional jump instructions, and instrumentation of the fuzzy tester is judged by the conditional jump instructions, before the if statements, namely, source code instrumentation outside an if structure can influence whether subsequent basic blocks are instrumented or not. Before the first statement of the if structure, that is, performing source code instrumentation in the if structure, the recording mode of the subsequent instrumented basic block may be determined, because the first statement belongs to the beginning of the jump basic block, and a conditional jump instruction may still exist subsequently, the first statement will determine to traverse the recording results of all the subsequent basic blocks on the path of the basic block. The following three cases are mainly distinguished:

the first case is where the error handling code is located in the if structure, in which case it is only necessary to do instrumentation before the if statement and before the first statement in the if structure.

The second case is where the error handling code is located in the else structure, in which case it needs to be instrumented before the first statement in the else structure and the adjacent if statement before the else structure. If the preamble structure is an else if structure, the else if structure needs to be converted into an if structure, and then instrumentation is performed before if.

In the third case, the error processing code is located in the else if structure, and the instrumentation is performed before the if statement and before the first statement in the else if structure after the structure conversion processing is also performed. When the else if structure is instrumented, the original code structure is damaged, so that the repair is needed to ensure the code to be compiled and run correctly. The source code structure repair algorithm in the embodiment of the invention can be designed as follows:

the algorithm can realize accurate compiling and running of the program.

Aiming at a test program, the error processing code segments identified by a classification model in a training process are inserted by combining a condition code structure and a position index in a source code data set in an internal and external insertion mode. Preferably, for the situation that the path truncation mark is behind the instrumentation code, the original instrumentation of the current condition is cancelled, the instrumentation is performed by adopting a null instruction, and an annotation mark is added at the instruction annotation position. Preferably, for the compiling optimization problem of the same conditional statement in different source code data sets, the marking branch is set to cancel the instrumentation.

If the continue _ log mark causes the basic block to be recorded after the instrumentation code, although the subsequent code block cannot be recorded continuously because the flag bit is already set to 1, the basic block record is still considered to generate a new path, so that the test sample is retained and the purpose of the targeted test is not achieved. It is therefore necessary in this case to cancel the original instrumentation of the current conditional jump. Null instructions (nop) are instrumented and tagged at instruction annotations, allowing for normal execution flow that does not affect the program. Since the assembly code comments are not cleared after the source code is compiled into assembly code, the decision can be made using the comment tag. And when meeting the annotation mark, assigning the instrumentation mark, then judging the instrumentation according to the instrumentation mark when meeting the conditional jump instruction, then clearing the instrumentation mark, and circulating the process until all code instrumentation is finished. Therefore, the subsequent code block skipping instrumentation process is realized. Due to the problem of compilation optimization, that is, when the same if (i | ═ 0) statement is assembled in different source files, cases such as jz and jnz may exist, and the original instrumentation only performs instrumentation at a negative jump. Therefore, the above situation is dealt with by adopting a mode of canceling the insertion at all the marked branches, and although one effective basic block record is canceled, the information of the effective path is not influenced.

Based on the software fuzzing test method, an embodiment of the present invention further provides a software fuzzing test apparatus based on path record truncation, as shown in fig. 4, including: a training module 101, a labeling module 102, and a testing module 103, wherein,

the training module 101 is used for constructing a project data set, extracting a condition code structure, performing model input data processing on the extracted condition code structure, and performing model training as the input of a low-frequency path transfer condition code structure classification model, wherein the classification model adopts an LSTM network model structure;

the marking module 102 is used for adding a path truncation mark and a mark checking instruction in the instrumentation code of the fuzzy tester;

the test module 103 is configured to extract a condition code structure and perform model input data processing for a program to be tested, input the processed data as a trained classification model, identify a low-frequency path transfer condition code structure, perform source code level instrumentation at a corresponding position in a source file, perform path truncation according to a path truncation flag, and complete a source code fuzzy test.

Unless specifically stated otherwise, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present invention.

Based on the foregoing method, an embodiment of the present invention further provides a server, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described above.

Based on the above method, the embodiment of the present invention further provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the above method.

The device provided by the embodiment of the present invention has the same implementation principle and technical effect as the method embodiments, and for the sake of brief description, reference may be made to the corresponding contents in the method embodiments without reference to the device embodiments.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. a software fuzzing method based on path record truncation, is characterized in that, comprises following content:

A) Construct the project data set and extract the condition code structure. After the extracted condition code structure is processed into the model input data, it is used as the input of the low-frequency path transition condition code structure classification model for model training. The classification model adopts the LSTM network model structure. ;

B) Add path truncation marks and mark checking instructions in the fuzzer instrumentation code;

C) For the program to be tested, extract the conditional code structure and process the model input data, use the processed data as the input of the trained classification model, identify the low-frequency path transfer conditional code structure, and execute the source code in the corresponding position in the source file Advanced instrumentation, path truncation is performed according to the path truncation mark, and source code fuzzing is completed.

2. The software fuzzing method based on path record truncation according to claim 1, characterized in that, A) in extracting the conditional code structure, first define the symbol description oriented to the source code data set and extract all the conditional code structures in the data set, Then perform code analysis on the extracted conditional code structure, and perform tagging processing on the analysis result to obtain the code token sequence; perform vector transformation on the token sequence to obtain the input data of the classification model.

3. The software fuzzing method based on path record truncation according to claim 2, characterized in that, in A), all conditional code structures in the data set are extracted, including the following content: first, the source data set is preprocessed, extracted Valid code; then, extract the conditional structure set from the valid code, build a stack structure, record the corresponding relationship between the code segment and the source data set, and iteratively process the nested conditional structure set according to the stack structure to obtain the minimized code segment .

4. The software fuzzing method based on path record truncation according to claim 2, characterized in that, in A), the code parsing process comprises the following content: using abstract syntax tree to extract code sequence and symbolize processing, and simultaneously according to code The meanings of the entries in the sequence are grouped, the synonym set is obtained, and the code sequence is expanded.

5. The software fuzzing method based on path record truncation according to claim 2, wherein in A), vector conversion is performed on the token sequence, including the following content: using word2vec to perform text vectorization conversion, by setting the feature The vector dimension and word frequency parameters are output to obtain the word vector model; the dictionary index dictionary and the word vector dictionary are obtained according to the word vector model, and the classification model input data is obtained according to the dictionary index dictionary and the word vector dictionary.

6. The software fuzzing method based on path record truncation according to claim 1, wherein in B), in the process of adding a path truncation mark and a mark check instruction, by inserting a path truncation mark in the source data set, the path The truncation marker is defined in the bss section of the code; and the path truncation marker is checked at the original instrumentation entry.

7. The software fuzzing method based on path record truncation according to claim 6, characterized in that, in C), for the test program, rely on the error handling code segment identified by the classification model in the training process, and combine with the source code data set. Condition code structure and position index, using internal and external instrumentation to instrument the error handling code segment.

8. The software fuzzing method based on path record truncation according to claim 7, wherein in C), for the situation where the path truncation mark is behind the instrumentation code, the original instrumentation of the current condition is cancelled, and an empty instruction is adopted Do instrumentation and add comment markers to directive comments.

9. The software fuzzing method based on path record truncation according to claim 7, wherein in C), for the compilation optimization problem of the same conditional statement in different source data sets, it is set to cancel the insertion at the marked branch. pile.

10. A software fuzzing device based on path record truncation, characterized in that it comprises: a training module, a marking module and a testing module, wherein,

The training module is used to construct the project data set and extract the conditional code structure. After the extracted conditional code structure is processed into the model input data, it is used as the input of the low-frequency path transition conditional code structure classification model for model training. The classification model adopts LSTM. network model structure;

Marker module, which is used to add path truncation markers and marker checking instructions in the fuzzer instrumented code;

The test module is used to extract the conditional code structure and process the model input data for the program to be tested, and use the processed data as the input of the trained classification model, identify the low-frequency path transfer conditional code structure, and put it in the corresponding position in the source file. Perform source code-level instrumentation, truncate paths according to path truncation marks, and complete source code fuzzing.