Disclosure of Invention
Accordingly, there is a need for a method and system for classifying file fragments.
The invention provides a file fragment classification method, which comprises the following steps: a. constructing a file fragment data set from a file data set, wherein the file fragment data set comprises a training set and a test set; b. preprocessing the constructed file fragment data set; c. constructing a deep convolutional neural network model; d. training and evaluating the constructed deep convolutional neural network model using the preprocessed training set and test set; and e. predicting the file type to which a file fragment belongs using the deep convolutional neural network model.
Wherein, the step a specifically comprises:
decompressing all zip archive files contained in the public file data set govdocs1, and dividing the files in the decompressed folders into different categories according to the file types to which they belong;
dividing the selected files corresponding to the file types under study into two groups, to generate file fragments for the training set and the test set respectively;
and slicing each file according to the selected file fragment size to generate a large number of file fragments, deleting the first fragment of each file, and deleting the last fragment of each file when it is smaller than the specified fragment size.
The step b specifically comprises the following steps:
converting each file fragment in the generated training set and test set, whereby a one-dimensional file fragment is turned into a two-dimensional grayscale image through a simple shape change;
and normalizing each two-dimensional grayscale image: the maximum and minimum values of the pixel at each position are computed over the training set, and the corresponding pixels in both the training set and the test set are scaled according to these training-set maxima and minima so that the gray values of the pixels fall between -1 and 1.
The deep convolutional neural network model comprises L convolution blocks, a global average pooling layer and two fully connected layers.
Each convolution block comprises a convolutional layer, a residual unit and a max-pooling layer.
The number of convolution blocks L is limited by the size of the converted grayscale image:
Lmax = min(log2(max(w, h)) - 1, log2(min(w, h)))
where Lmax denotes the maximum number of convolution blocks that can be stacked in the model, and w and h denote the width and height, respectively, of the converted two-dimensional grayscale image.
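By way of illustration only, this relationship can be checked with a short Python sketch; the function name max_conv_blocks is hypothetical and not part of the claimed method.

```python
import math

def max_conv_blocks(w: int, h: int) -> int:
    """Maximum number of stackable convolution blocks, per
    Lmax = min(log2(max(w, h)) - 1, log2(min(w, h)))."""
    return min(int(math.log2(max(w, h))) - 1, int(math.log2(min(w, h))))

# Worked examples: a 16x32 image allows min(5 - 1, 4) = 4 blocks,
# and a 64x64 image allows min(6 - 1, 6) = 5 blocks.
print(max_conv_blocks(32, 16))  # 4
print(max_conv_blocks(64, 64))  # 5
```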
The convolutional layer uses d convolution kernels of size 1×1; assuming that C feature maps of size I×J are input to the convolution block, the convolutional layer up-samples the number of channels of the input feature maps from C to d.
The residual unit comprises two convolutional layers and uses a skip connection following the residual learning method.
The max-pooling layer performs spatial down-sampling on each input feature map, halving its width and height, i.e., reducing an I×J feature map to I/2 × J/2, one quarter of its original size.
The step d specifically comprises the following steps:
training the deep convolutional neural network using the preprocessed training set, and evaluating it using the preprocessed test set, wherein the evaluation indexes include the average classification accuracy over a plurality of file fragment categories, the macro-averaged F1 score and the micro-averaged F1 score.
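By way of illustration only, these evaluation indexes could be computed with scikit-learn as sketched below; the use of scikit-learn and the toy labels are assumptions, not part of the claimed method.

```python
from sklearn.metrics import accuracy_score, f1_score

# y_true / y_pred: integer file-type labels for the test-set fragments (toy values).
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 1]

accuracy = accuracy_score(y_true, y_pred)             # classification accuracy
macro_f1 = f1_score(y_true, y_pred, average="macro")  # macro-averaged F1 score
micro_f1 = f1_score(y_true, y_pred, average="micro")  # micro-averaged F1 score
print(accuracy, macro_f1, micro_f1)
```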
The invention further provides a file fragment classification system, which comprises a fragment data set construction module, a preprocessing module, a model construction module, a training evaluation module and a file type prediction module, wherein: the fragment data set construction module is used for constructing a file fragment data set from a file data set, the file fragment data set comprising a training set and a test set; the preprocessing module is used for preprocessing the constructed file fragment data set; the model construction module is used for constructing a deep convolutional neural network model; the training evaluation module is used for training and evaluating the constructed deep convolutional neural network model using the preprocessed training set and test set; and the file type prediction module is used for predicting the file type to which a file fragment belongs using the deep convolutional neural network model.
The application provides a file fragment classification method and system in which prediction requires only converting the input file fragment into a two-dimensional grayscale image and feeding it into the model. Converting a file fragment into a two-dimensional grayscale image requires no extra computation. When predicting the type of a file fragment, the method judges entirely on the basis of the fragment's content, without any other prior knowledge. The method learns features automatically and directly from the input file fragments, without manually extracting features from the fragments before modeling. In addition, the deep convolutional neural network designed by the invention is applicable to classification tasks for file fragments of different sizes. Because it adopts a residual structure, a deeper network model can be built, which is suited to file fragment classification tasks of different sizes, effectively improves the classification accuracy of file fragments, and achieves a better classification effect.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flowchart illustrating the operation of the file fragment classification method according to the preferred embodiment of the present invention.
In step S1, a file fragment data set is constructed using a file data set. The file fragment data set comprises a training set and a test set. Specifically:
in this embodiment, the file fragment data set is generated using the public file data set govdocs1, which contains 1000 zip archive files. All zip archives contained in the file data set are decompressed, and the files in the decompressed folders are divided into different categories according to their file types.
For the file fragment types to be studied, a certain number of files are selected for the experiments. The selected files corresponding to each file type under study are divided in a 6:4 ratio into two groups, used to generate file fragments for the training set and the test set respectively.
Each file is sliced according to the selected fragment size to generate a large number of file fragments. To avoid the file signature in the file header, which could be used to identify the file type, the first fragment of each file is deleted; the last fragment of each file is likewise deleted when it is smaller than the specified fragment size. For both the training set and the test set, the number of fragments corresponding to each file type is limited by random sampling so as to balance the data set as much as possible, yielding a large number of fragments of different file types for training and for testing.
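By way of illustration only, the slicing rule can be sketched in Python as follows; the helper name slice_file and the directory layout are hypothetical.

```python
from pathlib import Path

def slice_file(path: Path, frag_size: int) -> list[bytes]:
    """Slice one file into frag_size-byte fragments, discarding the first
    fragment (it may carry a type-revealing header signature) and a
    trailing fragment smaller than frag_size."""
    data = path.read_bytes()
    frags = [data[i:i + frag_size] for i in range(0, len(data), frag_size)]
    frags = frags[1:]                      # delete the first fragment of the file
    if frags and len(frags[-1]) < frag_size:
        frags = frags[:-1]                 # delete an undersized last fragment
    return frags

# e.g. all 512-byte fragments of the selected PDF training files
train_frags = [f for p in Path("train/pdf").glob("*.pdf")
               for f in slice_file(p, 512)]
```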
In step S2, the constructed file fragment data set, i.e., the training set and the test set, is preprocessed. Specifically:
each file fragment in the generated training set and test set is converted; a one-dimensional file fragment can be turned into a two-dimensional grayscale image through a simple shape change, see FIG. 2. A file fragment consists of a byte sequence, and each byte corresponds to one pixel in the two-dimensional grayscale image. When converting a file fragment (a one-dimensional byte sequence) into a two-dimensional grayscale image, the shape of the grayscale image should be as close to a square as possible, so that a model deep enough for classifying the file fragments can be built.
In the present embodiment, a 512-byte file fragment is converted into a 16×32 two-dimensional grayscale image (16×32 = 512), and a 4096-byte file fragment is converted into a 64×64 two-dimensional grayscale image (64×64 = 4096).
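The specification does not state how these shapes are chosen; one reading consistent with both examples is to take the most nearly square power-of-two factorization of the fragment size, as in this hypothetical sketch.

```python
import math

def near_square_shape(n: int) -> tuple[int, int]:
    """Return (h, w) with h * w == n and the shape as square as possible,
    assuming n is a power of two (e.g. 512 -> (16, 32), 4096 -> (64, 64))."""
    h = 2 ** (int(math.log2(n)) // 2)
    return h, n // h

print(near_square_shape(512))   # (16, 32)
print(near_square_shape(4096))  # (64, 64)
```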
Finally, each two-dimensional grayscale image is normalized: the maximum and minimum values of the pixel at each position are computed over the training set, and the corresponding pixels in both the training set and the test set are scaled according to these training-set maxima and minima so that the gray values of the pixels fall between -1 and 1.
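By way of illustration only, the conversion and per-position min-max normalization can be sketched with NumPy; the array and function names are illustrative.

```python
import numpy as np

def to_gray_image(fragment: bytes, h: int, w: int) -> np.ndarray:
    """Reinterpret a one-dimensional byte sequence as an h x w grayscale image."""
    return np.frombuffer(fragment, dtype=np.uint8).reshape(h, w).astype(np.float32)

# train_imgs, test_imgs: arrays of shape (N, h, w) built with to_gray_image.
def normalize(train_imgs, test_imgs, eps=1e-8):
    """Scale the pixel at each position to [-1, 1] using training-set statistics."""
    p_min = train_imgs.min(axis=0)        # per-position minimum over the training set
    p_max = train_imgs.max(axis=0)        # per-position maximum over the training set
    scale = np.maximum(p_max - p_min, eps)
    norm = lambda x: 2.0 * (x - p_min) / scale - 1.0
    return norm(train_imgs), norm(test_imgs), (p_min, p_max)
```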
In step S3, a deep convolutional neural network model is constructed. Specifically:
As shown in FIG. 3, the deep convolutional neural network model includes L convolution blocks, a global average pooling layer and two fully connected layers. The ReLU (Rectified Linear Unit) shown in FIG. 3 is an activation function.
The structure of each convolution block is shown in FIG. 4 and includes three parts: a convolutional layer, a residual unit and a max-pooling layer. The convolutional layer uses d convolution kernels of size 1×1; assuming that C feature maps of size I×J are input to the convolution block, the convolutional layer up-samples the number of channels of the input feature maps (increasing it from C to d). The residual unit performs feature learning, and the max-pooling layer performs spatial down-sampling on each input feature map, halving its width and height, i.e., reducing an I×J feature map to I/2 × J/2; the number of feature maps remains unchanged.
The number of convolution blocks L is limited by the size of the converted grayscale image, as follows:
Lmax = min(log2(max(w, h)) - 1, log2(min(w, h)))
where Lmax denotes the maximum number of convolution blocks that can be stacked in the model, and w and h denote the width and height, respectively, of the converted two-dimensional grayscale image.
The structure of the residual unit is shown in FIG. 5. The residual unit comprises two convolutional layers joined by a skip connection following the residual learning method. Both convolutional layers use d convolution kernels of size 3×3 to learn features of the input feature maps. The input feature map is passed through the ReLU activation function before being fed into each of the two convolutional layers.
Both fully connected layers of the model have 2048 neurons.
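By way of illustration only, the structure described above admits, for example, the following PyTorch sketch; the per-block channel widths and the final classification layer are assumptions not fixed by the specification.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Two 3x3 convolutions with a skip connection; ReLU is applied to the
    input of each convolution, as described for FIG. 5."""
    def __init__(self, d: int):
        super().__init__()
        self.conv1 = nn.Conv2d(d, d, 3, padding=1)
        self.conv2 = nn.Conv2d(d, d, 3, padding=1)

    def forward(self, x):
        out = self.conv1(torch.relu(x))
        out = self.conv2(torch.relu(out))
        return x + out                           # skip connection (residual learning)

class ConvBlock(nn.Module):
    """1x1 convolution (channels C -> d), residual unit, 2x2 max pooling."""
    def __init__(self, c_in: int, d: int):
        super().__init__()
        self.channel_up = nn.Conv2d(c_in, d, 1)  # up-samples the channel count
        self.res = ResidualUnit(d)
        self.pool = nn.MaxPool2d(2)              # halves width and height

    def forward(self, x):
        return self.pool(self.res(self.channel_up(x)))

class FragmentNet(nn.Module):
    def __init__(self, num_classes: int, L: int = 4, widths=(32, 64, 128, 256)):
        super().__init__()
        chans = [1] + list(widths[:L])           # grayscale input has 1 channel
        self.blocks = nn.Sequential(
            *[ConvBlock(chans[i], chans[i + 1]) for i in range(L)])
        self.gap = nn.AdaptiveAvgPool2d(1)       # global average pooling
        self.fc = nn.Sequential(                 # two fully connected layers, 2048 neurons each
            nn.Linear(chans[L], 2048), nn.ReLU(),
            nn.Linear(2048, 2048), nn.ReLU())
        self.head = nn.Linear(2048, num_classes) # assumed classification layer

    def forward(self, x):                        # x: (N, 1, h, w), e.g. (N, 1, 16, 32)
        z = self.gap(self.blocks(x)).flatten(1)
        return self.head(self.fc(z))
```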
Although the present application constructs the model structures shown in FIG. 3, FIG. 4 and FIG. 5 based on certain practical considerations, and gives parameters for the relevant parts of the model, the present invention should be limited neither to these model structures nor to these parameters.
In step S4, the constructed deep convolutional neural network model is trained and evaluated using the preprocessed training set and test set. The evaluation indexes include the average classification accuracy over a plurality of file fragment categories, the macro-averaged F1 score and the micro-averaged F1 score. Specifically:
in this embodiment:
The deep convolutional neural network is trained by an Adam-based gradient descent method. Here, the initial learning rate is set to 0.001, the learning rate is decayed by a factor of 0.2 every 5 epochs, and the total number of training epochs is set to 40. In addition, an early-stopping technique is adopted: when the evaluation indexes of the deep convolutional neural network on the test set show no improvement for 5 consecutive epochs, training is stopped early and the current model parameters are taken as the optimal parameters of the deep convolutional neural network.
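By way of illustration only, this schedule can be sketched in PyTorch as follows; the data loaders and the evaluate function are assumed to exist, and multiplying the learning rate by 0.2 every 5 epochs is one reading of the decay rule.

```python
import torch

def train(model, train_loader, test_loader, evaluate, epochs=40, patience=5):
    """Adam with initial lr 0.001, lr multiplied by 0.2 every 5 epochs, and
    early stopping after `patience` epochs without test-set improvement;
    per the description, the parameters at the stopping point are kept."""
    opt = torch.optim.Adam(model.parameters(), lr=0.001)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.2)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_metric, stall = float("-inf"), 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        sched.step()
        metric = evaluate(model, test_loader)   # monitored evaluation index
        if metric > best_metric:
            best_metric, stall = metric, 0
        else:
            stall += 1
            if stall >= patience:               # no improvement for 5 epochs
                break                           # stop early; current parameters kept
    return model
```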
In step S5, the file type to which a file fragment belongs is predicted using the deep convolutional neural network model. Specifically:
given a file fragment to be predicted, the fragment is converted into a two-dimensional grayscale image as in step S2, and the converted grayscale image is then normalized.
Specifically, the gray value of each pixel is scaled to between -1 and 1 according to the maximum and minimum values of the pixel at the corresponding position in the training-set grayscale images, and the normalized two-dimensional grayscale image is then input into the deep convolutional neural network model to predict the file type to which the file fragment belongs.
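By way of illustration only, the prediction step can be sketched as follows, reusing the hypothetical training-set statistics (p_min, p_max) from the preprocessing sketch above.

```python
import numpy as np
import torch

def predict_type(model, fragment: bytes, h: int, w: int, p_min, p_max, eps=1e-8):
    """Convert a fragment to a normalized grayscale image and predict its type."""
    img = np.frombuffer(fragment, dtype=np.uint8).reshape(h, w).astype(np.float32)
    img = 2.0 * (img - p_min) / np.maximum(p_max - p_min, eps) - 1.0
    x = torch.from_numpy(img)[None, None]       # shape (1, 1, h, w)
    model.eval()
    with torch.no_grad():
        return model(x).argmax(dim=1).item()    # index of the predicted file type
```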
Referring to FIG. 6, a hardware architecture diagram of the file fragment classification system 10 of the present invention is shown. The system comprises: a fragment data set construction module 101, a preprocessing module 102, a model construction module 103, a training evaluation module 104 and a file type prediction module 105.
The fragment data set construction module 101 is configured to construct a file fragment data set using a file data set. The file fragment data set comprises a training set and a test set. Specifically:
in this embodiment, the fragment data set construction module 101 generates the file fragment data set using the public file data set govdocs1, which contains 1000 zip archive files. All zip archives contained in the file data set are decompressed, and the files in the decompressed folders are divided into different categories according to their file types.
For the file fragment types to be studied, a certain number of files are selected for the experiments. The selected files corresponding to each file type under study are divided in a 6:4 ratio into two groups, used to generate file fragments for the training set and the test set respectively.
The fragment data set construction module 101 slices each file according to the selected fragment size to generate a large number of file fragments. To avoid the file signature in the file header, which could be used to identify the file type, the first fragment of each file is deleted; the last fragment of each file is likewise deleted when it is smaller than the specified fragment size. For both the training set and the test set, the number of fragments corresponding to each file type is limited by random sampling so as to balance the data set as much as possible, yielding a large number of fragments of different file types for training and for testing.
The preprocessing module 102 is configured to preprocess the constructed file fragment data set, i.e., the training set and the test set. Specifically:
the preprocessing module 102 converts each file fragment in the generated training set and test set; a one-dimensional file fragment is turned into a two-dimensional grayscale image through a simple shape change, see FIG. 2. A file fragment consists of a byte sequence, and each byte corresponds to one pixel in the two-dimensional grayscale image. When converting a file fragment (a one-dimensional byte sequence) into a two-dimensional grayscale image, the shape of the grayscale image should be as close to a square as possible, so that a model deep enough for classifying the file fragments can be built.
In this embodiment, the preprocessing module 102 converts a 512-byte file fragment into a 16×32 two-dimensional grayscale image (16×32 = 512), and a 4096-byte file fragment into a 64×64 two-dimensional grayscale image (64×64 = 4096).
Finally, the preprocessing module 102 normalizes each two-dimensional grayscale image: the maximum and minimum values of the pixel at each position are computed over the training set, and the corresponding pixels in both the training set and the test set are scaled according to these training-set maxima and minima so that the gray values of the pixels fall between -1 and 1.
The model construction module 103 is configured to construct a deep convolutional neural network model. Specifically:
As shown in FIG. 3, the deep convolutional neural network model includes L convolution blocks, a global average pooling layer and two fully connected layers. The ReLU (Rectified Linear Unit) shown in FIG. 3 is an activation function.
The structure of each convolution block is shown in FIG. 4 and includes three parts: a convolutional layer, a residual unit and a max-pooling layer. The convolutional layer uses d convolution kernels of size 1×1; assuming that C feature maps of size I×J are input to the convolution block, the convolutional layer up-samples the number of channels of the input feature maps (increasing it from C to d). The residual unit performs feature learning, and the max-pooling layer performs spatial down-sampling on each input feature map, halving its width and height, i.e., reducing an I×J feature map to I/2 × J/2; the number of feature maps remains unchanged.
The number of convolution blocks L is limited by the size of the converted grayscale image, as follows:
Lmax = min(log2(max(w, h)) - 1, log2(min(w, h)))
where Lmax denotes the maximum number of convolution blocks that can be stacked in the model, and w and h denote the width and height, respectively, of the converted two-dimensional grayscale image.
The structure of the residual unit is shown in FIG. 5. The residual unit comprises two convolutional layers joined by a skip connection following the residual learning method. Both convolutional layers use d convolution kernels of size 3×3 to learn features of the input feature maps. The input feature map is passed through the ReLU activation function before being fed into each of the two convolutional layers.
Both fully connected layers of the model have 2048 neurons.
Although the present application constructs the model structures shown in FIG. 3, FIG. 4 and FIG. 5 based on certain practical considerations, and gives parameters for the relevant parts of the model, the present invention should be limited neither to these model structures nor to these parameters.
The training evaluation module 104 is configured to train and evaluate the constructed deep convolutional neural network model using the preprocessed training set and test set. The evaluation indexes include the average classification accuracy over a plurality of file fragment categories, the macro-averaged F1 score and the micro-averaged F1 score. Specifically:
in this embodiment:
The training evaluation module 104 trains the deep convolutional neural network by an Adam-based gradient descent method. Here, the initial learning rate is set to 0.001, the learning rate is decayed by a factor of 0.2 every 5 epochs, and the total number of training epochs is set to 40. In addition, an early-stopping technique is adopted: when the evaluation indexes of the deep convolutional neural network on the test set show no improvement for 5 consecutive epochs, training is stopped early and the current model parameters are taken as the optimal parameters of the deep convolutional neural network.
The file type prediction module 105 is configured to predict the file type to which a file fragment belongs using the deep convolutional neural network model. Specifically:
given a file fragment to be predicted, the file type prediction module 105 converts the fragment into a two-dimensional grayscale image and then normalizes the converted grayscale image.
Specifically, the file type prediction module 105 scales the gray value of each pixel to between -1 and 1 according to the maximum and minimum values of the pixel at the corresponding position in the training-set grayscale images, and then inputs the normalized two-dimensional grayscale image into the deep convolutional neural network model to predict the file type to which the file fragment belongs.
Although the present invention has been described with reference to the presently preferred embodiments, it will be understood by those skilled in the art that the foregoing description is illustrative only and is not intended to limit the scope of the invention, as claimed.