Disclosure of Invention
Accordingly, there is a need for a method and system for classifying file fragments.
The invention provides a file fragment classification method, which comprises the following steps: a. constructing a file fragment data set from a file data set, wherein the file fragment data set comprises a training set and a test set; b. preprocessing the constructed file fragment data set; c. constructing a deep convolutional neural network model; d. training and evaluating the constructed deep convolutional neural network model using the preprocessed training set and test set; and e. predicting the file type to which a file fragment belongs using the deep convolutional neural network model.
Wherein, the step a specifically comprises:
decompressing all zip archive files contained in the public file data set govdocs1, and dividing the files in the decompressed folders into different categories according to the file types to which they belong;
dividing the selected files corresponding to the file types under study into two groups, to generate file fragments for the training set and the test set respectively;
and slicing each file according to the selected file fragment size to generate a large number of file fragments, deleting the first fragment of each file, and deleting the last fragment of each file when it is smaller than the specified fragment size.
The step b specifically comprises the following steps:
converting each file fragment in the generated training set and test set, whereby a one-dimensional file fragment is turned into a two-dimensional grayscale image through a simple shape change;
and normalizing each two-dimensional grayscale image: the maximum and minimum values of the pixel at each position are computed over the training set, and the corresponding pixels in both the training set and the test set are scaled according to these training-set maxima and minima so that the gray values of the pixels fall between -1 and 1.
The deep convolutional neural network model comprises L convolution blocks, a global average pooling layer and two fully connected layers.
Each convolution block comprises a convolutional layer, a residual unit and a max-pooling layer.
The number of convolution blocks L is limited by the size of the converted grayscale image:
Lmax = min(log2(max(w, h)) - 1, log2(min(w, h)))
where Lmax denotes the maximum number of convolution blocks that can be stacked in the model, and w and h denote the width and height, respectively, of the converted two-dimensional grayscale image.
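By way of illustration only, this relationship can be checked with a short Python sketch; the function name max_conv_blocks is hypothetical and not part of the claimed method.

```python
import math

def max_conv_blocks(w: int, h: int) -> int:
    """Maximum number of stackable convolution blocks, per
    Lmax = min(log2(max(w, h)) - 1, log2(min(w, h)))."""
    return min(int(math.log2(max(w, h))) - 1, int(math.log2(min(w, h))))

# Worked examples: a 16x32 image allows min(5 - 1, 4) = 4 blocks,
# and a 64x64 image allows min(6 - 1, 6) = 5 blocks.
print(max_conv_blocks(32, 16))  # 4
print(max_conv_blocks(64, 64))  # 5
```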
The convolutional layer uses d convolution kernels of size 1×1; assuming that C feature maps of size I×J are input to the convolution block, the convolutional layer up-samples the number of channels of the input feature maps from C to d.
The residual unit comprises two convolutional layers and uses a skip connection following the residual learning method.
The max-pooling layer performs spatial down-sampling on each input feature map, halving its width and height, i.e., reducing an I×J feature map to I/2 × J/2, one quarter of its original size.
The step d specifically comprises the following steps:
training the deep convolutional neural network using the preprocessed training set, and evaluating it using the preprocessed test set, wherein the evaluation indexes include the average classification accuracy over a plurality of file fragment categories, the macro-averaged F1 score and the micro-averaged F1 score.
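By way of illustration only, these evaluation indexes could be computed with scikit-learn as sketched below; the use of scikit-learn and the toy labels are assumptions, not part of the claimed method.

```python
from sklearn.metrics import accuracy_score, f1_score

# y_true / y_pred: integer file-type labels for the test-set fragments (toy values).
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 1]

accuracy = accuracy_score(y_true, y_pred)             # classification accuracy
macro_f1 = f1_score(y_true, y_pred, average="macro")  # macro-averaged F1 score
micro_f1 = f1_score(y_true, y_pred, average="micro")  # micro-averaged F1 score
print(accuracy, macro_f1, micro_f1)
```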
The invention further provides a file fragment classification system, which comprises a fragment data set construction module, a preprocessing module, a model construction module, a training evaluation module and a file type prediction module, wherein: the fragment data set construction module is used for constructing a file fragment data set from a file data set, the file fragment data set comprising a training set and a test set; the preprocessing module is used for preprocessing the constructed file fragment data set; the model construction module is used for constructing a deep convolutional neural network model; the training evaluation module is used for training and evaluating the constructed deep convolutional neural network model using the preprocessed training set and test set; and the file type prediction module is used for predicting the file type to which a file fragment belongs using the deep convolutional neural network model.
The application provides a file fragment classification method and system in which prediction requires only converting the input file fragment into a two-dimensional grayscale image and feeding it into the model. Converting a file fragment into a two-dimensional grayscale image requires no extra computation. When predicting the type of a file fragment, the method judges entirely on the basis of the fragment's content, without any other prior knowledge. The method learns features automatically and directly from the input file fragments, without manually extracting features from the fragments before modeling. In addition, the deep convolutional neural network designed by the invention is applicable to classification tasks for file fragments of different sizes. Because it adopts a residual structure, a deeper network model can be built, which is suited to file fragment classification tasks of different sizes, effectively improves the classification accuracy of file fragments, and achieves a better classification effect.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flowchart illustrating the operation of the file fragment classification method according to the preferred embodiment of the present invention.
In step S1, a file fragment data set is constructed using a file data set. The file fragment data set comprises a training set and a test set. Specifically:
in this embodiment, the file fragment data set is generated using the public file data set govdocs1, which contains 1000 zip archive files. All zip archives contained in the file data set are decompressed, and the files in the decompressed folders are divided into different categories according to their file types.
For the file fragment types to be studied, a certain number of files are selected for the experiments. The selected files corresponding to each file type under study are divided in a 6:4 ratio into two groups, used to generate file fragments for the training set and the test set respectively.
Each file is sliced according to the selected fragment size to generate a large number of file fragments. To avoid the file signature in the file header, which could be used to identify the file type, the first fragment of each file is deleted; the last fragment of each file is likewise deleted when it is smaller than the specified fragment size. For both the training set and the test set, the number of fragments corresponding to each file type is limited by random sampling so as to balance the data set as much as possible, yielding a large number of fragments of different file types for training and for testing.
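By way of illustration only, the slicing rule can be sketched in Python as follows; the helper name slice_file and the directory layout are hypothetical.

```python
from pathlib import Path

def slice_file(path: Path, frag_size: int) -> list[bytes]:
    """Slice one file into frag_size-byte fragments, discarding the first
    fragment (it may carry a type-revealing header signature) and a
    trailing fragment smaller than frag_size."""
    data = path.read_bytes()
    frags = [data[i:i + frag_size] for i in range(0, len(data), frag_size)]
    frags = frags[1:]                      # delete the first fragment of the file
    if frags and len(frags[-1]) < frag_size:
        frags = frags[:-1]                 # delete an undersized last fragment
    return frags

# e.g. all 512-byte fragments of the selected PDF training files
train_frags = [f for p in Path("train/pdf").glob("*.pdf")
               for f in slice_file(p, 512)]
```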
In step S2, the constructed file fragment data set, i.e., the training set and the test set, is preprocessed. Specifically:
each file fragment in the generated training set and test set is converted; a one-dimensional file fragment can be turned into a two-dimensional grayscale image through a simple shape change, see FIG. 2. A file fragment consists of a byte sequence, and each byte corresponds to one pixel in the two-dimensional grayscale image. When converting a file fragment (a one-dimensional byte sequence) into a two-dimensional grayscale image, the shape of the grayscale image should be as close to a square as possible, so that a model deep enough for classifying the file fragments can be built.
In the present embodiment, a 512-byte file fragment is converted into a 16×32 two-dimensional grayscale image (16×32 = 512), and a 4096-byte file fragment is converted into a 64×64 two-dimensional grayscale image (64×64 = 4096).
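The specification does not state how these shapes are chosen; one reading consistent with both examples is to take the most nearly square power-of-two factorization of the fragment size, as in this hypothetical sketch.

```python
import math

def near_square_shape(n: int) -> tuple[int, int]:
    """Return (h, w) with h * w == n and the shape as square as possible,
    assuming n is a power of two (e.g. 512 -> (16, 32), 4096 -> (64, 64))."""
    h = 2 ** (int(math.log2(n)) // 2)
    return h, n // h

print(near_square_shape(512))   # (16, 32)
print(near_square_shape(4096))  # (64, 64)
```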
Finally, each two-dimensional grayscale image is normalized: the maximum and minimum values of the pixel at each position are computed over the training set, and the corresponding pixels in both the training set and the test set are scaled according to these training-set maxima and minima so that the gray values of the pixels fall between -1 and 1.
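By way of illustration only, the conversion and per-position min-max normalization can be sketched with NumPy; the array and function names are illustrative.

```python
import numpy as np

def to_gray_image(fragment: bytes, h: int, w: int) -> np.ndarray:
    """Reinterpret a one-dimensional byte sequence as an h x w grayscale image."""
    return np.frombuffer(fragment, dtype=np.uint8).reshape(h, w).astype(np.float32)

# train_imgs, test_imgs: arrays of shape (N, h, w) built with to_gray_image.
def normalize(train_imgs, test_imgs, eps=1e-8):
    """Scale the pixel at each position to [-1, 1] using training-set statistics."""
    p_min = train_imgs.min(axis=0)        # per-position minimum over the training set
    p_max = train_imgs.max(axis=0)        # per-position maximum over the training set
    scale = np.maximum(p_max - p_min, eps)
    norm = lambda x: 2.0 * (x - p_min) / scale - 1.0
    return norm(train_imgs), norm(test_imgs), (p_min, p_max)
```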
In step S3, a deep convolutional neural network model is constructed. Specifically:
As shown in FIG. 3, the deep convolutional neural network model includes L convolution blocks, a global average pooling layer and two fully connected layers. The ReLU (Rectified Linear Unit) shown in FIG. 3 is an activation function.
The structure of each convolution block is shown in FIG. 4 and includes three parts: a convolutional layer, a residual unit and a max-pooling layer. The convolutional layer uses d convolution kernels of size 1×1; assuming that C feature maps of size I×J are input to the convolution block, the convolutional layer up-samples the number of channels of the input feature maps (increasing it from C to d). The residual unit performs feature learning, and the max-pooling layer performs spatial down-sampling on each input feature map, halving its width and height, i.e., reducing an I×J feature map to I/2 × J/2; the number of feature maps remains unchanged.
The number of convolution blocks L is limited by the size of the converted grayscale image, as follows:
Lmax = min(log2(max(w, h)) - 1, log2(min(w, h)))
where Lmax denotes the maximum number of convolution blocks that can be stacked in the model, and w and h denote the width and height, respectively, of the converted two-dimensional grayscale image.
The structure of the residual unit is shown in FIG. 5. The residual unit comprises two convolutional layers joined by a skip connection following the residual learning method. Both convolutional layers use d convolution kernels of size 3×3 to learn features of the input feature maps. The input feature map is passed through the ReLU activation function before being fed into each of the two convolutional layers.
Both fully connected layers of the model have 2048 neurons.
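By way of illustration only, the structure described above admits, for example, the following PyTorch sketch; the per-block channel widths and the final classification layer are assumptions not fixed by the specification.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Two 3x3 convolutions with a skip connection; ReLU is applied to the
    input of each convolution, as described for FIG. 5."""
    def __init__(self, d: int):
        super().__init__()
        self.conv1 = nn.Conv2d(d, d, 3, padding=1)
        self.conv2 = nn.Conv2d(d, d, 3, padding=1)

    def forward(self, x):
        out = self.conv1(torch.relu(x))
        out = self.conv2(torch.relu(out))
        return x + out                           # skip connection (residual learning)

class ConvBlock(nn.Module):
    """1x1 convolution (channels C -> d), residual unit, 2x2 max pooling."""
    def __init__(self, c_in: int, d: int):
        super().__init__()
        self.channel_up = nn.Conv2d(c_in, d, 1)  # up-samples the channel count
        self.res = ResidualUnit(d)
        self.pool = nn.MaxPool2d(2)              # halves width and height

    def forward(self, x):
        return self.pool(self.res(self.channel_up(x)))

class FragmentNet(nn.Module):
    def __init__(self, num_classes: int, L: int = 4, widths=(32, 64, 128, 256)):
        super().__init__()
        chans = [1] + list(widths[:L])           # grayscale input has 1 channel
        self.blocks = nn.Sequential(
            *[ConvBlock(chans[i], chans[i + 1]) for i in range(L)])
        self.gap = nn.AdaptiveAvgPool2d(1)       # global average pooling
        self.fc = nn.Sequential(                 # two fully connected layers, 2048 neurons each
            nn.Linear(chans[L], 2048), nn.ReLU(),
            nn.Linear(2048, 2048), nn.ReLU())
        self.head = nn.Linear(2048, num_classes) # assumed classification layer

    def forward(self, x):                        # x: (N, 1, h, w), e.g. (N, 1, 16, 32)
        z = self.gap(self.blocks(x)).flatten(1)
        return self.head(self.fc(z))
```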
Although the present application constructs the model structures shown in FIG. 3, FIG. 4 and FIG. 5 based on certain practical considerations, and gives parameters for the relevant parts of the model, the present invention should be limited neither to these model structures nor to these parameters.
In step S4, the constructed deep convolutional neural network model is trained and evaluated using the preprocessed training set and test set. The evaluation indexes include the average classification accuracy over a plurality of file fragment categories, the macro-averaged F1 score and the micro-averaged F1 score. Specifically:
in this embodiment:
The deep convolutional neural network is trained by an Adam-based gradient descent method. Here, the initial learning rate is set to 0.001, the learning rate is decayed by a factor of 0.2 every 5 epochs, and the total number of training epochs is set to 40. In addition, an early-stopping technique is adopted: when the evaluation indexes of the deep convolutional neural network on the test set show no improvement for 5 consecutive epochs, training is stopped early and the current model parameters are taken as the optimal parameters of the deep convolutional neural network.
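By way of illustration only, this schedule can be sketched in PyTorch as follows; the data loaders and the evaluate function are assumed to exist, and multiplying the learning rate by 0.2 every 5 epochs is one reading of the decay rule.

```python
import torch

def train(model, train_loader, test_loader, evaluate, epochs=40, patience=5):
    """Adam with initial lr 0.001, lr multiplied by 0.2 every 5 epochs, and
    early stopping after `patience` epochs without test-set improvement;
    per the description, the parameters at the stopping point are kept."""
    opt = torch.optim.Adam(model.parameters(), lr=0.001)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.2)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_metric, stall = float("-inf"), 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        sched.step()
        metric = evaluate(model, test_loader)   # monitored evaluation index
        if metric > best_metric:
            best_metric, stall = metric, 0
        else:
            stall += 1
            if stall >= patience:               # no improvement for 5 epochs
                break                           # stop early; current parameters kept
    return model
```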
In step S5, the file type to which a file fragment belongs is predicted using the deep convolutional neural network model. Specifically:
given a file fragment to be predicted, the fragment is converted into a two-dimensional grayscale image as in step S2, and the converted grayscale image is then normalized.
Specifically, the gray value of each pixel is scaled to between -1 and 1 according to the maximum and minimum values of the pixel at the corresponding position in the training-set grayscale images, and the normalized two-dimensional grayscale image is then input into the deep convolutional neural network model to predict the file type to which the file fragment belongs.
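By way of illustration only, the prediction step can be sketched as follows, reusing the hypothetical training-set statistics (p_min, p_max) from the preprocessing sketch above.

```python
import numpy as np
import torch

def predict_type(model, fragment: bytes, h: int, w: int, p_min, p_max, eps=1e-8):
    """Convert a fragment to a normalized grayscale image and predict its type."""
    img = np.frombuffer(fragment, dtype=np.uint8).reshape(h, w).astype(np.float32)
    img = 2.0 * (img - p_min) / np.maximum(p_max - p_min, eps) - 1.0
    x = torch.from_numpy(img)[None, None]       # shape (1, 1, h, w)
    model.eval()
    with torch.no_grad():
        return model(x).argmax(dim=1).item()    # index of the predicted file type
```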
Referring to FIG. 6, a hardware architecture diagram of the file fragment classification system 10 of the present invention is shown. The system comprises: a fragment data set construction module 101, a preprocessing module 102, a model construction module 103, a training evaluation module 104 and a file type prediction module 105.
The fragment data set construction module 101 is configured to construct a file fragment data set using a file data set. The file fragment data set comprises a training set and a test set. Specifically:
in this embodiment, the fragment data set construction module 101 generates the file fragment data set using the public file data set govdocs1, which contains 1000 zip archive files. All zip archives contained in the file data set are decompressed, and the files in the decompressed folders are divided into different categories according to their file types.
For the file fragment types to be studied, a certain number of files are selected for the experiments. The selected files corresponding to each file type under study are divided in a 6:4 ratio into two groups, used to generate file fragments for the training set and the test set respectively.
The fragment data set construction module 101 slices each file according to the selected fragment size to generate a large number of file fragments. To avoid the file signature in the file header, which could be used to identify the file type, the first fragment of each file is deleted; the last fragment of each file is likewise deleted when it is smaller than the specified fragment size. For both the training set and the test set, the number of fragments corresponding to each file type is limited by random sampling so as to balance the data set as much as possible, yielding a large number of fragments of different file types for training and for testing.
The preprocessing module 102 is configured to preprocess the constructed file fragment data set, i.e., the training set and the test set. Specifically:
the preprocessing module 102 converts each file fragment in the generated training set and test set; a one-dimensional file fragment is turned into a two-dimensional grayscale image through a simple shape change, see FIG. 2. A file fragment consists of a byte sequence, and each byte corresponds to one pixel in the two-dimensional grayscale image. When converting a file fragment (a one-dimensional byte sequence) into a two-dimensional grayscale image, the shape of the grayscale image should be as close to a square as possible, so that a model deep enough for classifying the file fragments can be built.
In this embodiment, the preprocessing module 102 converts a 512-byte file fragment into a 16×32 two-dimensional grayscale image (16×32 = 512), and a 4096-byte file fragment into a 64×64 two-dimensional grayscale image (64×64 = 4096).
Finally, the preprocessing module 102 normalizes each two-dimensional grayscale image: the maximum and minimum values of the pixel at each position are computed over the training set, and the corresponding pixels in both the training set and the test set are scaled according to these training-set maxima and minima so that the gray values of the pixels fall between -1 and 1.
The model construction module 103 is configured to construct a deep convolutional neural network model. Specifically:
As shown in FIG. 3, the deep convolutional neural network model includes L convolution blocks, a global average pooling layer and two fully connected layers. The ReLU (Rectified Linear Unit) shown in FIG. 3 is an activation function.
The structure of each convolution block is shown in FIG. 4 and includes three parts: a convolutional layer, a residual unit and a max-pooling layer. The convolutional layer uses d convolution kernels of size 1×1; assuming that C feature maps of size I×J are input to the convolution block, the convolutional layer up-samples the number of channels of the input feature maps (increasing it from C to d). The residual unit performs feature learning, and the max-pooling layer performs spatial down-sampling on each input feature map, halving its width and height, i.e., reducing an I×J feature map to I/2 × J/2; the number of feature maps remains unchanged.
The number of convolution blocks L is limited by the size of the converted grayscale image, as follows:
Lmax = min(log2(max(w, h)) - 1, log2(min(w, h)))
where Lmax denotes the maximum number of convolution blocks that can be stacked in the model, and w and h denote the width and height, respectively, of the converted two-dimensional grayscale image.
The structure of the residual unit is shown in FIG. 5. The residual unit comprises two convolutional layers joined by a skip connection following the residual learning method. Both convolutional layers use d convolution kernels of size 3×3 to learn features of the input feature maps. The input feature map is passed through the ReLU activation function before being fed into each of the two convolutional layers.
Both fully connected layers of the model have 2048 neurons.
Although the present application constructs the model structures shown in FIG. 3, FIG. 4 and FIG. 5 based on certain practical considerations, and gives parameters for the relevant parts of the model, the present invention should be limited neither to these model structures nor to these parameters.
The training evaluation module 104 is configured to train and evaluate the constructed deep convolutional neural network model using the preprocessed training set and test set. The evaluation indexes include the average classification accuracy over a plurality of file fragment categories, the macro-averaged F1 score and the micro-averaged F1 score. Specifically:
in this embodiment:
The training evaluation module 104 trains the deep convolutional neural network by an Adam-based gradient descent method. Here, the initial learning rate is set to 0.001, the learning rate is decayed by a factor of 0.2 every 5 epochs, and the total number of training epochs is set to 40. In addition, an early-stopping technique is adopted: when the evaluation indexes of the deep convolutional neural network on the test set show no improvement for 5 consecutive epochs, training is stopped early and the current model parameters are taken as the optimal parameters of the deep convolutional neural network.
The file type prediction module 105 is configured to predict the file type to which a file fragment belongs using the deep convolutional neural network model. Specifically:
given a file fragment to be predicted, the file type prediction module 105 converts the fragment into a two-dimensional grayscale image and then normalizes the converted grayscale image.
Specifically, the file type prediction module 105 scales the gray value of each pixel to between -1 and 1 according to the maximum and minimum values of the pixel at the corresponding position in the training-set grayscale images, and then inputs the normalized two-dimensional grayscale image into the deep convolutional neural network model to predict the file type to which the file fragment belongs.
Although the present invention has been described with reference to the presently preferred embodiments, it will be understood by those skilled in the art that the foregoing description is illustrative only and is not intended to limit the scope of the invention, as claimed.