CN110602078B

CN110602078B - Method and system for generating application encrypted traffic based on a generative adversarial network

Info

Publication number: CN110602078B
Application number: CN201910832196.3A
Authority: CN
Inventors: 王攀; 王梓炫; 李书航; 黄琛
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2019-09-04
Filing date: 2019-09-04
Publication date: 2022-12-20
Anticipated expiration: 2039-09-04
Also published as: CN110602078A

Abstract

The invention discloses a method and system for generating application encrypted traffic based on a generative adversarial network (GAN). Encrypted traffic packets of real applications (including packet headers) are extracted as decimal byte values and truncated to a fixed length (shorter records are padded with 0). Values are separated by commas and each line holds one traffic record. The records are fed into the GAN for feature extraction. Once the generator and discriminator of the GAN have stabilized, a small amount of encrypted traffic from real applications is enough for the GAN generator to produce any number of encrypted traffic records that carry the application's traffic characteristics. The method abstracts the features of encrypted traffic through the GAN without decrypting the traffic itself, which avoids decryption work, protects user privacy, and greatly reduces the cost of obtaining samples. The method applies to deep-learning-based encrypted traffic recognition scenarios in which encrypted traffic samples are scarce and hard to obtain, which leads to a low recognition rate.

Description

A method and system for generating application encrypted traffic based on a generative adversarial network

Technical field

The invention relates to a method for generating application encrypted traffic based on a generative adversarial network (GAN), and belongs to the technical field of data encryption.

Background

Traffic classification and identification are the basis for improving network management, security monitoring, and quality of service, and they are a prerequisite for network activities such as network design and planning. As users' awareness of privacy protection and security has grown, technologies such as SSL, SSH, and VPN have become increasingly widely used, so encrypted traffic accounts for a growing share of network transmission.

Because of application-layer encryption, traditional port matching and deep packet inspection (DPI) can no longer identify application traffic accurately. Compared with traditional machine learning, deep learning expresses the essential characteristics of data better, but its training relies on a large number of labeled samples, and the accuracy of those samples directly determines the recognition rate of the trained model.

The main idea of the generative adversarial network (GAN) algorithm is to use two networks, a generator network and a discriminator network, which converge to an optimal solution through a minimax game. GANs have shown great advantages in generating images, sound, and text. To learn the generator's distribution, a prior p_z(z) is defined on the input noise variable z, and the discriminator output D(x) is defined as the probability that x came from the real data distribution p_data(x) rather than from G(z). The discriminator D is trained to maximize the probability of correctly distinguishing real samples from samples produced by the generator G, while G is trained to minimize log(1-D(G(z))). In summary, D and G play the following two-player minimax game with value function V(D,G); that is, the GAN loss is given by formula (1):

$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (1)$$

The generator G takes random noise as input and generates samples. The discriminator D receives data from both the generator and the real samples and tries to tell the two apart. The two networks play a game: G keeps learning to generate more realistic samples, and D keeps learning to better distinguish generated samples from real ones. The two networks are trained simultaneously, in the expectation that the game makes generated samples indistinguishable from real samples. The output of the generator G is a synthetic sample X_fake = G(z). The discriminator D receives either a synthetic sample from the generator or a real data sample and outputs a probability distribution over the possible sources, P(F|X) = D(X), as shown in equation (2). The discriminator is trained to maximize log D(x), while the generator G is trained to minimize log(1-D(G(z))).

$$L = \mathbb{E}[\log P(F = \mathrm{real} \mid X_{real})] + \mathbb{E}[\log P(F = \mathrm{fake} \mid X_{fake})] \quad (2)$$

Discriminator loss term: log D(x). Generator loss term: log(1-D(G(z))).
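To make the two loss terms concrete, the following is a minimal Python sketch that estimates the value function V(D,G) from lists of discriminator outputs. The function names `gan_value` and `generator_term` and the clipping constant `eps` are illustrative assumptions and do not come from the patent.

```python
import math


def gan_value(d_real, d_fake, eps=1e-12):
    """Sample estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples.
    d_fake: discriminator outputs D(G(z)) on generated samples.
    eps is an assumed clipping constant that avoids log(0).
    """
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_term(d_fake, eps=1e-12):
    """The only part of V(D, G) that depends on G: mean log(1 - D(G(z)))."""
    return sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
```

A discriminator that outputs 0.5 everywhere gives V = -log 4, while a discriminator that separates the two sources well pushes V toward 0. The generator term decreases as D(G(z)) approaches 1.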

Training:

Each round of training has two parts, min G and max D.

Training the discriminator:

Consider the max D part first. Training usually holds G (the generator) fixed while D is trained. The training goal of D is to distinguish real from fake correctly. If 1 and 0 represent real and fake respectively, then for the first expectation term the input is sampled from real data, so D(x) should approach 1, which makes the first term larger. Here z is a random input and G(z) is a generated sample. For generated samples, the discriminator output D(G(z)) should be as close to 0 as possible. Both conditions maximize the total value, which is the meaning of max D.

Training the generator:

The second part holds D fixed and trains G. Only the second expectation term matters now. Because the aim is to fool D, D(G(z)) should be as close to 1 as possible, which makes this term as small as possible. This is min G. The discriminator is not easily fooled, so at this point it produces a relatively large error; that error is used to update G, and G improves.

Repeat the two steps above.

After N rounds of training iterations, the discriminator and generator losses stabilize.
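The alternation described above can be sketched with a deliberately tiny one-dimensional GAN in plain Python. Everything specific here is an assumption for illustration only: the linear generator G(z) = a*z + b, the logistic discriminator D(x) = sigmoid(w*x + c), the N(4, 1) "real" data, the learning rate, and the names `sigmoid` and `train_toy_gan`. The patent's networks operate on 784-value traffic records and are not specified at this level of detail.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=200, lr=0.01, seed=0):
    """Alternate one discriminator step and one generator step per round."""
    rng = random.Random(seed)
    a, b, w, c = 1.0, 0.0, 0.1, 0.0  # assumed initial parameters
    for _ in range(steps):
        # max D: hold G fixed, ascend log D(x) + log(1 - D(G(z))).
        x = rng.gauss(4.0, 1.0)            # one "real" sample
        fake = a * rng.gauss(0.0, 1.0) + b  # one generated sample
        d_real = sigmoid(w * x + c)
        d_fake = sigmoid(w * fake + c)
        w += lr * ((1.0 - d_real) * x - d_fake * fake)
        c += lr * ((1.0 - d_real) - d_fake)

        # min G: hold D fixed, descend log(1 - D(G(z))).
        z = rng.gauss(0.0, 1.0)
        d_fake = sigmoid(w * (a * z + b) + c)
        # d/da log(1 - D) = -D * w * z and d/db log(1 - D) = -D * w,
        # so gradient descent adds lr * D * w * z and lr * D * w.
        a += lr * d_fake * w * z
        b += lr * d_fake * w
    return a, b, w, c
```

The sketch uses single-sample updates to keep the gradients readable. A practical implementation would use mini-batches and a deep learning framework.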

In addition, capturing and labeling the traffic of encrypted applications is very difficult. Collecting the number of samples needed to train a good model directly is hard and costly.

Summary of the invention

Purpose of the invention: To overcome the deficiencies of the prior art, the invention provides a method and system for generating application encrypted traffic based on a generative adversarial network. Using a small amount of real data from each application, the invention relies on the GAN's own adversarial mechanism to extract each application's encrypted traffic features, and triggers the GAN generator with Gaussian noise to produce new data that carries those features. Because the new data has the application's encrypted traffic features, it can be used in subsequent training of recognition models, which greatly alleviates the difficulty of obtaining deep learning samples and the high cost of labeling them.

Technical solution: To achieve the above purpose, the invention adopts the following technical solution:

A method for generating application encrypted traffic based on a generative adversarial network, comprising the following steps:

Step 1: Obtain the encrypted traffic data of a real application, extract it, convert it to decimal, and truncate it to 784 bytes, padding with 0 if it is shorter. Separate the bytes with commas, put exactly one traffic record on each line, and store the traffic of the same application in this form in a single CSV file.

Step 2: Build a generative adversarial network (GAN) comprising a generator and a discriminator connected together. Feed the CSV file to the discriminator of the GAN for feature extraction, trigger the generator with Gaussian noise, and feed the generated data to the discriminator as well. Train the generator and the discriminator simultaneously until their losses stabilize.

Step 3: Trigger the generator with a specified number of Gaussian noise samples. The generator produces a corresponding number of generated traffic records that carry the application's encrypted traffic features. The generated traffic has the same format as the real application's encrypted traffic data: each record is 784 bytes long and occupies one line. Finally, label the generated data with an application label. Labels are assigned manually: each line of data generated for application 1 corresponds to a label 1, each line generated for application 2 corresponds to a label 2, and so on. The generated data and the label data are stored in two CSV files, which yields a labeled data set.

Preferably: by repeating steps 1-3 multiple times, the generated data and the corresponding label data are obtained for the encrypted traffic of different real applications.

Preferably: the encrypted traffic data of the real application includes the packet header.
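The preprocessing in step 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming the raw packet (header included) is already available as a `bytes` object. The names `packet_to_row` and `write_app_csv` are hypothetical and do not come from the patent, and packet capture itself is out of scope.

```python
import csv
import io
from typing import Iterable, List

ROW_LEN = 784  # fixed record length stated in the description


def packet_to_row(packet: bytes, row_len: int = ROW_LEN) -> List[int]:
    """Convert raw packet bytes to decimal values, truncated or zero-padded."""
    values = list(packet[:row_len])               # each byte becomes 0..255
    values.extend([0] * (row_len - len(values)))  # pad short packets with 0
    return values


def write_app_csv(packets: Iterable[bytes], out_file) -> None:
    """Write one comma-separated traffic record per line for one application."""
    writer = csv.writer(out_file)
    for packet in packets:
        writer.writerow(packet_to_row(packet))
```

For example, `packet_to_row(b"\x16\x03\x01")` yields a 784-element list that starts with 22, 3, 1 and is zero-filled afterwards. In practice `out_file` would be a file opened with `newline=""`, one file per application.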

A generation system that uses the above method for generating application encrypted traffic based on a generative adversarial network, comprising a generative adversarial network (GAN), an input module, an output module, and a Gaussian noise generator. The GAN comprises a generator and a discriminator connected together, and the Gaussian noise generator is connected to the generator, wherein:

The input module obtains the encrypted traffic data of a real application, extracts it, converts it to decimal, and truncates it to 784 bytes, padding with 0 if it is shorter. Bytes are separated by commas, each line holds exactly one traffic record, and the traffic of the same application is stored in this form in a single CSV file.

The Gaussian noise generator produces a set of random Gaussian noise data that is used to trigger the generator.

The generator takes the Gaussian noise data fed into its neural network, adjusts the weights of the network's neurons according to the discrimination results returned by the discriminator, and fits generated traffic that carries the application features.

The discriminator is a neural network trained on the real application's encrypted traffic data supplied by the input module. It judges whether the traffic produced by the generator contains the application features and feeds the result back to the generator, until the losses of the generator and discriminator stabilize and the discriminator can no longer tell whether its input is real data or generator output. At that point the training of the GAN is complete.

The output module outputs the generated traffic produced by the trained generator.
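As a small illustration of the Gaussian noise generator module, the sketch below draws a specified number of noise vectors from a standard normal distribution. The noise dimension of 100, the zero mean and unit variance, and the name `gaussian_noise` are assumptions; the patent does not state them.

```python
import random
from typing import List, Optional


def gaussian_noise(count: int, dim: int = 100,
                   seed: Optional[int] = None) -> List[List[float]]:
    """Return `count` noise vectors, each with `dim` samples from N(0, 1)."""
    rng = random.Random(seed)  # a fixed seed makes the noise reproducible
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]
```

Each vector would trigger the trained generator once, so `count` equals the number of traffic records to be generated.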

Compared with the prior art, the invention has the following beneficial effects:

The invention's GAN-based method for generating application encrypted traffic uses a small amount of encrypted traffic from a real application to generate any number of encrypted traffic records that carry that application's traffic characteristics. The method does not require collecting a large number of samples; a few thousand records are enough to produce good results. The method abstracts the features of encrypted traffic through the GAN without decrypting the traffic itself, which avoids decryption work, protects user privacy, and greatly reduces the cost of obtaining samples. The method applies to deep-learning-based encrypted traffic recognition scenarios in which encrypted traffic samples are scarce and hard to obtain, which leads to a low recognition rate.

With only a small amount of encrypted traffic data from real applications, the invention can quickly and simply produce encrypted traffic with good application characteristics, which substantially reduces the cost of traffic collection and labeling.

Description of drawings

Figure 1 shows the content and format of the real data and the generated data.

Figure 2 shows the generation process and the neural network structure of the generator and the discriminator.

Figure 3 shows the recognition results for real data on a recognition model trained on real data only (four applications as an example).

Figure 4 shows the recognition results for real data on a model trained jointly on generated data and real data (four applications as an example).

Detailed description

The invention is further explained below with reference to the drawings and specific embodiments. These examples only illustrate the invention and do not limit its scope. After reading this disclosure, modifications of various equivalent forms made by those skilled in the art fall within the scope defined by the claims appended to this application.

In the method for generating application encrypted traffic based on a generative adversarial network, as shown in Figure 1, traffic data (including protocol headers) is extracted from the encrypted traffic packets of a real application, converted to decimal, and truncated to 784 bytes (padded with 0 if shorter). Bytes are separated by commas, each line holds exactly one traffic record, and the traffic of the same application is stored in this form in a single CSV file, which is fed into the generative adversarial network (GAN) for feature extraction. The GAN generator is then triggered with a specified number of Gaussian noise samples to produce the corresponding number of new records that carry the application's encrypted traffic features, together with the corresponding label data (one label per generated line). Labels are assigned manually: each line of data generated for application 1 corresponds to a label 1, each line generated for application 2 corresponds to a label 2, and so on. The generated data and the label data are stored in two CSV files. The method comprises the following steps:

Step 1: Obtain the encrypted traffic data of a real application (including protocol headers), extract it, convert it to decimal, and truncate it to 784 bytes (padded with 0 if shorter). Separate the bytes with commas, put exactly one traffic record on each line, and store the traffic of the same application in this form in a single CSV file. The file is fed into the generative adversarial network (GAN) to train the discriminator, so that the discriminator learns to recognize the encrypted traffic features of the real application.

Step 2: Build a generative adversarial network (GAN) comprising a generator and a discriminator connected together. Feed the CSV file to the discriminator of the GAN for feature extraction, trigger the generator with Gaussian noise, and feed the generated data to the discriminator as well. Train the generator and the discriminator simultaneously until their losses stabilize, that is, until the loss falls below a predetermined loss threshold.

Step 3: Trigger the generator with a specified number of Gaussian noise samples. The generator produces a corresponding number of generated traffic records that carry the application's encrypted traffic features. The generated traffic has the same format as the real application's encrypted traffic data: each record is 784 bytes long and occupies one line. Finally, label the generated data with an application label. Labels are assigned manually: each line of data generated for application 1 corresponds to a label 1, each line generated for application 2 corresponds to a label 2, and so on. The generated data and the label data are stored in two CSV files. The generated traffic is passed to the discriminator, which discriminates between the generated traffic and the real application's encrypted traffic features according to a preset threshold and feeds the result back to the generator. The generator updates itself according to the result, and the iteration continues until the discriminator's discrimination between generated traffic and real encrypted traffic features reaches the preset threshold, which completes the training of the GAN.

Step 4: Trigger the trained generator with a specified number of Gaussian noise samples. The trained generator produces a corresponding number of generated traffic records that carry the application's encrypted traffic features, as shown in Figure 1, which yields a labeled data set.

Repeat steps 1-3 multiple times to obtain the generated data and the corresponding label data for the encrypted traffic of different real applications.
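The labeling convention in the steps above (one label line per generated line, stored in a second CSV file) can be sketched as follows. This is a hedged illustration: the names `to_byte_row` and `write_labeled_dataset` are hypothetical, and the assumption that the generator emits floats scaled to [0, 1] that must be mapped back to byte values is not stated in the patent.

```python
import csv
import io
from typing import Iterable, List


def to_byte_row(values: Iterable[float]) -> List[int]:
    """Map generator outputs (assumed to lie in [0, 1]) to integers 0..255."""
    return [min(255, max(0, round(v * 255))) for v in values]


def write_labeled_dataset(rows: Iterable[List[int]], app_label: int,
                          data_file, label_file) -> None:
    """Write generated rows and their manually assigned label in parallel.

    Line i of label_file holds the label for line i of data_file.
    """
    data_writer = csv.writer(data_file)
    label_writer = csv.writer(label_file)
    for row in rows:
        data_writer.writerow(row)
        label_writer.writerow([app_label])
```

Running this once per application, with `app_label` set to 1, 2, and so on, and appending to the same pair of files gives the two-file labeled data set described above.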

A generation system that uses the method for generating application encrypted traffic based on a generative adversarial network, as shown in Figure 2, comprises a generative adversarial network (GAN), an input module, an output module, and a Gaussian noise generator. The GAN comprises a generator and a discriminator connected together, and the Gaussian noise generator is connected to the generator.

The invention uses real encrypted application traffic, taken from public data sets or captured directly, to iteratively train the discriminator and generator of the GAN until the generator has learned the application's traffic characteristics well. A specified number of Gaussian noise samples is then generated at random and fed into the trained GAN model, and the generator produces a corresponding number of encrypted traffic records that carry the application's traffic characteristics, which expands both the data set and the label set.

With a specified number of Gaussian noise samples, the generator produces a corresponding number of generated records that carry the application's encrypted traffic features. Training a model on this generated data together with real data achieves a good recognition rate, which expands the data set, reduces the cost of data collection and labeling, and improves the accuracy of the recognition model.

To show that the data generated by this method carries the characteristics of real applications, a generation experiment was run on four common applications (aimchat, hangouts, icq, netflix) from the "ISCX VPN-nonVPN traffic dataset". Training on a mixture of real data and generated data effectively improves the recognition rate. Figure 3 shows the recognition results for real data on a recognition model trained on real data only (four applications as an example), and Figure 4 shows the recognition results for real data on a model trained jointly on generated data and real data (four applications as an example). As Figures 3 and 4 show, the data generated by this method effectively improves the application recognition rate, which demonstrates that the generated data can expand the data set and reduce the difficulty of obtaining and labeling data.

The invention uses a small amount of encrypted traffic from a real application to generate any number of encrypted traffic records that carry that application's traffic characteristics. The method does not require collecting a large number of samples; a few thousand records are enough to produce good results. The method abstracts the features of encrypted traffic through the GAN without decrypting the traffic itself, which avoids decryption work while protecting user privacy and greatly reducing the cost of obtaining samples. The method applies to deep-learning-based encrypted traffic recognition scenarios in which encrypted traffic samples are scarce and hard to obtain, which leads to a low recognition rate.

The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art can make a number of improvements and refinements without departing from the principle of the invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the invention.

Claims

1. A method for generating application encrypted traffic based on a generative adversarial network, characterized by comprising the following steps:

Step 1: obtaining the encrypted traffic data of a real application and extracting it, without decrypting the encrypted traffic data itself; converting it to decimal and truncating it to 784 bytes, with bytes separated by commas and exactly one traffic record placed on each line; and storing the traffic of the same application in this form in a single CSV file;

Step 2: building a generative adversarial network (GAN) comprising a generator and a discriminator connected together; feeding the CSV file to the discriminator of the GAN for feature extraction, triggering the generator with Gaussian noise, and feeding the generated data to the discriminator as well; training the generator and the discriminator simultaneously until their losses stabilize; wherein a Gaussian noise generator produces a set of random Gaussian noise data to trigger the generator;

Step 3: triggering the generator with a specified number of Gaussian noise samples, the generator producing a corresponding number of generated traffic records that carry the application's encrypted traffic features, the generated traffic having the same format as the real application's encrypted traffic data, with a length of 784 bytes and one traffic record per line; finally labeling the generated data with an application label, the label being assigned manually, each line of data generated for application 1 corresponding to a label 1, each line generated for application 2 corresponding to a label 2, and so on; the generated data and the label data being stored in two CSV files; wherein, by generating a specified number of Gaussian noise samples, the generator produces a corresponding number of encrypted traffic records that carry the application's traffic characteristics, thereby expanding the data set and the label set;

repeating steps 1-3 multiple times to obtain the generated data and the corresponding label data for the encrypted traffic of different real applications.

2. The method for generating application encrypted traffic based on a generative adversarial network according to claim 1, characterized in that the encrypted traffic data of the real application includes the packet header.

3. A generation system using the method for generating application encrypted traffic based on a generative adversarial network according to claim 1, characterized by comprising a generative adversarial network (GAN), an input module, an output module, and a Gaussian noise generator, the GAN comprising a generator and a discriminator connected together, and the Gaussian noise generator being connected to the generator, wherein:

the input module is used to obtain the encrypted traffic data of a real application, extract it, convert it to decimal, and truncate it to 784 bytes, padding with 0 if it is shorter, with bytes separated by commas, exactly one traffic record on each line, and the traffic of the same application stored in this form in a single CSV file;

the Gaussian noise generator is used to trigger the generator by producing a set of random Gaussian noise data;

the generator takes the Gaussian noise data fed into its neural network, adjusts the weights of the network's neurons according to the discrimination results returned by the discriminator, and fits generated traffic that carries the application features;

the discriminator is a neural network trained on the real application's encrypted traffic data supplied by the input module; it judges whether the traffic produced by the generator contains the application features and feeds the result back to the generator, until the losses of the generator and the discriminator stabilize and the discriminator can no longer tell whether its input is real data or data produced by the generator, at which point the training of the GAN is complete;

the output module is used to output the generated traffic produced by the trained generator.