
CN103746982A - Automatic generation method and system for HTTP (Hypertext Transfer Protocol) network feature code - Google Patents

Info

Publication number
CN103746982A
Authority
CN
China
Prior art keywords
clustering
grained
fine
http
coarse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310745102.1A
Other languages
Chinese (zh)
Other versions
CN103746982B (en)
Inventor
李可
刘潮歌
崔翔
李丹
梁玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201310745102.1A
Publication of CN103746982A
Application granted
Publication of CN103746982B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for automatically generating HTTP network signatures. The method comprises a packet signature generation step, a URI signature generation step, and a step of generating the total set of HTTP network signatures. In the packet signature generation step, feature statistics and packet contents are extracted from the question-and-answer packets (the first request and the first response of each HTTP session) of multiple network samples; two-stage clustering first produces a coarse-grained cluster set and then, on that basis, a fine-grained cluster set, from which the question-and-answer packet signature set of the network samples (formula image DDA0000449500680000011) is generated. In the URI signature generation step, for traffic in the network samples that was assigned to a class of its own, URI path and parameter signatures are additionally extracted to generate the URI signature set (formula image DDA0000449500680000012). Finally, the question-and-answer packet signature set (formula image DDA0000449500680000013) and the URI signature set (formula image DDA0000449500680000014) are merged to generate the total signature set T_all.

Figure 201310745102

Description

A method and system for automatically generating HTTP network signatures

Technical field

The invention relates to the field of network security, in particular to a method for generating signatures for unknown HTTP botnets, and more specifically to a method and system for automatically generating HTTP network signatures.

Background

In recent years, network security incidents have occurred frequently, and network security has become a prominent issue at the level of national strategy. However, because Internet users generally lack security awareness and because computer operating systems and application software contain various vulnerabilities, more and more computers have quietly become compromised hosts ("zombies") in botnets, serving as pawns for others to steal private data, attack network resources, make illegal profits, and carry out other criminal activities.

A botnet is "a general-purpose computing platform that is built by compromising a number of non-cooperative user terminals in cyberspace and that can be remotely controlled by an attacker". Here, "non-cooperative" means that the compromised user terminals are unaware of the intrusion; the "attacker" is the controller (botmaster) who has control over the resulting botnet; and "remote control" means that the attacker can control the non-cooperative user terminals in a one-to-many fashion through a command and control (C&C) channel. A controlled victim terminal becomes a node of the botnet and is called a "zombie host". The common botnet command and control protocols are of three main types: IRC, HTTP, and P2P. Because HTTP has good penetration and supports centralized control, more and more botmasters use HTTP as their communication and control protocol. By controlling a large number of zombie hosts through a botnet, the controller gains powerful distributed computing capability and a rich store of information resources. This makes it easier for attackers to launch malicious activities such as distributed denial-of-service (DDoS) attacks, online identity theft, spam, click fraud, and Bitcoin mining. As the most effective general-purpose attack platform in the hands of attackers, botnets have become one of the biggest security threats to the Internet today.

Botnets pose such a serious threat mainly for the following reasons:

A botnet is a new form of attack derived from traditional worms and Trojan horses. Worms have the advantage of spreading quickly by exploiting security vulnerabilities, but they are uncontrollable; Trojan horses can remotely control victims, but they infect slowly, manage only a small number of hosts, and offer only simple control. Botnets combine the strengths of both and compensate for their weaknesses, which makes them more harmful.

Botnets are highly controllable and separate control logic from attack. The zombies in a botnet can be manipulated by the controller through the command and control channel and can launch a large-scale attack (such as a DDoS attack) against a specific target within a short time. In addition, the bot program on the zombie host is responsible only for the control logic; the actual attack tasks are dynamically distributed by the controller on demand. This approach splits the complete threat entity into multiple parts, which both gives task distribution good flexibility and improves the survivability of the botnet.

Security measures often lag behind the emergence of the new botnets they target. Signature-based detection is an effective method. However, most traditional signature generation techniques target only worms and cannot generate high-quality signatures efficiently and automatically, so a botnet cannot be effectively contained in the early stage of its growth.

Many botnet detection methods and systems exist, but most of them suffer from high time overhead and difficult deployment and therefore cannot be widely adopted in practice. Traditional intrusion detection systems (IDS) are widely applicable and can effectively discover abnormal network behavior in a given network; however, lacking signatures and rules for the corresponding botnets, they cannot promptly discover potential new botnet hosts in that network. Current signature extraction techniques have the following main problems:

Most traditional signature generation algorithms target only worms, and signature generation methods for HTTP botnets are lacking. The vast majority of existing methods extract worm signatures; because botnet command and control communication has different characteristics, these traditional methods are not well suited to extracting HTTP botnet signatures.

Existing signature generation methods are inefficient and time-consuming. Traditional signature generation relies mostly on manual judgment and cannot be automated at scale. Although a few automatic extraction methods for botnet signatures have been proposed to address this problem, their computational overhead is very high, so they cannot be applied widely.

Signatures generated by existing methods are of low quality and poor usability. Traditional methods do not take the command and control communication characteristics of HTTP botnets into account and are therefore not targeted; the signature sets they generate are large and of low quality.

Summary of the invention

The technical problem to be solved by the invention is to overcome the long signature generation time and difficult deployment of existing systems; to this end, a method and system for automatically generating HTTP network signatures are proposed.

To achieve the above object, the invention provides a method for automatically generating HTTP network signatures, characterized in that the method comprises:

Packet signature generation step: from the feature statistics and packet contents extracted from the question-and-answer packets of multiple network samples, generate a coarse-grained cluster set through two-stage clustering, then generate a fine-grained cluster set on the basis of the coarse-grained cluster set, and generate the question-and-answer packet signature set of the network samples (formula image BDA0000449500660000031) from the fine-grained cluster set;

URI signature generation step: for traffic in the network samples that was assigned to a class of its own, additionally extract URI path and parameter signatures to generate the URI signature set (formula image BDA0000449500660000032);

Step of generating the total set of HTTP network signatures: merge the question-and-answer packet signature set (formula image BDA0000449500660000033) and the URI signature set (formula image BDA0000449500660000034) to generate the total signature set T_all.
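
The final merge amounts to a set union. The following is a minimal sketch, assuming signatures are represented as strings and that a URI signature consists of the URI path plus the sorted query parameter names. That representation, and the function names `uri_signature` and `merge_signatures`, are illustrative assumptions; the text does not specify them.

```python
from urllib.parse import urlsplit, parse_qsl


def uri_signature(uri: str) -> str:
    """Build an illustrative URI signature: path plus sorted parameter names.

    Assumption: parameter values differ from bot to bot, so only names are kept.
    """
    parts = urlsplit(uri)
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return parts.path + ("?" + "&".join(names) if names else "")


def merge_signatures(qa_signatures: set, uri_signatures: set) -> set:
    """Merge the question-and-answer packet set and the URI set into T_all."""
    return qa_signatures | uri_signatures


# Hypothetical example data.
t_uri = {uri_signature("http://example.test/gate.php?id=42&ver=1.0")}
t_all = merge_signatures({"POST /gate.php"}, t_uri)
```

Using a set for T_all removes signatures that appear in both inputs, which matches the idea of a single combined collection.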

In the above method for automatically generating HTTP network signatures, the packet signature generation step comprises:

Data extraction step: extract the data-flow feature statistics and the question-and-answer packet contents of the network samples;

Two-stage clustering step: perform two-stage clustering according to the feature statistics of the network samples and the contents of the question-and-answer packets, generating the fine-grained cluster set on the basis of the coarse-grained cluster set;

Question-and-answer packet signature generation step: according to the fine-grained cluster set, generate the signature sets of the request packets and of the response packets respectively.

In the above method for automatically generating HTTP network signatures, the following step precedes the data extraction step:

Whitelist filtering step: filter out the traffic in the network samples that accesses legitimate websites.
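
The whitelist step can be sketched as a simple host filter. The following assumes each session is a dictionary with a `host` key and that the whitelist is a set of domain names whose subdomains are also considered legitimate; both are assumptions made for illustration, as is the function name `filter_whitelisted`.

```python
def filter_whitelisted(sessions, whitelist):
    """Drop sessions whose host is a whitelisted domain or a subdomain of one."""
    def is_legit(host):
        host = host.lower().rstrip(".")
        return any(host == d or host.endswith("." + d) for d in whitelist)

    return [s for s in sessions if not is_legit(s["host"])]


# Hypothetical example data.
sessions = [
    {"host": "www.example.com", "uri": "/index.html"},
    {"host": "c2.badhost.test", "uri": "/gate.php"},
]
suspicious = filter_whitelisted(sessions, {"example.com"})
```

Matching on the full label boundary (`"." + d`) avoids treating a look-alike such as `notexample.com` as whitelisted.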

In the above method for automatically generating HTTP network signatures, the data extraction step further comprises:

Data content extraction step: extract the contents of the question-and-answer packets of each HTTP session connection;

Coarse-grained clustering attribute extraction step: taking each network sample as a unit, extract the four statistical values used for coarse-grained clustering, namely the total number of HTTP flows, the number of bytes sent per second, the average HTTP packet size, and the total number of HTTP packets, to obtain the coarse-grained clustering attributes;

Fine-grained clustering attribute extraction step: taking each HTTP session as a unit, extract the four statistical values used for fine-grained clustering, namely the number of request packets in the session, the number of response packets in the session, the size of the first request packet, and the size of the first response packet, to obtain the fine-grained clustering attributes;

Data set aggregation step: combine the question-and-answer packet contents, the coarse-grained clustering attributes, and the fine-grained clustering attributes into a five-tuple data set (formula image BDA0000449500660000041), where each five-tuple has the format <sample id, session id, question-and-answer packet contents, coarse-grained clustering attributes, fine-grained clustering attributes>.
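
The attribute extraction and the five-tuple can be sketched as follows. This is an illustration under stated assumptions: a session is modelled as an ordered list of (direction, size, payload) packets, one session counts as one HTTP flow, "bytes sent per second" is taken to mean request bytes divided by a nonzero capture duration, and the names `Session`, `fine_attributes`, `coarse_attributes`, and `build_five_tuples` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Session:
    """One HTTP session; each packet is (direction, size in bytes, payload)."""
    session_id: int
    packets: List[Tuple[str, int, bytes]]  # direction is "req" or "resp"


def fine_attributes(s: Session) -> Tuple[int, int, int, int]:
    """(request count, response count, first request size, first response size)."""
    reqs = [p for p in s.packets if p[0] == "req"]
    resps = [p for p in s.packets if p[0] == "resp"]
    return (len(reqs), len(resps),
            reqs[0][1] if reqs else 0,
            resps[0][1] if resps else 0)


def coarse_attributes(sessions: List[Session], duration_s: float):
    """(flow count, bytes sent per second, mean packet size, packet count)."""
    packets = [p for s in sessions for p in s.packets]
    sent = sum(size for direction, size, _ in packets if direction == "req")
    total = sum(size for _, size, _ in packets)
    mean = total / len(packets) if packets else 0.0
    return (len(sessions), sent / duration_s, mean, len(packets))


def build_five_tuples(sample_id, sessions, duration_s):
    """Return <sample id, session id, Q&A contents, coarse attrs, fine attrs> rows."""
    coarse = coarse_attributes(sessions, duration_s)
    rows = []
    for s in sessions:
        first_req = next((p[2] for p in s.packets if p[0] == "req"), b"")
        first_resp = next((p[2] for p in s.packets if p[0] == "resp"), b"")
        rows.append((sample_id, s.session_id, (first_req, first_resp),
                     coarse, fine_attributes(s)))
    return rows
```

Every row of a sample carries the same coarse-grained attributes, so the coarse stage can cluster samples while the fine stage clusters individual sessions from the same data set.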

In the above method for automatically generating HTTP network signatures, the two-stage clustering step further comprises:

Coarse-grained clustering step: automatically cluster the five-tuple data set (formula image BDA0000449500660000042) by the coarse-grained clustering attributes to obtain the coarse-grained cluster set C; if the coarse-grained cluster set C belongs to only one of the network samples, perform the URI signature generation step;

Fine-grained clustering step: on the basis of the coarse-grained cluster set C, automatically cluster all sessions in each c_i (c_i ∈ C) by the fine-grained clustering attributes to obtain the fine-grained cluster set C′ (C′ ∈ C_i);

Sample coverage check step: if there is a fine-grained cluster c_i′ (c_i′ ∈ C′) whose sessions all come from k samples, where k is greater than 1 and less than or equal to the number of network samples, the fine-grained clustering is considered successful; otherwise, perform the URI signature generation step.
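
The sample coverage check can be sketched as counting the distinct sample ids in a fine-grained cluster. The following assumes each row starts with its sample id and applies the check per cluster, routing clusters that fail it to the URI step; this per-cluster reading and the function names are assumptions made for illustration.

```python
def coverage_ok(cluster_rows, total_samples):
    """True if the cluster's sessions come from k samples with 1 < k <= total_samples."""
    k = len({row[0] for row in cluster_rows})  # row[0] is the sample id
    return 1 < k <= total_samples


def split_by_coverage(fine_clusters, total_samples):
    """Separate clusters that qualify for packet signatures from the URI fallback."""
    ok = [c for c in fine_clusters if coverage_ok(c, total_samples)]
    fallback = [c for c in fine_clusters if not coverage_ok(c, total_samples)]
    return ok, fallback
```

Requiring k > 1 means a packet signature is only generated from behaviour shared by at least two samples, which is what makes it a signature of a family rather than of a single capture.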

In the above method for automatically generating HTTP network signatures, the question-and-answer packet signature generation step further comprises:

HTTP signature set generation step: for all session connections in each fine-grained cluster c_i′ (c_i′ ∈ C′), generate signatures for the request packets and for the response packets separately, automatically computing token signatures in turn, so that each fine-grained cluster c_i′ finally yields one request-packet signature and one response-packet signature; together these form the HTTP signature set W;

Signature filtering step: filter the HTTP signature set W, removing unqualified signatures and merging duplicate signatures, to obtain the question-and-answer packet signature set (formula image BDA0000449500660000043).
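
The description elsewhere names the longest common subsequence (LCS) algorithm as the basis for signature generation. The sketch below assumes payloads are byte strings, folds a pairwise LCS over the payloads of one cluster (an approximation of a true multi-sequence LCS), and treats "unqualified" as "shorter than a minimum length". The `min_len` threshold and all function names are hypothetical; the text does not define the filtering criterion.

```python
from functools import reduce


def lcs(a: bytes, b: bytes) -> bytes:
    """Longest common subsequence of two byte strings (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    out = bytearray()
    i, j = m, n
    while i and j:  # walk back through the table to recover one LCS
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return bytes(reversed(out))


def cluster_signature(payloads):
    """Fold the pairwise LCS over all payloads of one fine-grained cluster."""
    return reduce(lcs, payloads)


def filter_signatures(signatures, min_len=4):
    """Drop signatures shorter than min_len and merge duplicates, keeping order."""
    seen, kept = set(), []
    for s in signatures:
        if len(s) >= min_len and s not in seen:
            seen.add(s)
            kept.append(s)
    return kept
```

For two requests that differ only in a parameter value, such as `GET /gate.php?id=17` and `GET /gate.php?id=42`, the fold keeps the shared part `GET /gate.php?id=` and drops the varying digits.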

The invention also provides a system for automatically generating HTTP network signatures that uses the above method, characterized in that the system comprises:

Packet signature generation module: from the feature statistics and packet contents extracted from the question-and-answer packets of multiple network samples, generates a coarse-grained cluster set through two-stage clustering, then generates a fine-grained cluster set on the basis of the coarse-grained cluster set, and generates the question-and-answer packet signature set of the network samples from the fine-grained cluster set;

URI signature generation module: for traffic in the network samples that was assigned to a class of its own, additionally extracts URI path and parameter signatures to generate the URI signature set (formula image BDA0000449500660000052);

Module for generating the total set of HTTP network signatures: merges the question-and-answer packet signature set (formula image BDA0000449500660000053) and the URI signature set (formula image BDA0000449500660000054) to generate the total signature set T_all.

In the above system for automatically generating HTTP network signatures, the packet signature generation module comprises:

Whitelist filtering module: filters out traffic that accesses legitimate websites;

Data extraction module: extracts the data-flow feature statistics and the question-and-answer packet contents of the network samples;

Two-stage clustering module: performs two-stage clustering according to the feature statistics of the network samples and the contents of the question-and-answer packets, generating the fine-grained cluster set on the basis of the coarse-grained cluster set;

Question-and-answer packet signature generation module: according to the fine-grained cluster set, generates the signature sets of the request packets and of the response packets respectively.

In the above system for automatically generating HTTP network signatures, the following module precedes the data extraction module:

Whitelist filtering module: filters out the traffic in the network samples that accesses legitimate websites.

In the above system for automatically generating HTTP network signatures, the data extraction module further comprises:

Data content extraction module: extracts the contents of the question-and-answer packets of each HTTP session connection;

Coarse-grained clustering attribute extraction module: taking each network sample as a unit, extracts the four statistical values used for coarse-grained clustering, namely the total number of HTTP flows, the number of bytes sent per second, the average HTTP packet size, and the total number of HTTP packets, to obtain the coarse-grained clustering attributes;

Fine-grained clustering attribute extraction module: taking each HTTP session as a unit, extracts the four statistical values used for fine-grained clustering, namely the number of request packets in the session, the number of response packets in the session, the size of the first request packet, and the size of the first response packet, to obtain the fine-grained clustering attributes;

Data set aggregation module: combines the question-and-answer packet contents, the coarse-grained clustering attributes, and the fine-grained clustering attributes into a five-tuple data set (formula image BDA0000449500660000055), where each five-tuple has the format <sample id, session id, question-and-answer packet contents, coarse-grained clustering attributes, fine-grained clustering attributes>.

In the above system for automatically generating HTTP network signatures, the two-stage clustering module further comprises:

Coarse-grained clustering module: automatically clusters the five-tuple data set (formula image BDA0000449500660000061) by the coarse-grained clustering attributes to obtain the coarse-grained cluster set C; if the coarse-grained cluster set C belongs to only one of the network samples, the URI signature is generated by the URI signature generation module;

Fine-grained clustering module: on the basis of the coarse-grained cluster set C, automatically clusters all sessions in each c_i (c_i ∈ C) by the fine-grained clustering attributes to obtain the fine-grained cluster set C′ (C′ ∈ C_i);

Sample coverage check module: if there is a fine-grained cluster c_i′ (c_i′ ∈ C′) whose sessions all come from k samples, where k is greater than 1 and less than or equal to the number of network samples, the fine-grained clustering is considered successful; otherwise, the URI signature is generated by the URI signature generation module.

In the above system for automatically generating HTTP network signatures, the question-and-answer packet signature generation module further comprises:

HTTP signature set generation module: for all session connections in each fine-grained cluster c_i′ (c_i′ ∈ C′), generates signatures for the request packets and for the response packets separately, automatically computing token signatures in turn, so that each fine-grained cluster c_i′ finally yields one request-packet signature and one response-packet signature; together these form the HTTP signature set W;

Signature filtering module: filters the HTTP signature set W, removing unqualified signatures and merging duplicate signatures, to obtain the question-and-answer packet signature set (formula image BDA0000449500660000062).

Compared with the prior art, the invention builds on two observations: HTTP botnet command and control traffic is statistically similar, and the question-and-answer packets contain most of a botnet's characteristic information. On this basis it proposes a method for automatically generating HTTP botnet signatures based on question-and-answer packets. The method extracts the question-and-answer packets and the related statistical features from a host's HTTP traffic, performs two-stage clustering of the HTTP data with the X-means clustering algorithm, and generates signatures using the longest common subsequence algorithm and a URI-based signature method.

The invention has the following beneficial effects:

1. Communication signatures of HTTP botnets can be extracted automatically;

2. Signature generation is more efficient, with lower time and space overhead;

3. The signature generation system is more robust and adaptable; combined with an intrusion detection system such as Snort, the high-quality signatures it generates enable wide-scale detection of the corresponding botnets.

Description of the drawings

Fig. 1 is a schematic flow chart of the method for automatically generating HTTP network signatures according to the invention;

Fig. 2 is a detailed schematic flow chart of the method for automatically generating HTTP network signatures according to the invention;

Fig. 3 is a schematic structural diagram of the system for automatically generating HTTP network signatures according to the invention.

Reference signs:

1: packet signature generation module
2: URI signature generation module
3: module for generating the total set of HTTP network signatures
11: whitelist filtering module
12: data extraction module
13: two-stage clustering module
14: question-and-answer packet signature generation module
121: question-and-answer packet extraction module
122: coarse-grained clustering attribute extraction module
123: fine-grained clustering attribute extraction module
124: data set aggregation module
131: coarse-grained clustering module
132: fine-grained clustering module
133: sample coverage check module
141: HTTP signature set generation module
142: signature filtering module
S1 to S3, S11 to S14, S121 to S124, S131 to S133, S141 to S142: steps performed in the embodiments of the invention

Detailed description

The invention is described in detail below with reference to the drawings and specific embodiments, which are not intended to limit the invention.

The purpose of the invention is to classify large numbers of HTTP botnet samples and automatically generate the corresponding signatures for detection. Its advantage is that botnet communication signatures can be generated without any prior knowledge, even for botnets whose communication content is encrypted.

Fields of application: 1. the invention provides an efficient method for automatically generating HTTP botnet signatures, enabling wide-scale botnet detection; 2. in botnet research, botnet samples can be classified by network behavior and their signatures extracted automatically.

The invention proposes a method for automatically generating HTTP network signatures that is based on question-and-answer packets and can extract HTTP botnet signatures accurately and automatically. Based on an analysis of the network behavior of a large number of botnet samples, the method takes the question-and-answer packets of each HTTP session connection (the first request and the first response HTTP packets) as the objects of signature extraction and draws on the longest common subsequence (LCS) algorithm to generate high-quality HTTP botnet signatures automatically and efficiently. Based on the similarity of HTTP botnet command and control traffic, the invention designs a system for automatically generating signatures from question-and-answer packets.

As shown in Fig. 1 and Fig. 2, the method for automatically generating network signatures provided by the invention comprises the following steps:

Packet signature generation step S1: from the feature statistics and packet contents extracted from the question-and-answer packets of multiple network samples, generate a coarse-grained cluster set through two-stage clustering, then generate a fine-grained cluster set on the basis of the coarse-grained cluster set, and generate the question-and-answer packet signature set of the network samples (formula image BDA0000449500660000081) from the fine-grained cluster set.

Signature generation targets the Q&A packets. Extensive statistics show that botnet command-and-control connections are short-lived, and that most of the valuable feature content in the communication (bot host information, names of requested binary files, attack commands, and so on) is concentrated in the Q&A packets of the HTTP session connection (the first request and the first response HTTP packets). The Q&A packets of HTTP are therefore used as the object of signature generation. This greatly reduces the cost of storing and comparing packets and improves the efficiency of signature generation.

Compared with mainstream signature generation techniques (Polygraph, Autograph, and others), the invention exploits the communication characteristics of HTTP botnets and computes over the Q&A packets only, rather than over all HTTP packets. Relative to the traditional methods, this improves the efficiency of signature generation and reduces both computation time and storage overhead.

The invention uses efficient two-stage clustering. The classic X-means algorithm is applied twice: a coarse-grained clustering over the statistical characteristics of each sample's data flows, and a fine-grained clustering over the Q&A packets of each session.

In coarse-grained clustering the unit is the sample. The four clustering attributes are the total number of HTTP data flows, the number of bytes sent per second, the average HTTP packet size, and the total number of HTTP packets. This stage groups samples with similar network behavior together (they are assumed to belong to the same type of botnet).

In fine-grained clustering the unit is the HTTP session connection of a sample. Within each coarse-grained cluster, all session connections are clustered on four attributes: the number of request packets in the session, the number of response packets in the session, the size of the first request packet, and the size of the first response packet. This stage groups similar packets together so that high-quality signatures can be generated.

Two-stage clustering groups packets with similar content conveniently and effectively without any knowledge of the communication content, and avoids a large number of tedious packet-to-packet comparisons.

The coarse-grained and fine-grained two-stage clustering of the invention quickly places packets with similar statistical features in the same cluster and so speeds up signature generation. The partitioning needs no prior knowledge, does not depend on specific content, and avoids the time cost of pairwise comparison across a large number of packets.

URI signature generation step S2: for traffic in the network samples that has been placed in a cluster of its own, perform a supplementary extraction of URI path and parameter signatures to generate the URI signature set.

When clustering the traffic of many samples, it often happens that the traffic of one or a few samples ends up alone in a cluster. For this case step S2 applies a supplementary measure: the URI in the start line of the sample's request packets is analyzed, and its path and request parameters are extracted as the signature of the sample. This improves the robustness and adaptability of the signature extraction system to some extent.

Sample data is sent to the URI signature generation step S2 in three cases: the sample forms a cluster on its own, fine-grained clustering fails, or generation of the Q&A packet signature fails. Step S2 extracts signatures from the URI in the start line of the HTTP request packets, using the path (which ends at the first "?") and the parameters (the names of the parameters submitted in the URI). Working sample by sample, all request packets of the sample are inspected and the path and parameter set of each start line are extracted. For example, for a packet whose start line is GET /weather/getweather.aspx?t=1377511384901&cityno= HTTP/1.1, the extracted path is /weather/getweather.aspx and the parameters are t and cityno. The token signature is written /weather/getweather.aspx.*t.*cityno. The result is the URI signature set of these samples (set symbol: Figure BDA0000449500660000092).
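The URI extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names `uri_signature` and `sample_uri_signatures` are hypothetical, the start line is assumed to be available as a decoded string, and the path and parameter names are joined verbatim with `.*` exactly as in the example from the text (no regex escaping is applied).

```python
def uri_signature(start_line: str) -> str:
    """Build a token signature 'path.*param1.*param2' from a request start line."""
    parts = start_line.split()
    if len(parts) < 2:
        raise ValueError("not an HTTP request start line")
    target = parts[1]                       # request target: path plus optional query
    path, _, query = target.partition("?")  # the path ends at the first '?'
    # Keep only the parameter names, in the order they appear in the URI.
    names = [pair.split("=", 1)[0] for pair in query.split("&") if pair]
    return ".*".join([path] + names)


def sample_uri_signatures(start_lines):
    """Return the de-duplicated URI signatures of all request packets of one sample."""
    return sorted({uri_signature(line) for line in start_lines})


if __name__ == "__main__":
    line = "GET /weather/getweather.aspx?t=1377511384901&cityno= HTTP/1.1"
    print(uri_signature(line))
```

Running the sketch on the start line from the text yields the same token signature as the description, /weather/getweather.aspx.*t.*cityno.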

The URI path and parameter extraction introduced by the invention handles the case where single-sample clustering makes traditional signature extraction methods fail, and improves the robustness and adaptability of the system to some extent.

HTTP network signature total set generation step S3: merge the Q&A packet signature set (Figure BDA0000449500660000093) and the URI signature set (Figure BDA0000449500660000094) to generate the total signature set T_all.

The Q&A packet signature set and the URI signature set (Figure BDA0000449500660000096) are merged to obtain the final signature set T_all. In addition, samples that lie in the same coarse-grained cluster and share a common "representative fine-grained cluster" are considered to belong to the same type of botnet.

The packet signature generation step S1 further comprises:

Whitelist filtering step S11: filter out traffic that accesses legitimate websites.

The HTTP data of the botnet samples first enters the "whitelist filtering module". To evade detection, botnet controllers may mix legitimate requests (for example, visits to Google or Baidu) into the command-and-control traffic in order to interfere with detection and signature generation. So that signature quality is not affected, HTTP traffic to legitimate websites is filtered out according to an authoritative third-party website ranking (for example, the Alexa top 500), and the filtered HTTP data is handed to the "data extraction module" for processing.
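The whitelist filtering step can be sketched as follows. This is a minimal illustration under stated assumptions: requests are available as raw bytes, the whitelist is a plain set of domain names taken from some third-party ranking, and a request is dropped when its Host header equals a whitelisted domain or is a subdomain of one. The names `host_of`, `is_whitelisted`, and `filter_requests` are hypothetical and do not come from the patent.

```python
def host_of(request: bytes) -> str:
    """Return the lower-cased Host header value of a raw HTTP request, or ''."""
    for line in request.split(b"\r\n")[1:]:  # skip the start line
        if not line:
            break                            # blank line: end of headers
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"host":
            return value.strip().decode("ascii", "replace").lower()
    return ""


def is_whitelisted(host: str, whitelist: set) -> bool:
    """True if the host, or any parent domain of it, is in the whitelist."""
    labels = host.split(":")[0].split(".")   # drop an optional port, split into labels
    return any(".".join(labels[i:]) in whitelist for i in range(len(labels)))


def filter_requests(requests, whitelist):
    """Keep only the requests that do not target a whitelisted site."""
    return [r for r in requests if not is_whitelisted(host_of(r), whitelist)]
```

Matching on whole domain labels rather than on substrings is a design choice of this sketch: it keeps a host such as evilgoogle.com from being treated as google.com.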

Data extraction step S12: extract the data flow feature statistics of the network samples and the contents of the Q&A packets.

Two-stage clustering step S13: cluster on the network sample feature statistics and on the Q&A packets respectively, first generating the coarse-grained cluster set and then, on that basis, the fine-grained cluster set.

Q&A packet signature generation step S14: from the fine-grained cluster set, generate the signature sets of the request packets and of the response packets respectively.

The data extraction step S12 further comprises:

Data content extraction step S121: extract the contents of the Q&A packets of each HTTP session connection.

Coarse-grained clustering attribute extraction step S122: taking the network sample as the unit, extract the four statistics used for coarse-grained clustering, namely the total number of HTTP data flows, the number of bytes sent per second, the average HTTP packet size, and the total number of HTTP packets, to obtain the coarse-grained clustering attributes.

Fine-grained clustering attribute extraction step S123: taking each HTTP session as the unit, extract the four statistics used for fine-grained clustering, namely the number of request packets in the session, the number of response packets in the session, the size of the first request packet, and the size of the first response packet, to obtain the fine-grained clustering attributes.

Data set aggregation step S124: combine the contents of the Q&A packets, the coarse-grained clustering attributes, and the fine-grained clustering attributes into a five-tuple data set. The five-tuple has the format <sample id, session id, Q&A packet contents, coarse-grained clustering attributes, fine-grained clustering attributes>.

In the data extraction step S12, data flow feature statistics and packet contents are extracted from the HTTP data of each sample. The work falls into three parts: (1) extract the contents of the Q&A packets (the first request and the first response HTTP packets) of each HTTP session connection; (2) taking the network sample as the unit, extract the four coarse-grained clustering statistics: total number of HTTP data flows, bytes sent per second, average HTTP packet size, and total number of HTTP packets; (3) taking the session connection as the unit, extract the four fine-grained clustering statistics: number of request packets in the session, number of response packets in the session, size of the first request packet, and size of the first response packet. The three parts can run concurrently. The result is a five-tuple data set (Figure BDA0000449500660000111) in the format <sample id, session id, Q&A packet contents, coarse-grained clustering attributes, fine-grained clustering attributes>. The "sample id" uniquely identifies a botnet sample (a data source); it does not denote the type of botnet. For example, if hosts A and B on the same local network are controlled by the same botnet, their sample ids are still different. The session id uniquely identifies one HTTP session connection within the sample data. After extraction, the five-tuple data set (Figure BDA0000449500660000112) is passed to the two-stage clustering step S13.
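The construction of the five-tuple data set can be sketched as below. This is an illustrative sketch only: the `Packet` record and the function `build_five_tuples` are hypothetical, the packets are assumed to be already reassembled and labeled with a sample id, session id, and direction, and "bytes sent per second" is approximated here as total payload bytes divided by the time between the first and last packet of the sample, which is an assumption the patent text does not spell out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Packet:
    sample_id: str
    session_id: str
    is_request: bool   # True for a request packet, False for a response packet
    timestamp: float   # seconds
    payload: bytes


def build_five_tuples(packets):
    """Return rows of (sample id, session id, Q&A contents, coarse attrs, fine attrs)."""
    by_sample = defaultdict(list)
    for p in packets:
        by_sample[p.sample_id].append(p)

    rows = []
    for sample_id, pkts in by_sample.items():
        pkts.sort(key=lambda p: p.timestamp)
        sessions = defaultdict(list)
        for p in pkts:
            sessions[p.session_id].append(p)

        # Coarse-grained attributes, one vector per sample.
        total_bytes = sum(len(p.payload) for p in pkts)
        duration = pkts[-1].timestamp - pkts[0].timestamp
        coarse = (
            len(sessions),                                            # HTTP flows
            total_bytes / duration if duration > 0 else float(total_bytes),
            total_bytes / len(pkts),                                  # mean packet size
            len(pkts),                                                # HTTP packets
        )

        # Fine-grained attributes, one vector per session.
        for session_id, sp in sessions.items():
            reqs = [p for p in sp if p.is_request]
            resps = [p for p in sp if not p.is_request]
            first_req = reqs[0].payload if reqs else b""
            first_resp = resps[0].payload if resps else b""
            fine = (len(reqs), len(resps), len(first_req), len(first_resp))
            rows.append((sample_id, session_id, (first_req, first_resp), coarse, fine))
    return rows
```

Every row of one sample carries the same coarse-grained vector, so the coarse stage can cluster samples and the fine stage can cluster sessions from the same data set.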

The two-stage clustering step S13 further comprises:

Coarse-grained clustering step S131: automatically cluster the five-tuple data set (Figure BDA0000449500660000113) on the coarse-grained clustering attributes to obtain the coarse-grained cluster set C; if a coarse-grained cluster contains only one network sample, execute the URI signature generation step S2 for it.

Fine-grained clustering step S132: on the basis of the coarse-grained cluster set C, for all sessions in each c_i (c_i ∈ C), automatically cluster according to the fine-grained clustering attributes to obtain the fine-grained cluster set C′ of c_i.

Sample coverage judgment step S133: if there is a fine-grained cluster c_i′ (c_i′ ∈ C′) whose sessions come from k samples, where k is greater than 1 and no greater than the number of network samples, the fine-grained clustering is considered successful; otherwise the URI signature generation step S2 is executed.

First, the data set (Figure BDA0000449500660000114) is clustered at coarse granularity. The clustering algorithm is the publicly available X-means algorithm. Samples are clustered on the four coarse-grained attribute values (total number of HTTP data flows, bytes sent per second, average HTTP packet size, total number of HTTP packets), which yields the coarse-grained cluster set C. Clusters that contain only a single sample are removed, and their five-tuple data (Figure BDA0000449500660000115) is passed to the URI signature generation step S2.

Then, taking each cluster c_i (c_i ∈ C) as the unit, all session connections of all samples in the coarse-grained cluster are clustered on the four fine-grained attribute values (number of request packets in the session, number of response packets in the session, size of the first request packet, size of the first response packet), again with X-means. Each coarse-grained cluster thus produces a new fine-grained cluster set C′.

The origin of the session connections in each fine-grained cluster of C′ is then checked. Let c_i′ ∈ C′. If the session connections in c_i′ come from at least k different samples, the fine-grained cluster c_i′ meets the requirement, and the corresponding samples have a "representative cluster". Here k is greater than 1, no greater than the number of samples in the coarse-grained cluster c_i, and no greater than the number of network samples; its specific value can be set freely. Otherwise the fine-grained cluster does not cover enough samples and is not representative.

If one or more samples in a coarse-grained cluster c_i have no representative fine-grained cluster at all (that is, no cluster containing their sessions covers enough samples), fine-grained clustering has failed for those samples: no sufficiently large group of similar samples was found for them, and the data sets of those samples are passed to the "URI signature generation module". The fine-grained clusters c_i′ that meet the requirement go on to the Q&A packet signature generation step S14.
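The sample coverage check can be sketched independently of the clustering algorithm. The sketch below assumes that fine-grained clustering inside one coarse-grained cluster has already produced a cluster label for every session; the function name `check_coverage` and the input format (pairs of sample id and fine-grained cluster label) are hypothetical choices for illustration.

```python
from collections import defaultdict


def check_coverage(assignments, k=2):
    """Split fine-grained clusters into representative ones and find failed samples.

    assignments: iterable of (sample_id, fine_cluster_label), one pair per session
                 of a single coarse-grained cluster.
    k:           minimum number of distinct samples a cluster must cover (k > 1).
    Returns (representative_labels, failed_sample_ids).
    """
    samples_in = defaultdict(set)   # cluster label -> sample ids seen in that cluster
    all_samples = set()
    for sample_id, label in assignments:
        samples_in[label].add(sample_id)
        all_samples.add(sample_id)

    representative = {lbl for lbl, s in samples_in.items() if len(s) >= k}
    covered = set().union(*(samples_in[lbl] for lbl in representative))
    # Samples with no representative cluster fall back to URI signature generation.
    return representative, all_samples - covered
```

With sessions from samples s1 and s2 in cluster 0, another s1 session in cluster 1, and a lone s3 session in cluster 2, a threshold of k = 2 marks cluster 0 as representative and sends s3 to the URI fallback.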

The Q&A packet signature generation step S14 further comprises:

HTTP signature set generation step S141: for all session connections in each fine-grained cluster c_i′ (c_i′ ∈ C′), generate signatures for the request packets and for the response packets separately, computing the token signature automatically one packet after another, so that each fine-grained cluster c_i′ ends up with one request packet signature and one response packet signature; together these form the HTTP signature set W.

Signatures are generated for all session connections in each fine-grained cluster c_i′, separately for the request packets and the response packets of the Q&A pairs. The longest common subsequence (LCS) algorithm is used as the generation algorithm and produces a token signature of the form t1.*t2.*t3.*t4, where each ti is a common substring and .* is a separator indicating that non-matching characters lie between the common substrings before and after it. The comparison proceeds as follows. Suppose there are four session connections a, b, c, and d. The request packets of a and b are first run through LCS to obtain the token signature t. All .* separators are removed from t to turn it into plain text, which is then compared with the request packet of c to obtain the token signature s. In the same way s is turned into plain text and compared with the request packet of d to obtain the final request packet signature w. Response packet signatures are computed the same way. Each fine-grained cluster c_i′ thus yields one request packet signature and one response packet signature. These signatures are collected, tagged with the ids of the samples involved, and each coarse-grained cluster c_i ends up with a signature set W.
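The folding procedure described above can be sketched with a textbook dynamic-programming LCS. This is a hedged illustration rather than the patented code: the names `lcs_tokens` and `token_signature` are hypothetical, payloads are assumed to be decoded strings, and a new token is started whenever a matched character is not adjacent to the previous match in both inputs.

```python
def lcs_tokens(a: str, b: str) -> list:
    """Return the LCS of a and b, split into maximal runs contiguous in both strings."""
    n, m = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    tokens, cur, prev = [], [], None
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            # A gap in either string since the last match closes the current token.
            if cur and prev != (i - 1, j - 1):
                tokens.append("".join(cur))
                cur = []
            cur.append(a[i])
            prev = (i, j)
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    if cur:
        tokens.append("".join(cur))
    return tokens


def token_signature(payloads) -> str:
    """Fold LCS over the payloads: sig(a, b) -> strip '.*' -> compare with c -> ..."""
    payloads = list(payloads)
    if not payloads:
        raise ValueError("at least one payload is required")
    text = payloads[0]
    tokens = [text]
    for p in payloads[1:]:
        tokens = lcs_tokens(text, p)
        text = "".join(tokens)      # the intermediate signature with '.*' removed
    return ".*".join(tokens)
```

Because the intermediate signature is flattened to plain text before the next comparison, as the description specifies, the positions of earlier gaps are not carried forward; only the gaps found in the last comparison appear in the final token signature.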

Signature filtering step S142: filter the HTTP signature set W, removing unqualified signatures and merging duplicate signatures, to obtain the Q&A packet signature set.

The generated Q&A packet signature set W is filtered as follows. First, common substrings ti that are too short (for example, shorter than 4 characters) are deleted from the token signatures. Next, the common substrings contained in the token signatures are filtered: HTTP header fields and partial contents that are common and frequently appear in legitimate packets (for example HTTP/1.1, Cache-Control: no-cache) are removed. Finally, duplicate token signatures are merged, which gives the final Q&A packet botnet signature set (Figure BDA0000449500660000132). During filtering, all signatures of a sample may be deleted because they do not meet the requirements (its tokens are too short or are all filtered out). Such a sample is regarded as having failed Q&A signature generation and is likewise passed to the URI signature generation step S2.
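The three filtering passes can be sketched as follows. The sketch is illustrative and rests on assumptions: signatures are the `.*`-joined strings described above, the stop list `COMMON_TOKENS` contains only the two examples named in the text, and a token is treated as "common" when it equals a stop-list entry after surrounding whitespace is stripped. The name `filter_signatures` is hypothetical.

```python
# Illustrative stop list; a real deployment would use a much longer one.
COMMON_TOKENS = {"HTTP/1.1", "Cache-Control: no-cache"}


def filter_signatures(signatures, min_len=4, common=COMMON_TOKENS):
    """Drop short and common tokens, merge duplicates, and report failures.

    Returns (kept, failed): cleaned unique signatures in first-seen order, and the
    original signatures that lost all of their tokens.
    """
    kept, failed, seen = [], [], set()
    for sig in signatures:
        tokens = [
            t for t in sig.split(".*")
            if len(t) >= min_len and t.strip() not in common
        ]
        if not tokens:
            failed.append(sig)      # candidate for the URI signature fallback
            continue
        cleaned = ".*".join(tokens)
        if cleaned not in seen:     # merge duplicate signatures
            seen.add(cleaned)
            kept.append(cleaned)
    return kept, failed
```

A signature such as GET.*HTTP/1.1 loses both tokens (one is too short, the other is on the stop list) and is therefore reported as failed rather than silently dropped.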

The invention generates signatures automatically, and the generated signatures are of high quality. They can be combined with intrusion detection systems such as Snort to detect the corresponding botnets on a wide scale.

The invention also provides a system for automatically generating HTTP network signatures. The system can be deployed on its own on a server or host (for example, a honeypot host) to capture all HTTP data generated by botnet samples. Alternatively, it can be deployed at the gateway of a designated network and linked with a botnet detection system at the network boundary, reading the botnet HTTP data stored in the back-end database of that detection system.

A system for automatically generating HTTP network signatures, as shown in Figure 3, comprises a packet signature generation module 1, a URI signature generation module 2, and an HTTP network signature total set generation module 3.

Packet signature generation module 1: from the feature statistics and packet contents extracted from the Q&A packets of multiple network samples, generates a coarse-grained cluster set by two-stage clustering, then clusters again on top of the coarse-grained cluster set to generate a fine-grained cluster set, and uses the fine-grained cluster set to generate the Q&A packet signature set of the network samples.

URI signature generation module 2: for traffic in the network samples that has been placed in a cluster of its own, performs a supplementary extraction of URI path and parameter signatures to generate the URI signature set (Figure BDA0000449500660000134).

HTTP network signature total set generation module 3: merges the Q&A packet signature set (Figure BDA0000449500660000135) and the URI signature set (Figure BDA0000449500660000136) to generate the total signature set T_all.

The packet signature generation module 1 comprises:

Whitelist filtering module 11: filters out traffic that accesses legitimate websites.

Data extraction module 12: extracts the data flow feature statistics of the network samples and the contents of the Q&A packets.

Two-stage clustering module 13: clusters on the network sample feature statistics and on the Q&A packets respectively, first generating the coarse-grained cluster set and then, on that basis, the fine-grained cluster set.

Q&A packet signature generation module 14: from the fine-grained cluster set, generates the signature sets of the request packets and of the response packets respectively.

The data extraction module 12 further comprises:

Data content extraction module 121: extracts the contents of the Q&A packets of each HTTP session connection.

Coarse-grained clustering attribute extraction module 122: taking the network sample as the unit, extracts the four statistics used for coarse-grained clustering, namely the total number of HTTP data flows, the number of bytes sent per second, the average HTTP packet size, and the total number of HTTP packets, to obtain the coarse-grained clustering attributes.

Fine-grained clustering attribute extraction module 123: taking each HTTP session as the unit, extracts the four statistics used for fine-grained clustering, namely the number of request packets in the session, the number of response packets in the session, the size of the first request packet, and the size of the first response packet, to obtain the fine-grained clustering attributes.

Data set aggregation module 124: combines the contents of the Q&A packets, the coarse-grained clustering attributes, and the fine-grained clustering attributes into a five-tuple data set. The five-tuple has the format <sample id, session id, Q&A packet contents, coarse-grained clustering attributes, fine-grained clustering attributes>.

The two-stage clustering module 13 further comprises:

Coarse-grained clustering module 131: automatically clusters the five-tuple data set (Figure BDA0000449500660000141) on the coarse-grained clustering attributes to obtain the coarse-grained cluster set C; if a coarse-grained cluster contains only one network sample, the URI signature is generated by the URI signature generation module.

Fine-grained clustering module 132: on the basis of the coarse-grained cluster set C, for all sessions in each c_i (c_i ∈ C), automatically clusters according to the fine-grained clustering attributes to obtain the fine-grained cluster set C′ of c_i.

Sample coverage judgment module 133: if there is a fine-grained cluster c_i′ (c_i′ ∈ C′) whose sessions come from k samples, where k is greater than 1 and no greater than the number of network samples, the fine-grained clustering is considered successful; otherwise the URI signature is generated by the URI signature generation module.

The Q&A packet signature generation module 14 further comprises:

HTTP signature set generation module 141: for all session connections in each fine-grained cluster c_i′ (c_i′ ∈ C′), generates signatures for the request packets and for the response packets separately, computing the token signature automatically one packet after another, so that each fine-grained cluster c_i′ ends up with one request packet signature and one response packet signature; together these form the HTTP signature set W.

Signature filtering module 142: filters the HTTP signature set W, removing unqualified signatures and merging duplicate signatures, to obtain the Q&A packet signature set (Figure BDA0000449500660000151).

The invention may of course have various other embodiments. Without departing from the spirit and essence of the invention, those skilled in the art can make corresponding changes and variations according to the invention, and all such changes and variations shall fall within the scope of protection of the claims appended to the invention.

Claims (12)

1.一种HTTP网络特征码自动生成方法,其特征在于,所述方法包括:1. an HTTP network feature code automatic generation method, is characterized in that, described method comprises: 包特征码生成步骤:针对多个网络样本的一问一答包提取出的特征统计和包内容,通过二次聚类生成粗粒度聚类集,进而在所述粗粒度聚类集的基础上二次聚类生成细粒度聚类集,通过所述细粒度聚类集生成所述网络样本的一问一答包特征码集合
Figure FDA0000449500650000011
Packet feature code generation step: for the feature statistics and packet content extracted from the one-question-one-answer packets of multiple network samples, generate a coarse-grained cluster set through secondary clustering, and then based on the coarse-grained cluster set Secondary clustering generates a fine-grained clustering set, and generates a Q&A packet feature code set of the network sample through the fine-grained clustering set
Figure FDA0000449500650000011
URI特征码生成步骤:针对所述网络样本中被划分为单独一类的流量,进行URI路径及参数特征码的补充提取,生成所述URI的特征码集合 URI feature code generation step: for the traffic that is divided into a single category in the network sample, perform supplementary extraction of URI path and parameter feature codes, and generate a feature code set of the URI HTTP网络特征码总集合生成步骤:通过所述一问一答包特征码集合
Figure FDA0000449500650000013
和所述URI的特征码集合
Figure FDA0000449500650000014
合并生成特征码总集合Tall
The generation step of the total collection of HTTP network signatures: through the collection of the one-question-and-answer packet signatures
Figure FDA0000449500650000013
and the set of signatures for the URI
Figure FDA0000449500650000014
Combined to generate a total set of feature codes T all .
2. The method for automatically generating HTTP network feature codes according to claim 1, characterized in that the packet feature code generation step comprises:
Data extraction step: extracting the data-flow feature statistics and the question-and-answer packet content of the network samples;
Secondary clustering step: clustering in two passes according to the feature statistics of the network samples and the question-and-answer packet content, respectively, so that the fine-grained cluster set is generated on the basis of the coarse-grained cluster set;
Question-and-answer packet feature code generation step: generating, from the fine-grained cluster set, a feature code set for the request packets and a feature code set for the response packets.
3. The method for automatically generating HTTP network feature codes according to claim 2, characterized in that the data extraction step is preceded by:
Whitelist filtering step: filtering out of the network samples the traffic that accesses legitimate websites.
4. The method for automatically generating HTTP network feature codes according to claim 2, characterized in that the data extraction step further comprises:
Data content extraction step: extracting the content of the question-and-answer packet of each HTTP session connection;
Coarse-grained clustering attribute extraction step: taking each network sample as a unit, extracting the four-dimensional statistics used for coarse-grained clustering, namely the total number of HTTP data flows, the number of bytes sent per second, the average HTTP packet size, and the total number of HTTP packets, to obtain the coarse-grained clustering attributes;
Fine-grained clustering attribute extraction step: taking each HTTP session as a unit, extracting the four-dimensional statistics used for fine-grained clustering, namely the number of request packets in the session, the number of response packets in the session, the size of the first request packet, and the size of the first response packet, to obtain the fine-grained clustering attributes;
Data set aggregation step: combining the content of the question-and-answer packets, the coarse-grained clustering attributes, and the fine-grained clustering attributes into a five-tuple data set, the format of each five-tuple being: <sample id, session id, question-and-answer packet content, coarse-grained clustering attributes, fine-grained clustering attributes>.
5. The method for automatically generating HTTP network feature codes according to claim 2, characterized in that the secondary clustering step further comprises:
Coarse-grained clustering step: automatically clustering the five-tuple data set [Figure FDA0000449500650000022] by the coarse-grained clustering attributes to obtain a coarse-grained cluster set C; if the coarse-grained cluster set C belongs to only one of the network samples, executing the URI feature code generation step;
Fine-grained clustering step: on the basis of the coarse-grained cluster set C, automatically clustering all sessions in each c_i (c_i ∈ C) by the fine-grained clustering attributes to obtain a fine-grained cluster set C′ (C′ ∈ C_i);
Sample coverage judgment step: if there exists a fine-grained cluster c_i′ (c_i′ ∈ C′) whose sessions all originate from k samples, where k is greater than 1 and less than or equal to the number of network samples, the fine-grained clustering is considered successful; otherwise the URI feature code generation step is executed.
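The two-pass clustering and the sample coverage judgment of claim 5 can be illustrated with a minimal Python sketch. The claims do not name a clustering algorithm, so a simple greedy "leader" clustering with a distance radius stands in for it here; the function names, the radius parameters, the record layout, and the per-cluster reading of the coverage test are all assumptions made for illustration only.

```python
from math import dist


def leader_cluster(points, radius):
    """Greedy leader clustering (hypothetical stand-in): a point joins the
    first cluster whose leader is within `radius`, else starts a new one.
    Returns a list of clusters, each a list of indices into `points`."""
    leaders, clusters = [], []
    for idx, p in enumerate(points):
        for k, leader in enumerate(leaders):
            if dist(p, leader) <= radius:
                clusters[k].append(idx)
                break
        else:
            leaders.append(p)
            clusters.append([idx])
    return clusters


def two_level_cluster(records, coarse_radius, fine_radius):
    """records: list of (sample_id, session_id, coarse_attrs, fine_attrs).
    Returns (accepted, fallback): fine-grained clusters that pass the
    coverage test, and sessions left for URI feature code generation."""
    # Coarse attributes are per sample, so cluster samples, not sessions.
    coarse_by_sample = {}
    for sample_id, _, coarse, _ in records:
        coarse_by_sample.setdefault(sample_id, coarse)
    sample_ids = list(coarse_by_sample)
    coarse_clusters = leader_cluster(
        [coarse_by_sample[s] for s in sample_ids], coarse_radius)

    accepted, fallback = [], []
    for members in coarse_clusters:
        samples = {sample_ids[i] for i in members}
        sessions = [r for r in records if r[0] in samples]
        if len(samples) == 1:
            # Coarse cluster covers a single sample: hand over to the URI step.
            fallback.extend(sessions)
            continue
        for fine in leader_cluster([r[3] for r in sessions], fine_radius):
            group = [sessions[i] for i in fine]
            k = len({r[0] for r in group})  # number of samples covered
            if k > 1:
                accepted.append(group)
            else:
                fallback.extend(group)
    return accepted, fallback
```

With this layout, a fine-grained cluster is kept only when its sessions come from more than one sample; everything else is routed to the URI feature code generation step.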
6. The method for automatically generating HTTP network feature codes according to claim 2, characterized in that the question-and-answer packet feature code generation step further comprises:
HTTP feature code set generation step: for all session connections in each fine-grained cluster c_i′ (c_i′ ∈ C′), generating feature codes for the request packets and for the response packets separately by automatically computing token feature codes in turn, so that each fine-grained cluster c_i′ finally yields one request-packet feature code and one response-packet feature code, which together form the HTTP feature code set W;
Feature code filtering step: filtering the HTTP feature code set W, removing unqualified feature codes and merging duplicate feature codes, to obtain the question-and-answer packet feature code set.
7. A system for automatically generating HTTP network feature codes, using the method for automatically generating network feature codes according to any one of claims 1 to 6, characterized in that the system comprises:
Packet feature code generation module: configured to take the feature statistics and packet content extracted from the question-and-answer packets of a plurality of network samples, generate a coarse-grained cluster set through secondary clustering, then generate a fine-grained cluster set on the basis of the coarse-grained cluster set, and generate from the fine-grained cluster set the question-and-answer packet feature code set [Figure FDA0000449500650000024] of the network samples;
URI feature code generation module: for traffic in the network samples that has been placed in a class of its own, performing supplementary extraction of URI path and parameter feature codes to generate the URI feature code set [Figure FDA0000449500650000031];
HTTP network feature code total set generation module: merging the question-and-answer packet feature code set [Figure FDA0000449500650000032] and the URI feature code set [Figure FDA0000449500650000033] to generate the total feature code set T_all.
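The token feature code generation, filtering, and merging described in claims 6 and 7 can be sketched as follows. The claims state only that token feature codes are computed automatically, so the approach below (keep the tokens that occur in every payload of a cluster, in the order of the first payload) is an assumption, as are the delimiter set, the `min_len` and `min_tokens` thresholds used as the "unqualified" test, and all function names.

```python
import re


def tokens(payload):
    """Split a payload on whitespace and common URI delimiters (assumed set)."""
    return [t for t in re.split(r"[\s/?&=:;,]+", payload) if t]


def token_feature_code(payloads, min_len=3):
    """Ordered tuple of tokens (length >= min_len) present in every payload."""
    if not payloads:
        return ()
    common = set(tokens(payloads[0]))
    for p in payloads[1:]:
        common &= set(tokens(p))
    seen, code = set(), []
    for t in tokens(payloads[0]):  # keep the order of the first payload
        if t in common and len(t) >= min_len and t not in seen:
            seen.add(t)
            code.append(t)
    return tuple(code)


def filter_codes(codes, min_tokens=2):
    """Drop codes with too few tokens and merge duplicates, keeping order."""
    return list(dict.fromkeys(c for c in codes if len(c) >= min_tokens))


def total_set(packet_codes, uri_codes):
    """Merge the packet and URI feature code sets into one set (T_all)."""
    return list(dict.fromkeys([*packet_codes, *uri_codes]))
```

In use, `token_feature_code` would be called once on the request payloads and once on the response payloads of each fine-grained cluster, and the results passed through `filter_codes` before `total_set` combines them with the URI feature codes.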
8. The system for automatically generating HTTP network feature codes according to claim 7, characterized in that the packet feature code generation module comprises:
Whitelist filtering module: filtering out traffic that accesses legitimate websites;
Data extraction module: extracting the data-flow feature statistics and the question-and-answer packet content of the network samples;
Secondary clustering module: clustering in two passes according to the feature statistics of the network samples and the question-and-answer packet content, respectively, so that the fine-grained cluster set is generated on the basis of the coarse-grained cluster set;
Question-and-answer packet feature code generation module: generating, from the fine-grained cluster set, a feature code set for the request packets and a feature code set for the response packets.
9. The system for automatically generating HTTP network feature codes according to claim 8, characterized in that the data extraction module is preceded by:
Whitelist filtering module: filtering out of the network samples the traffic that accesses legitimate websites.
10. The system for automatically generating HTTP network feature codes according to claim 8, characterized in that the data extraction module further comprises:
Data content extraction module: extracting the content of the question-and-answer packet of each HTTP session connection;
Coarse-grained clustering attribute extraction module: taking each network sample as a unit, extracting the four-dimensional statistics used for coarse-grained clustering, namely the total number of HTTP data flows, the number of bytes sent per second, the average HTTP packet size, and the total number of HTTP packets, to obtain the coarse-grained clustering attributes;
Fine-grained clustering attribute extraction module: taking each HTTP session as a unit, extracting the four-dimensional statistics used for fine-grained clustering, namely the number of request packets in the session, the number of response packets in the session, the size of the first request packet, and the size of the first response packet, to obtain the fine-grained clustering attributes;
Data set aggregation module: combining the content of the question-and-answer packets, the coarse-grained clustering attributes, and the fine-grained clustering attributes into a five-tuple data set [Figure FDA0000449500650000034], the format of each five-tuple being: <sample id, session id, question-and-answer packet content, coarse-grained clustering attributes, fine-grained clustering attributes>.
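The attribute extraction and five-tuple aggregation of claim 10 can be sketched in Python as below. The `Packet` layout, the choice of request-direction bytes over the capture duration as "bytes sent per second", and the use of the first request and first response payloads as the question-and-answer content are assumptions made for illustration; the claims do not fix these details.

```python
from typing import NamedTuple


class Packet(NamedTuple):      # hypothetical input record
    direction: str             # "req" or "resp"
    size: int                  # bytes
    ts: float                  # capture time in seconds
    payload: str


class FiveTuple(NamedTuple):
    sample_id: str
    session_id: str
    qa_content: tuple          # (first request payload, first response payload)
    coarse: tuple              # per-sample four-dimensional statistics
    fine: tuple                # per-session four-dimensional statistics


def coarse_attrs(sessions):
    """(flow count, bytes sent per second, mean packet size, packet count)."""
    packets = [p for pkts in sessions.values() for p in pkts]
    if not packets:
        return (len(sessions), 0.0, 0.0, 0)
    duration = max(p.ts for p in packets) - min(p.ts for p in packets)
    sent = sum(p.size for p in packets if p.direction == "req")
    rate = sent / duration if duration > 0 else float(sent)
    mean_size = sum(p.size for p in packets) / len(packets)
    return (len(sessions), rate, mean_size, len(packets))


def fine_attrs(pkts):
    """(request count, response count, first request size, first response size)."""
    reqs = [p for p in pkts if p.direction == "req"]
    resps = [p for p in pkts if p.direction == "resp"]
    return (len(reqs), len(resps),
            reqs[0].size if reqs else 0,
            resps[0].size if resps else 0)


def build_dataset(samples):
    """samples: {sample_id: {session_id: [Packet, ...]}} -> list of FiveTuple."""
    rows = []
    for sample_id, sessions in samples.items():
        coarse = coarse_attrs(sessions)
        for session_id, pkts in sessions.items():
            req = next((p.payload for p in pkts if p.direction == "req"), "")
            resp = next((p.payload for p in pkts if p.direction == "resp"), "")
            rows.append(FiveTuple(sample_id, session_id, (req, resp),
                                  coarse, fine_attrs(pkts)))
    return rows
```

Every session of a sample carries the same coarse-grained attributes, which matches the claim's wording that those statistics are taken per network sample while the fine-grained ones are taken per session.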
11. The system for automatically generating HTTP network feature codes according to claim 8, characterized in that the secondary clustering module further comprises:
Coarse-grained clustering module: automatically clustering the five-tuple data set [Figure FDA0000449500650000041] by the coarse-grained clustering attributes to obtain a coarse-grained cluster set C; if the coarse-grained cluster set C belongs to only one of the network samples, the URI feature code is generated by the URI feature code generation module;
Fine-grained clustering module: on the basis of the coarse-grained cluster set C, automatically clustering all sessions in each c_i (c_i ∈ C) by the fine-grained clustering attributes to obtain a fine-grained cluster set C′ (C′ ∈ C_i);
Sample coverage judgment module: if there exists a fine-grained cluster c_i′ (c_i′ ∈ C′) whose sessions all originate from k samples, where k is greater than 1 and less than or equal to the number of network samples, the fine-grained clustering is considered successful; otherwise the URI feature code generation step is executed.
12. The system for automatically generating HTTP network feature codes according to claim 8, characterized in that the question-and-answer packet feature code generation module further comprises:
HTTP feature code set generation module: for all session connections in each fine-grained cluster c_i′ (c_i′ ∈ C′), generating feature codes for the request packets and for the response packets separately by automatically computing token feature codes in turn, so that each fine-grained cluster c_i′ finally yields one request-packet feature code and one response-packet feature code, which together form the HTTP feature code set W;
Feature code filtering module: filtering the HTTP feature code set W, removing unqualified feature codes and merging duplicate feature codes, to obtain the question-and-answer packet feature code set.
CN201310745102.1A 2013-12-30 2013-12-30 A kind of http network condition code automatic generation method and its system Expired - Fee Related CN103746982B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310745102.1A CN103746982B (en) 2013-12-30 2013-12-30 A kind of http network condition code automatic generation method and its system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310745102.1A CN103746982B (en) 2013-12-30 2013-12-30 A kind of http network condition code automatic generation method and its system

Publications (2)

Publication Number Publication Date
CN103746982A true CN103746982A (en) 2014-04-23
CN103746982B CN103746982B (en) 2017-05-31

Family

ID=50503969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310745102.1A Expired - Fee Related CN103746982B (en) 2013-12-30 2013-12-30 A kind of http network condition code automatic generation method and its system

Country Status (1)

Country Link
CN (1) CN103746982B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105099834A (en) * 2015-09-30 2015-11-25 北京华青融天技术有限责任公司 Method and device for self-defining feature code
WO2016110273A1 (en) * 2015-01-09 2016-07-14 北京京东尚科信息技术有限公司 System and method for limiting access request
CN105978897A (en) * 2016-06-28 2016-09-28 南京南瑞继保电气有限公司 Detection method of electricity secondary system botnet
CN107222511A (en) * 2017-07-25 2017-09-29 深信服科技股份有限公司 Detection method and device, computer installation and the readable storage medium storing program for executing of Malware
CN107592312A (en) * 2017-09-18 2018-01-16 济南互信软件有限公司 A kind of malware detection method based on network traffics
CN108287905A (en) * 2018-01-26 2018-07-17 华南理工大学 A kind of extraction of network flow feature and storage method
CN108897990A (en) * 2018-06-06 2018-11-27 东北大学 Interaction feature method for parallel selection towards extensive higher-dimension sequence data
CN109474452A (en) * 2017-12-25 2019-03-15 北京安天网络安全技术有限公司 Method, system and the storage medium on automatic identification B/S Botnet backstage
CN110472031A (en) * 2019-08-13 2019-11-19 北京知道创宇信息技术股份有限公司 A kind of regular expression preparation method, device, electronic equipment and storage medium
CN111182002A (en) * 2020-02-19 2020-05-19 北京亚鸿世纪科技发展有限公司 Zombie network detection device based on HTTP (hyper text transport protocol) first question-answer packet clustering analysis
CN113381996A (en) * 2021-06-08 2021-09-10 中电福富信息科技有限公司 C & C communication attack detection method based on machine learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100162350A1 (en) * 2008-12-24 2010-06-24 Korea Information Security Agency Security system of managing irc and http botnets, and method therefor
CN102333313A (en) * 2011-10-18 2012-01-25 中国科学院计算技术研究所 Mobile botnet signature generation method and mobile botnet detection method
CN103297433A (en) * 2013-05-29 2013-09-11 中国科学院计算技术研究所 HTTP botnet detection method and system based on net data stream
US8555388B1 (en) * 2011-05-24 2013-10-08 Palo Alto Networks, Inc. Heuristic botnet detection
US8561188B1 (en) * 2011-09-30 2013-10-15 Trend Micro, Inc. Command and control channel detection with query string signature
CN103457909A (en) * 2012-05-29 2013-12-18 中国移动通信集团湖南有限公司 Botnet detection method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100162350A1 (en) * 2008-12-24 2010-06-24 Korea Information Security Agency Security system of managing irc and http botnets, and method therefor
US8555388B1 (en) * 2011-05-24 2013-10-08 Palo Alto Networks, Inc. Heuristic botnet detection
US8561188B1 (en) * 2011-09-30 2013-10-15 Trend Micro, Inc. Command and control channel detection with query string signature
CN102333313A (en) * 2011-10-18 2012-01-25 中国科学院计算技术研究所 Mobile botnet signature generation method and mobile botnet detection method
CN103457909A (en) * 2012-05-29 2013-12-18 中国移动通信集团湖南有限公司 Botnet detection method and device
CN103297433A (en) * 2013-05-29 2013-09-11 中国科学院计算技术研究所 HTTP botnet detection method and system based on net data stream

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016110273A1 (en) * 2015-01-09 2016-07-14 北京京东尚科信息技术有限公司 System and method for limiting access request
US10735501B2 (en) 2015-01-09 2020-08-04 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for limiting access request
CN105099834B (en) * 2015-09-30 2018-11-13 北京华青融天技术有限责任公司 A kind of method and apparatus of user-defined feature code
CN105099834A (en) * 2015-09-30 2015-11-25 北京华青融天技术有限责任公司 Method and device for self-defining feature code
CN105978897A (en) * 2016-06-28 2016-09-28 南京南瑞继保电气有限公司 Detection method of electricity secondary system botnet
CN107222511A (en) * 2017-07-25 2017-09-29 深信服科技股份有限公司 Detection method and device, computer installation and the readable storage medium storing program for executing of Malware
CN107592312A (en) * 2017-09-18 2018-01-16 济南互信软件有限公司 A kind of malware detection method based on network traffics
CN107592312B (en) * 2017-09-18 2021-04-30 济南互信软件有限公司 Malicious software detection method based on network flow
CN109474452A (en) * 2017-12-25 2019-03-15 北京安天网络安全技术有限公司 Method, system and the storage medium on automatic identification B/S Botnet backstage
CN109474452B (en) * 2017-12-25 2021-09-28 北京安天网络安全技术有限公司 Method, system and storage medium for automatically identifying B/S botnet background
CN108287905B (en) * 2018-01-26 2020-04-21 华南理工大学 A method for extracting and storing network flow features
CN108287905A (en) * 2018-01-26 2018-07-17 华南理工大学 A kind of extraction of network flow feature and storage method
CN108897990A (en) * 2018-06-06 2018-11-27 东北大学 Interaction feature method for parallel selection towards extensive higher-dimension sequence data
CN108897990B (en) * 2018-06-06 2021-10-29 东北大学 Parallel selection of interactive features for large-scale high-dimensional sequence data
CN110472031A (en) * 2019-08-13 2019-11-19 北京知道创宇信息技术股份有限公司 A kind of regular expression preparation method, device, electronic equipment and storage medium
CN111182002A (en) * 2020-02-19 2020-05-19 北京亚鸿世纪科技发展有限公司 Zombie network detection device based on HTTP (hyper text transport protocol) first question-answer packet clustering analysis
CN113381996A (en) * 2021-06-08 2021-09-10 中电福富信息科技有限公司 C & C communication attack detection method based on machine learning

Also Published As

Publication number Publication date
CN103746982B (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN103746982B (en) A kind of http network condition code automatic generation method and its system
US20220086125A1 (en) Aggregating alerts of malicious events for computer security
Lu et al. Clustering botnet communication traffic based on n-gram feature selection
CN103297433B (en) The HTTP Botnet detection method of data flow Network Based and system
CN107770132B (en) A method and device for detecting a domain name generated by an algorithm
CN103957203B (en) A network security defense system
CN102685145A (en) Domain name server (DNS) data packet-based bot-net domain name discovery method
CN105681250A (en) Botnet distributed real-time detection method and system
CN103532957B (en) A kind of long-range shell behavioral values device and method of wooden horse
CN103457909B (en) A kind of Botnet detection method and device
Cai et al. Detecting HTTP botnet with clustering network traffic
Narang et al. PeerShark: flow-clustering and conversation-generation for malicious peer-to-peer traffic identification
CN116938507A (en) A power Internet of Things security defense terminal and its control system
CN114513325A (en) Unstructured P2P botnet detection method and device based on SAW community discovery
Resende et al. HTTP and contact‐based features for Botnet detection
Gao et al. Anomaly traffic detection in IoT security using graph neural networks
Amini et al. Analysis of network traffic flows for centralized botnet detection
CN113242233B (en) A multi-classification botnet detection device
Qiao et al. Mining of attack models in ids alerts from network backbone by a two-stage clustering method
Niu et al. Using XGBoost to discover infected hosts based on HTTP traffic
Rostami et al. Analysis and detection of P2P botnet connections based on node behaviour
Kassim et al. An analysis on bandwidth utilization and traffic pattern for network security management
Barati et al. Features selection for IDS in encrypted traffic using genetic algorithm
CN115834097A (en) HTTPS malware traffic detection system and method based on multi-view
Qin et al. Computer network security protection system based on genetic algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170531

Termination date: 20191230