CN102833240B - Malicious code capture method and system

Info

Publication number: CN102833240B
Application number: CN201210294945.XA
Authority: CN
Inventors: 云晓春; 李书豪; 张永铮; 臧天宁; 王一鹏
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2012-08-17
Filing date: 2012-08-17
Publication date: 2016-02-03
Anticipated expiration: 2032-08-17
Also published as: CN102833240A

Abstract

The invention relates to a malicious code capture method and system. The method comprises: obtaining mail data from multiple email data sources; parsing the mail data, recording as suspicious files those files in the mail data that cannot be excluded at the set false negative rate, and saving the suspicious files to a suspicious file database; and detecting the suspicious files using a malicious code feature database and manual inspection, and saving the suspicious files whose detection result is abnormal to a malicious code sample database. The method and system can be applied in honeypot and honeynet systems, where they broaden the range of objects that can be captured and improve the ability to capture malicious code.

Description

A malicious code capture method and system

Technical Field

The invention relates to the technical field of network information security, and in particular to a malicious code capture method and system.

Background

Malicious code such as network worms, Trojan horses, and botnets keeps emerging and does great harm to network information security. To analyze and detect malicious code more effectively, defenders first need ways to collect large amounts of malicious code from the Internet, and honeypot and honeynet technologies arose to meet this need. A honeypot is a virtual or real host, server, or other intelligent terminal, or a simulated service, that the defender provides to be scanned and intruded upon by attackers in order to obtain the malicious code involved. A honeynet is a network of several interrelated honeypots with a certain topology, and can be regarded as a honeypot system deployed in a large-scale, distributed manner. Generally, a honeypot is not used as a normal host, server, or terminal. It serves mainly to attract intrusions, so that the captured attack information can be analyzed and used to design defense strategies that prevent or reduce the harm done by attackers.

Traditional honeypots fall into three categories: virtual honeypots, virtual machine honeypots, and physical honeypots. A virtual honeypot lures attackers by simulating a network topology, operating system, and network services. It consumes few resources but offers little interaction, so it can only capture low-interaction malicious code such as some network worms. A virtual machine honeypot lures attackers by exposing designed vulnerabilities or weaknesses inside a virtual machine. It saves resources, supports a certain degree of interaction, and can obtain fairly complete attack information, but attackers can easily discover it with virtual machine detection techniques, after which it no longer captures malicious code. A physical honeypot lures attackers by exposing designed vulnerabilities or weaknesses on real devices. It can interact with attackers at a high level and is hard to detect, but it is costly and unsuitable for large-scale deployment.

In honeypot and honeynet technology, one core problem is how to capture more malicious code per unit time, and this problem is closely tied to how malicious code propagates. Propagation methods fall broadly into two classes: exploiting vulnerabilities and exploiting social engineering. Vulnerability-based propagation requires no interaction with the victim, and traditional honeypot technology is mostly designed around it. Social-engineering propagation deceives victims by exploiting weaknesses such as instinctive reactions, curiosity, trust, and greed, and it requires user participation. As network services and applications develop, social-engineering propagation is becoming more diverse and complex. In recent years more and more malicious code has used it, for example Koobface, Stuxnet, and Zeus.

Existing honeypots capture vulnerability-propagated malicious code well, but efficient capture methods are still lacking for some malicious code that propagates through social engineering, especially over the Email network. Such malicious code spreads along the Email network: it pushes trap emails that carry the malicious code, or a means of accessing it, to users' mailboxes, and induces victims either to execute the malicious code in the trap email or to fetch it by the means provided (for example, downloading from a web link) and execute it, thereby compromising the victim's computer. It often then uses the victim's mailbox to send trap emails to the victim's Email contacts, infecting more Email users.

Such malicious code may therefore spread by exploiting the trust relationships among users of the Email network. The Email network is a social network formed by mailbox users through mail contact, and is an important type of complex network. Researchers usually abstract a complex network into a graph for analysis. For the Email network, each user mailbox is represented by a node, and the mail between users and its quantity are represented by edges and weights (if there is no mail between two users, there is no edge between the corresponding nodes). In social networks the average distance is small, the clustering coefficient is large, and node degrees follow an exponential distribution.

However, existing honeypot and honeynet technologies have not fully considered malicious code that propagates through social engineering over the Email network, and have not used the social network characteristics above in their design in order to capture malicious code of more propagation types. Existing honeypot and honeynet technologies therefore cannot capture such malicious code effectively on a large scale.

Summary of the Invention

The technical problem to be solved by the invention is to provide a malicious code capture method and system that improve the ability to capture malicious code propagated through social engineering over the Email network.

To solve this technical problem, the invention provides a malicious code capture method, comprising:

obtaining mail data from multiple email data sources;

parsing the mail data, recording as suspicious files those files in the mail data that cannot be excluded at the set false negative rate, and saving the suspicious files to a suspicious file database;

detecting the suspicious files using a malicious code feature database and manual inspection, and saving the suspicious files whose detection result is abnormal to the malicious code sample database.

Further, the above malicious code capture method may also comprise:

obtaining a malicious code sample from the malicious code sample database, running the sample in a sandbox, recording the feature information of the sample, and saving it to the malicious code feature database.

Further, before obtaining mail data from multiple email data sources, the above malicious code capture method may also comprise:

allocating and optimizing the use of virtual honeypot resources with a selection and deployment algorithm for email-terminal virtual honeypots. The algorithm abstracts the email network over which malicious code propagates into a weighted directed graph model of a social network, composed of nodes and edges and having small-world characteristics, in which a node represents an email account, an edge represents mail exchanged between email accounts, the weight of an edge represents the number of mails exchanged within a given period, the in-degree of a node represents the number of senders to that node within a given period, and the out-degree represents the number of recipients of that node within a given period.

Further, in the above malicious code capture method, obtaining mail data from multiple email data sources may comprise:

obtaining email data source information according to preset first configuration information and extracting the type of each email data source, the type being an automatically registered email terminal, a volunteer email account, or privacy-removed mail data provided by a relevant organization;

if the email data source is an automatically registered email terminal or a volunteer email account, then when the polling period of that data source expires, obtaining its pending mail and writing the mail data of the pending mail to the mass mail raw information database, the mail data comprising the header information of the pending mail's source, summary information generated from that header information, the source body of the pending mail, and the accessible files of the pending mail;

if the email data source is privacy-removed mail data provided by a relevant organization, standardizing the data of that data source, removing the private information of the pending mail, and writing the mail data of the pending mail to the mass mail raw information database, the mail data comprising the header information of the pending mail's source, summary information generated from that header information, the source body of the pending mail, and the accessible files of the pending mail.
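
The dispatch between the two kinds of data source described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Source` class, the `acquire`, `normalise` and `deidentify` helpers, the in-memory `inbox` list standing in for a real mailbox, and the regular expression used for redaction are all hypothetical.

```python
import re
from dataclasses import dataclass, field

ACCOUNT_KINDS = {"auto_registered", "volunteer"}   # polled mail accounts

@dataclass
class Source:
    name: str
    kind: str                       # "auto_registered", "volunteer" or "institution"
    poll_period: float = 300.0      # seconds; only meaningful for account sources
    last_access: float = 0.0
    inbox: list = field(default_factory=list)   # stand-in for a mailbox or data dump

def normalise(raw):
    """Hypothetical standardization: unify line endings only."""
    return raw.replace("\r\n", "\n").strip()

def deidentify(raw):
    """Hypothetical privacy removal: mask anything that looks like an address."""
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "<redacted>", raw)

def acquire(sources, now, store):
    """Append (source name, mail) pairs for every source that is due."""
    for src in sources:
        if src.kind in ACCOUNT_KINDS:
            if now - src.last_access < src.poll_period:
                continue                      # polling period not reached yet
            src.last_access = now
            pending = list(src.inbox)
        elif src.kind == "institution":
            pending = [deidentify(normalise(m)) for m in src.inbox]
        else:
            continue                          # unknown source type: skip
        store.extend((src.name, mail) for mail in pending)
        src.inbox = []
```

In a real deployment `store` would be the mass mail raw information database and the mailbox access would go through a mail protocol client; both are left out here to keep the control flow visible.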

Further, in the above malicious code capture method, the header information of the pending mail's source may comprise the language, encoding type, attachment type, sender IP address information, recipient IP address information, and IP-based mail routing information.

Further, in the above malicious code capture method, the summary information may comprise the email's recipient, sender, subject, body length, whether it has attachments, and the type of email data source.

Further, in the above malicious code capture method, parsing the mail data, recording as suspicious files those files in the mail data that cannot be excluded at the set false negative rate, and saving the suspicious files to the suspicious file database may comprise:

initializing according to preset second configuration information;

parsing the header information and source body of the pending mail, and saving abnormal header information and/or source body to the malicious code sample database;

filtering the accessible files of the pending mail according to the set false negative rate, excluding normal files, and saving the files that cannot be excluded to the suspicious file database as suspicious files.
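
The text does not say how the set false negative rate is turned into a filtering decision. One possible reading, sketched below purely as an assumption, is that each accessible file receives a maliciousness score from some scoring function (not shown and hypothetical), and that the cut-off is calibrated on files already known to be malicious so that at most the configured fraction of them would be excluded as normal. The names `threshold_for_fnr` and `split_files` are illustrative.

```python
import math

def threshold_for_fnr(malicious_scores, max_fnr):
    """Pick a cut-off so that at most max_fnr of known-malicious calibration
    files score strictly below it (and would therefore be excluded)."""
    if not malicious_scores or not 0 <= max_fnr < 1:
        raise ValueError("need calibration scores and 0 <= max_fnr < 1")
    ranked = sorted(malicious_scores)
    allowed_misses = math.floor(max_fnr * len(ranked))
    return ranked[allowed_misses]

def split_files(scored_files, threshold):
    """Files scoring below the cut-off are excluded as normal;
    everything that cannot be excluded is kept as suspicious."""
    suspicious = [name for name, score in scored_files if score >= threshold]
    excluded = [name for name, score in scored_files if score < threshold]
    return suspicious, excluded
```

A lower configured false negative rate pushes the cut-off down, so more files are kept as suspicious and passed on to detection, at the cost of more work in the later stages.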

Further, in the above malicious code capture method, detecting the suspicious files using the malicious code feature database and manual inspection, and saving the suspicious files whose detection result is abnormal to the malicious code sample database, may comprise:

detecting the suspicious files according to the malicious code feature information stored in the malicious code feature database, and saving the suspicious files that contain such feature information to the malicious code sample database;

for suspicious files that cannot be detected from the malicious code feature information, performing manual inspection with a preset expert system, and saving the new feature information produced by suspicious files judged to be malicious code during manual inspection to the malicious code feature database.
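
The two detection stages above can be sketched as a signature check followed by a manual-review queue. This is an illustration only: the feature database is modelled here as a set of SHA-256 digests plus a list of byte patterns, which is an assumption rather than the format the invention prescribes, and `detect` and `record_manual_verdict` are hypothetical names.

```python
import hashlib

def detect(suspicious, signature_db):
    """suspicious: dict of file name -> bytes.
    signature_db: {"hashes": set of hex digests, "patterns": list of bytes}.
    Returns (samples matched by a signature, files left for manual inspection)."""
    samples, manual_queue = {}, {}
    for name, data in suspicious.items():
        digest = hashlib.sha256(data).hexdigest()
        known_hash = digest in signature_db["hashes"]
        known_pattern = any(p in data for p in signature_db["patterns"])
        if known_hash or known_pattern:
            samples[name] = data          # goes to the malicious code sample database
        else:
            manual_queue[name] = data     # cannot be decided from known features
    return samples, manual_queue

def record_manual_verdict(name, data, is_malicious, samples, signature_db):
    """Feed an analyst's verdict back: a confirmed file becomes a sample and
    contributes a new feature (here, simply its digest) to the feature database."""
    if is_malicious:
        samples[name] = data
        signature_db["hashes"].add(hashlib.sha256(data).hexdigest())
```

Because confirmed files extend the feature database, an identical file arriving later is matched automatically and no longer needs manual inspection.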

To solve the above technical problem, the invention also provides a malicious code capture system, comprising:

an acquisition module, configured to obtain mail data from multiple email data sources;

a parsing module, configured to parse the mail data obtained by the acquisition module, record as suspicious files those files in the mail data that cannot be excluded at the set false negative rate, and save the suspicious files to a suspicious file database;

a detection module, configured to detect the suspicious files output by the parsing module using a malicious code feature database and manual inspection, and save the suspicious files whose detection result is abnormal to the malicious code sample database.

Further, the above malicious code capture system may also comprise:

a sandbox module, configured to obtain a malicious code sample from the malicious code sample database, run the sample in a sandbox, record the feature information of the sample, and save it to the malicious code feature database.

Further, the above malicious code capture system may also comprise, connected to the acquisition module:

an algorithm selection module, configured to allocate and optimize the use of virtual honeypot resources with a selection and deployment algorithm for email-terminal virtual honeypots. The algorithm abstracts the email network over which malicious code propagates into a weighted directed graph model of a social network, composed of nodes and edges and having small-world characteristics, in which a node represents an email account, an edge represents mail exchanged between email accounts, the weight of an edge represents the number of mails exchanged within a given period, the in-degree of a node represents the number of senders to that node within a given period, and the out-degree represents the number of recipients of that node within a given period.

Further, in the above malicious code capture system, the acquisition module may comprise:

an extraction unit, configured to obtain email data source information according to preset first configuration information and extract the type of each email data source, the type being an automatically registered email terminal, a volunteer email account, or email data provided by a relevant organization;

a first acquisition unit, configured to, when the email data source is an automatically registered email terminal or a volunteer email account and the polling period of that data source expires, obtain the new mail of that data source and write the mail data of the pending mail to the mass mail raw information database, the mail data comprising the header information of the pending mail's source, summary information generated from that header information, the source body of the pending mail, and the accessible files of the pending mail;

a second acquisition unit, configured to, when the email data source is email data provided by a relevant organization, standardize the data of that data source, remove the private information of the pending mail, and write the mail data of the pending mail to the mass mail raw information database, the mail data comprising the header information of the pending mail's source, summary information generated from that header information, the source body of the pending mail, and the accessible files of the pending mail.

Further, in the above malicious code capture system, the header information of the pending mail's source may comprise the language, encoding type, attachment type, sender IP address information, recipient IP address information, and IP-based mail routing information.

Further, in the above malicious code capture system, the summary information may comprise the email's recipient, sender, subject, body length, whether it has attachments, and the type of email data source.

Further, in the above malicious code capture system, the parsing module may comprise:

a first initialization unit, configured to initialize according to preset second configuration information;

a parsing unit, configured to parse the header information and source body of the pending mail, and save abnormal header information and/or source body to the malicious code sample database;

a filtering unit, configured to filter the accessible files of the pending mail according to the set false negative rate, exclude normal files, and save the files that cannot be excluded to the suspicious file database as suspicious files.

Further, in the above malicious code capture system, the detection module may comprise:

a second initialization unit, configured to initialize according to preset second configuration information;

a first detection unit, configured to detect the suspicious files according to the malicious code feature information stored in the malicious code feature database, and save the suspicious files that contain such feature information to the malicious code sample database;

a second detection unit, configured to, for suspicious files that cannot be detected from the malicious code feature information, perform manual inspection with a preset expert system, and save the new feature information produced by suspicious files judged to be malicious code during manual inspection to the malicious code feature database.

The malicious code capture method and system of the invention can be applied in honeypot and honeynet systems, where they broaden the range of objects that can be captured and improve the ability to capture malicious code.

Brief Description of the Drawings

Fig. 1 is a flowchart of the acquisition step of the malicious code capture method in an embodiment of the invention;

Fig. 2 is a flowchart of the parsing step of the malicious code capture method in an embodiment of the invention;

Fig. 3 is a flowchart of the detection step of the malicious code capture method in an embodiment of the invention;

Fig. 4 is a structural diagram of the malicious code capture system in an embodiment of the invention.

Detailed Description

The principles and features of the invention are described below with reference to the accompanying drawings. The examples given serve only to explain the invention and are not intended to limit its scope.

The invention uses a selection and deployment algorithm for Email-terminal virtual honeypots to allocate and optimize the use of virtual honeypot resources. The algorithm abstracts the Email network over which malicious code propagates into a weighted directed graph model of a social network, composed of nodes and edges and having small-world characteristics, denoted G = <V, E>. A node, denoted v_i, represents an Email account; an edge, denoted e_k, represents mail exchanged between Email accounts; the weight of an edge, denoted w(e_k), represents the number of mails exchanged within a given period; the in-degree of a node, denoted id(v_i), represents the number of senders to that node within a given period; and the out-degree, denoted od(v_i), represents the number of recipients of that node within a given period.

The main idea of the algorithm is as follows: perform cluster analysis on the Email network and find the subnets with a high average clustering coefficient; obtain the Email accounts available in those subnets; compute a composite evaluation index for each account as a weighted combination of three criteria, namely node in-degree, activity (number of mails sent per unit time), and clustering coefficient, with the weights configured and adjusted by the administrator; sort the accounts in descending order of the index; and add a predetermined number of accounts (also configured and adjusted by the administrator) to the virtual honeypot set. For example, based on the attributes of the actual Email network, the administrator may restrict the calculation to nodes with high in-degree, activity, and clustering coefficient, such as terminals in the top 30% on all three criteria.
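
The ranking part of this algorithm can be sketched in a few lines. The sketch below is an illustration under several assumptions: the subnet search by cluster analysis is skipped and the available accounts are simply passed in as `candidates`; each criterion is scaled to [0, 1] by dividing by its maximum before weighting; the clustering coefficient is computed on the undirected projection of the graph; and the default weights and all function names are hypothetical.

```python
from collections import defaultdict

def build_graph(mails):
    """mails: (sender, recipient, count) triples seen in one time window.
    Returns {(sender, recipient): w(e_k)}."""
    weight = defaultdict(int)
    for sender, recipient, count in mails:
        weight[(sender, recipient)] += count
    return dict(weight)

def in_degree(weight):
    """id(v_i): number of distinct senders that mailed v_i."""
    senders = defaultdict(set)
    for s, r in weight:
        senders[r].add(s)
    return {v: len(ss) for v, ss in senders.items()}

def activity(weight):
    """Number of mails each account sent in the window."""
    sent = defaultdict(int)
    for (s, _r), w in weight.items():
        sent[s] += w
    return dict(sent)

def clustering(weight):
    """Local clustering coefficient on the undirected projection."""
    nbrs = defaultdict(set)
    for s, r in weight:
        if s != r:
            nbrs[s].add(r)
            nbrs[r].add(s)
    coeff = {}
    for v, ns in nbrs.items():
        k = len(ns)
        if k < 2:
            coeff[v] = 0.0
            continue
        links = sum(1 for a in ns for b in ns if a < b and b in nbrs[a])
        coeff[v] = 2.0 * links / (k * (k - 1))
    return coeff

def normalise(values, nodes):
    """Scale a criterion to [0, 1] over the candidate nodes."""
    top = max((values.get(v, 0) for v in nodes), default=0)
    return {v: (values.get(v, 0) / top if top else 0.0) for v in nodes}

def select_honeypots(mails, candidates, weights=(0.4, 0.3, 0.3), limit=2):
    """Rank candidates by the weighted index and keep the first `limit`."""
    g = build_graph(mails)
    criteria = (in_degree(g), activity(g), clustering(g))
    metrics = [normalise(m, candidates) for m in criteria]
    score = {v: sum(w * m[v] for w, m in zip(weights, metrics)) for v in candidates}
    ranked = sorted(candidates, key=lambda v: (-score[v], v))
    return ranked[:limit], score
```

The three entries of `weights` play the role of the administrator-configured weights and `limit` the predetermined number of accounts. Ties are broken by account name only to make the ordering deterministic.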

The invention provides a malicious code capture method comprising the following steps:

Step 1: obtain mail data from multiple email data sources;

Step 1 is called the acquisition step.

There may be three types of email data source: automatically registered Email terminals, volunteer Email accounts, and privacy-removed mail data provided by relevant organizations (for example, mail service providers). The first two types are classed as Email account information, and the last is called coordinated mail data. "Privacy removal" means that, provided suspicious files can still be obtained, information in the mail data about real people and events is replaced automatically or semi-automatically to protect personal privacy and sensitive information.

Email terminals are used as virtual honeypots and deployed according to the small-world characteristics of the Email network, forming a virtual honeynet with a specific topology that can capture more malicious code more effectively. A "small-world network" is a type of graph in dynamical networks in which most nodes can form communication links with one another through a small number of other nodes. The main small-world characteristics are the clustering coefficient, the average path length, and the node degree distribution.

In this embodiment, the mail data may include the header information of the email source, summary information generated from the header information, the email source body, and the email's accessible files. An email's accessible files are files that can be extracted directly from the email, such as pictures embedded in the body, files downloadable through hyperlinks in the email, and attachment files. The summary information is an abstract of the whole mail and may include the word count of the mail, the size of its attachments, the storage location of its attachments, and so on.
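
As an illustration of the four parts of the mail data, the sketch below parses one raw message with Python's standard `email` package and builds a record. The record layout, the field names, the `build_record` function and the sample message are assumptions made for this example; only attachments are extracted here, and fetching files behind hyperlinks is left out.

```python
import hashlib
from email import message_from_string

RAW = """\
From: alice@example.com
To: bob@example.com
Subject: Invoice
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain; charset="utf-8"

Please see the attached file.
--XX
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="report.pdf"
Content-Transfer-Encoding: base64

aGVsbG8=
--XX--
"""

def build_record(raw, source_type):
    """Split one raw message into header, summary, body and accessible files."""
    msg = message_from_string(raw)
    body, files = "", []
    for part in msg.walk():
        if part.is_multipart():
            continue                       # containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        name = part.get_filename()
        if name:                           # treated as an accessible file
            files.append({
                "name": name,
                "type": part.get_content_type(),
                "size": len(payload),
                "sha256": hashlib.sha256(payload).hexdigest(),
            })
        elif part.get_content_type() == "text/plain":
            charset = part.get_content_charset() or "utf-8"
            body += payload.decode(charset, errors="replace")
    summary = {
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
        "body_length": len(body.strip()),
        "has_attachment": bool(files),
        "source_type": source_type,
    }
    return {"header": list(msg.items()), "summary": summary,
            "body": body, "accessible_files": files}
```

Calling `build_record(RAW, "volunteer")` yields a summary with the sender, recipient and subject of the sample message and one accessible file named `report.pdf`.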

Fig. 1 is a flowchart of the acquisition step of the malicious code capture method in an embodiment of the invention. As shown in Fig. 1, in this embodiment the acquisition step (Step 1) may comprise the following sub-steps:

Step 101: obtain email data source information;

Specifically, the email data source information may be obtained according to the first configuration information, and the type of the email data source (Email account information or coordinated mail data) extracted. The first configuration information is preset by the administrator and may include the address of the mass mail raw information database, the access account password, the list of Email terminal account information, and so on.

Step 102: determine whether the source is Email account information; if so, go to Step 103, otherwise go to Step 111;

Step 103: obtain the last access time of the target account;

Here, the target account is the Email account from Step 102.

Step 104: determine whether the polling period of the target account has been reached; if so, go to Step 105, otherwise go to Step 109;

Step 105: access the target account;

Step 106: determine whether the target account has any new mail; if so, go to Step 107, otherwise go to Step 109;

A new mail in the target account is the target mail referred to below, that is, the pending mail.

Step 107: generate the summary information of the target mail;

The target mail is the pending mail; the same applies below.

Step 108: write the source body and summary information of the target mail to the mass mail raw information database;

At the same time, the accessible files of the target mail may be stored in the file system associated with the database, and their file paths written to the mass mail raw information database.

Step 109: determine whether this is the last target account; if so, go to Step 118, otherwise go to Step 110;

Step 110: move to the next account and go to Step 103;

Step 111, determine whether the data source is coordinated mail data; if so, execute step 112, otherwise execute step 118;

Step 112, standardize the multi-source mail data;

Here, "standardization" means processing the mail data from different mail data sources uniformly, extracting and generating mail source in a unified format that the system can recognize (for example, the eml file format).

Step 113, locate the target mail;

Step 114, remove the private information from the target mail;

Step 115, generate the summary information of the target mail;

The content of the summary information may include the email's recipient, sender, subject, body length, whether there are attachments, the email data source type, and the like.

Step 116, write the target mail and the summary information into the massive mail original information database;

Step 117, determine whether there is any unprocessed mail; if so, execute step 113, otherwise execute step 118;

Step 118, end.

Step 1 uses multiple kinds of email data sources as system input and can therefore capture, over a wide range, malicious code that spreads through Email networks.
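The account-polling branch of the acquisition step (steps 103 to 110) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Account` class, the `poll_accounts` function and the injected `fetch_new` callback are hypothetical names, real mailbox access is replaced by the callback, and the summary is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Account:
    name: str
    poll_interval: float  # seconds that must elapse between two visits
    last_access: float    # timestamp of the previous visit (step 103)


def poll_accounts(accounts: List[Account], now: float,
                  fetch_new: Callable[[Account], List[str]]) -> List[Dict]:
    """Visit every account that is due and collect records for its new mail."""
    records = []
    for account in accounts:                           # steps 109/110: next account
        if now - account.last_access < account.poll_interval:
            continue                                   # step 104: period not reached
        account.last_access = now                      # step 105: access the account
        for source in fetch_new(account):              # step 106: new mail, if any
            summary = {"length": len(source)}          # step 107: placeholder summary
            records.append({"account": account.name,   # step 108: record to store
                            "source": source,
                            "summary": summary})
    return records


# Usage: only the first account has reached its polling period at t = 100 s.
accounts = [Account("honeypot-a", 60.0, 0.0), Account("honeypot-b", 600.0, 0.0)]
records = poll_accounts(accounts, now=100.0, fetch_new=lambda acc: ["raw mail"])
```

Passing the mailbox access in as a callback keeps the loop independent of any particular mail protocol, which the text leaves open.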

Step 2, parse the mail data, record as suspicious files those files in the mail data that cannot be excluded according to the set false negative rate, and save the suspicious files to the suspicious file database;

Step 2 is called the parsing step.

FIG. 2 is a flowchart of the parsing step of the malicious code capture method in an embodiment of the present invention. As shown in FIG. 2, in this embodiment the parsing step (that is, step 2) may specifically include the following sub-steps:

Step 201, initialize according to the second configuration information;

Initialization in this step refers to the initialization of parameters, resources, and the like.

The second configuration information is preset by the administrator. Its content may include the number of mails to parse, the number of available servers, database address information, and the like. The second configuration information also contains an item "number of crawler programs". In the present invention the number of crawler programs is greater than 1, so the present invention uses parallel crawler technology.

Step 202, parse the mail header information;

The mail header information may include the language, encoding type, attachment type, sender IP address information, recipient IP address information, IP-based mail routing information, and the like.

Step 203, determine whether the header information is abnormal; if so, execute step 204, otherwise execute step 205;

Specifically, any mail whose header does not conform to the standard format of the mail protocol is regarded as abnormal. Header anomalies take many forms; for example, the header information may be judged abnormal when it contains forged information, or when the sender IP, the sender address, or the sender name has been tampered with.
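A format-conformance check of the kind described above might look like the following sketch. It covers only two simple conditions and does not attempt to detect forged or tampered values. The function name `header_anomalies` and the choice of required fields are assumptions for this example (From and Date are the two fields that RFC 5322 makes mandatory).

```python
from email import message_from_string
from email.utils import parseaddr

# Assumption: treat the two fields that RFC 5322 requires as the minimum.
REQUIRED_FIELDS = ("From", "Date")


def header_anomalies(raw: str) -> list:
    """Return a list of header problems found in one raw message."""
    msg = message_from_string(raw)
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if msg[name] is None]
    sender = msg["From"]
    # A sender field without any addr-spec is treated as malformed.
    if sender is not None and "@" not in parseaddr(sender)[1]:
        problems.append("malformed From address")
    return problems


ok = "From: alice@example.com\nDate: Mon, 01 Jan 2024 00:00:00 +0000\n\nhello"
bad = "From: nobody\nSubject: hi\n\nhello"
```

A mail with a non-empty result would be recorded as abnormal in step 204; a production check would need many more rules than these two.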

Step 204, record the abnormal information and write it into the suspicious file database, then execute step 205;

Step 205, parse the mail content information;

The mail content is the mail body.

Step 206, determine whether there is a link to an accessible file; if so, execute step 209, otherwise execute step 207;

Accessible files include pictures embedded in the mail body, files downloadable through hyperlinks, files in attachments, and the like.

Step 207, determine whether the mail content is abnormal; if so, execute step 208, otherwise execute step 213;

Abnormal mail content includes forged information in the body and forged information in attachments; that is, it falls mainly into body forgery and attachment forgery. Body forgery includes forgery of the mail header information and forgery of the mail body content. Attachment forgery takes many forms, including bundling an executable file with an attachment file, hiding illegal data by steganography, and tampering with a normal file format.

Step 208, record the abnormal information and write it into the suspicious file database, then execute step 213;

Step 209, extract or crawl the accessible files;

As steps 206, 207 and 209 above show, the collection method of the present invention is an active collection method that supports interaction.
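To illustrate how links to accessible files (step 206) could be found in an HTML mail body, here is a minimal sketch built on the standard `html.parser` module. The class and function names are hypothetical, and the sketch only collects link targets; downloading them (step 209) is not shown.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the targets of hyperlinks and embedded resources."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # href covers hyperlinks, src covers embedded pictures.
            if name in ("href", "src") and value:
                self.links.append(value)


def accessible_links(html_body: str) -> list:
    collector = LinkCollector()
    collector.feed(html_body)
    return collector.links


links = accessible_links(
    '<p>Invoice: <a href="http://example.com/invoice.exe">open</a></p>'
    '<img src="cid:logo1">'
)
```

An empty result corresponds to the "no link" branch, in which the flow continues with the content check of step 207.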

Step 210, determine whether the accessible file from step 209 can be judged to be a normal file; if so, execute step 212, otherwise execute step 211;

This step performs a preliminary filtering of the accessible files and can exclude normal files at a relatively high false negative rate. Files that cannot be excluded are suspicious files. These suspicious files may be malicious code and require further detection. The "false negative rate" is set by the administrator in the configuration; its value range is (0,1), and it is generally set above 50% so as to keep the false positive rate as low as possible.
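The text does not say how the preliminary filter is realized. One possible reading, shown here purely as an assumption, is a score threshold calibrated on known malicious samples so that at most the configured fraction of them would be excluded as normal. The scoring itself, the function names and the sample values are all hypothetical.

```python
import math


def calibrate_threshold(malicious_scores, false_negative_rate):
    """Pick the score below which files are excluded as normal.

    At most `false_negative_rate` of the known malicious scores fall
    strictly below the returned threshold.
    """
    if not 0.0 < false_negative_rate < 1.0:
        raise ValueError("false negative rate must lie in (0, 1)")
    ranked = sorted(malicious_scores)  # must be non-empty
    allowed_misses = math.floor(false_negative_rate * len(ranked))
    return ranked[allowed_misses]


def prefilter(files_with_scores, threshold):
    """Keep the files that cannot be excluded (the suspicious files)."""
    return [name for name, score in files_with_scores if score >= threshold]


threshold = calibrate_threshold([0.1, 0.4, 0.6, 0.9], false_negative_rate=0.5)
suspicious = prefilter([("a.doc", 0.2), ("b.exe", 0.6), ("c.js", 0.95)], threshold)
```

A higher configured rate raises the threshold, so fewer files are kept as suspicious, which is consistent with the stated aim of a low false positive rate.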

Step 211, store the file in the suspicious file database, then execute step 213;

Step 212, delete the target file;

Step 213, end.

The embodiment shown in FIG. 2 is based on parallel crawler technology and massive mail parsing technology. It obtains suspicious malicious code samples from the relevant Email terminals by active collection that supports interaction, which makes up for the passive acquisition of malicious code by honeypot and honeynet technology in the prior art and strengthens capture. "Parallel crawler technology" means running several crawler programs simultaneously on one server and starting multiple such servers at the same time, all sharing one database. "Active collection that supports interaction" means being able to simulate the behavior of an Email user in order to obtain malicious code that spreads through social engineering, for example by identifying hyperlinks in a mail and accessing the related files, or by interacting with a malicious sending source to obtain suspicious files.
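The idea of several crawlers writing to one shared store can be illustrated in miniature with a thread pool. This sketch is a stand-in only: threads inside one process replace the separate crawler programs and servers of the text, an in-memory dictionary replaces the shared database, and the `fetch` callback replaces real network access. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def crawl_parallel(urls, fetch, workers=4):
    """Fetch every URL with a pool of workers and store results in one place."""
    database = {}   # stand-in for the shared database
    lock = Lock()   # serializes writes from the workers

    def crawl_one(url):
        content = fetch(url)
        with lock:
            database[url] = content

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all tasks and re-raises any worker exception.
        list(pool.map(crawl_one, urls))
    return database


def fake_fetch(url):
    """Placeholder for a real download."""
    return ("payload of " + url).encode()


store = crawl_parallel(["http://example.com/a", "http://example.com/b"], fake_fetch)
```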

Step 3, use the malicious code feature database and manual detection to examine the suspicious files obtained in step 2, and save the suspicious files whose detection result is abnormal to the malicious code sample database;

Step 3 is called the detection step.

FIG. 3 is a flowchart of the detection step of the malicious code capture method in an embodiment of the present invention. As shown in FIG. 3, in this embodiment the detection step (that is, step 3) may specifically include the following sub-steps:

Step 301, initialize according to the third configuration information;

The third configuration information is preset by the administrator. Its content may include the configured number of merging processes, the number of malicious code detection servers, the analysis and detection time of the expert system, and the like.

Step 302, merge the suspicious files;

Here, "merging" is short for "summarizing and combining". Merging uses means such as combining similar files and comparing, deduplicating and combining files based on a hash algorithm, in order to reduce storage space and save subsequent computation.
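The hash-based deduplication part of merging can be sketched as follows. SHA-256 is chosen here only as an example, since the text does not name a hash algorithm, and the combining of merely similar (non-identical) files is not covered. The function name `merge_duplicates` is hypothetical.

```python
import hashlib


def merge_duplicates(files: dict) -> dict:
    """Group file names by the SHA-256 digest of their content.

    `files` maps a file name to its bytes; the result maps each digest
    to the names that share it, so only one copy per digest need be kept.
    """
    merged = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        merged.setdefault(digest, []).append(name)
    return merged


groups = merge_duplicates({
    "a.exe": b"MZ\x90",
    "copy_of_a.exe": b"MZ\x90",
    "b.pdf": b"%PDF",
})
```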

Step 303, detect malicious code based on the feature database (that is, the malicious code feature database; the same applies below);

Detection based on the feature database may work as follows: if a suspicious file contains a feature code from the malicious code feature database, the suspicious file is malicious code; if it contains no feature code from the malicious code feature database, the suspicious file is not malicious code.
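The containment test described above can be illustrated with plain byte-pattern matching. This is a minimal sketch in which the feature database is assumed to be a mapping from a label to a byte pattern; the labels and patterns are made up for the example.

```python
def match_signatures(data: bytes, signatures: dict) -> list:
    """Return the labels of all feature codes contained in the file content."""
    return [label for label, pattern in signatures.items() if pattern in data]


# Hypothetical feature database: label -> byte pattern.
signatures = {"demo-dropper": b"\xde\xad\xbe\xef", "demo-macro": b"AutoOpen"}
hits = match_signatures(b"header\xde\xad\xbe\xefpayload", signatures)
```

A non-empty result marks the file as malicious code; an empty result means no known feature code was found.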

Step 304, determine whether the suspicious file can be judged; if so, execute step 305, otherwise execute step 307;

Step 305, determine whether the target file is malicious code; if so, execute step 306, otherwise execute step 312;

Step 306, store the file in the malicious code sample database;

In this step, the file stored in the malicious code sample database is a suspicious file judged to be malicious code on the basis of the malicious code feature database.

Step 307, analyze and detect based on the expert system;

The "expert system" mentioned in the present invention extends the expert system in the traditional sense. It is centered on a number of security experts with experience in malicious code analysis, takes the suspicious files of this method as input, and judges malicious code through reverse analysis and behavior analysis carried out with human participation. It thereby makes up for the shortcomings of automated detection and discovers unknown malicious code that automated detection cannot detect.

Step 308, determine whether the target file is malicious code; if so, execute step 309, otherwise execute step 312;

Step 309, store the file in the malicious code sample database;

Step 310, determine whether the malicious code has new features; if so, execute step 311, otherwise execute step 312;

Step 311, optimize the malicious code feature database and execute step 303;

If, during detection based on the expert system, the target malicious code yields a new feature code, the new feature code is stored in the malicious code feature database, so as to optimize the database and improve detection accuracy and efficiency.

Step 312, end.
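The detection flow of steps 303 to 311 can be sketched as a loop with a feedback path into the feature database. Several points are assumptions made for this example: a file with no feature-code match is treated as "cannot be judged" and handed to the expert review, the human review is represented by a callback, and the function and label names are hypothetical.

```python
def detect(files: dict, signatures: dict, expert_review) -> dict:
    """Return the files judged malicious; `signatures` is updated in place.

    files:         name -> bytes
    signatures:    label -> byte pattern (the feature database)
    expert_review: callable(bytes) -> (is_malicious, new_pattern or None)
    """
    samples = {}
    for name, data in files.items():
        if any(pattern in data for pattern in signatures.values()):
            samples[name] = data                    # steps 303-306: known feature code
            continue
        is_malicious, new_pattern = expert_review(data)   # step 307: manual analysis
        if is_malicious:
            samples[name] = data                    # steps 308-309
            if new_pattern:                         # steps 310-311: feed back new feature
                signatures[f"learned-{len(signatures)}"] = new_pattern
    return samples


# Usage with a stand-in for the human review that counts how often it is called.
calls = []

def fake_expert(data):
    calls.append(data)
    return (True, b"EVIL") if b"EVIL" in data else (False, None)

feature_db = {}
found = detect({"x.bin": b"..EVIL..", "y.bin": b"EVIL again", "z.txt": b"hello"},
               feature_db, fake_expert)
```

In this run the second file is caught by the feature code learned from the first, so the expert callback is needed only for the first and third files.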

Step 4, obtain a malicious code sample from the malicious code sample database, run the sample in a sandbox, record the feature information of the sample and save it to the malicious code feature database.

The sandbox mentioned in step 4 may be any sandbox. In a preferred embodiment of the present invention, a lightweight sandbox may be used, which saves computing resources to some extent.

The malicious code capture method of the present invention can be implemented as computer programs. These programs may be developed in C/C++ and Python, with the front-end interface developed in PHP and JavaScript, the relevant databases built on MySQL, and the related big-data information stored in a custom file storage scheme.

The malicious code capture method of the present invention has the following beneficial effects:

1) Email terminals are selected as virtual honeypots to form a distributed virtual honeynet, which greatly reduces the cost of building and deploying the honeynet and makes it possible to capture more Email network malicious code quickly and effectively;

2) A crawler-based deep interactive collection method and massive mail parsing are used, which makes up for the passive nature of honeypot and honeynet technology and makes it possible to capture more complex Email network malicious code;

3) Multiple mail data sources are taken and processed as input, which greatly increases the range and the comprehensiveness of the Email network malicious code captured.

The malicious code capture method of the present invention can be applied in related honeypot and honeynet systems, where it can widen the coverage of captured objects and improve the ability to capture malicious code.

The present invention also proposes a malicious code capture system for implementing the malicious code capture method described above.

FIG. 4 is a structural diagram of the malicious code capture system in an embodiment of the present invention. As shown in FIG. 4, in this embodiment the malicious code capture system includes an acquisition module 410, a parsing module 420, a detection module 430 and a sandbox module 440 connected in sequence. The acquisition module 410 obtains mail data from multiple email data sources. The parsing module 420 parses the mail data obtained by the acquisition module 410, records as suspicious files those files in the mail data that cannot be excluded according to the set false negative rate, and saves the suspicious files to the suspicious file database. The detection module 430 uses the malicious code feature database and manual detection to examine the suspicious files output by the parsing module 420, and saves the suspicious files whose detection result is abnormal to the malicious code sample database. The sandbox module 440 obtains a malicious code sample from the malicious code sample database, runs the sample in a sandbox, records the feature information of the sample and saves it to the malicious code feature database.

In other embodiments of the present invention, the malicious code capture system may omit the sandbox module 440.

In other embodiments of the present invention, the malicious code capture system may further include an algorithm selection module connected to the acquisition module 410, which uses a selection and deployment algorithm for email terminal virtual honeypots to allocate and optimize the use of virtual honeypot resources. The selection and deployment algorithm is as follows: the email network through which malicious code spreads is abstracted as a weighted directed graph model of a social network, composed of points and edges and exhibiting small-world characteristics, in which a point represents an email account, an edge represents the mail exchanged between email accounts, the weight of an edge represents the number of mails exchanged within a certain period, the in-degree of a point represents the number of senders for that point within a certain period, and the out-degree represents the number of recipients for that point within a certain period.
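The weighted directed graph described above can be built from a list of observed (sender, recipient) pairs, as in the following minimal sketch. The function name `build_mail_graph` and the sample accounts are hypothetical, and how the graph is then used to choose honeypot accounts is not shown because the text does not specify it.

```python
from collections import Counter, defaultdict


def build_mail_graph(mails):
    """Build the graph from (sender, recipient) pairs seen in one time window.

    Returns (weights, in_degree, out_degree):
      weights[(s, r)] - number of mails sent from s to r (edge weight)
      in_degree[n]    - number of distinct senders writing to n
      out_degree[n]   - number of distinct recipients n writes to
    """
    weights = Counter(mails)
    senders_of = defaultdict(set)
    recipients_of = defaultdict(set)
    for sender, recipient in weights:
        senders_of[recipient].add(sender)
        recipients_of[sender].add(recipient)
    nodes = set(senders_of) | set(recipients_of)
    in_degree = {node: len(senders_of[node]) for node in nodes}
    out_degree = {node: len(recipients_of[node]) for node in nodes}
    return weights, in_degree, out_degree


mails = [("a@x.org", "b@x.org"), ("a@x.org", "b@x.org"),
         ("c@x.org", "b@x.org"), ("b@x.org", "a@x.org")]
weights, in_degree, out_degree = build_mail_graph(mails)
```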

The acquisition module 410 may further include an extraction unit, a first acquisition unit and a second acquisition unit. The extraction unit obtains email data source information according to the preset first configuration information and extracts the type of the email data source, which is an email terminal registered through automated application, a volunteer email account, or email data provided by a relevant institution. When the email data source is an email terminal registered through automated application or a volunteer email account, the first acquisition unit obtains the new mail of the email data source once its polling period expires and writes the mail data of the mail to be processed into the massive mail original information database; the mail data includes the header information of the source of the mail to be processed, the summary information generated from that header information, the source body of the mail to be processed, and the accessible files of the mail to be processed. When the email data source is email data provided by a relevant institution, the second acquisition unit standardizes the data of the email data source, removes the private information of the mail to be processed, and writes the mail data of the mail to be processed into the massive mail original information database; the mail data includes the header information of the source of the mail to be processed, the summary information generated from that header information, the source body of the mail to be processed, and the accessible files of the mail to be processed.

The header information of the source of the mail to be processed may include the language, encoding type, attachment type, sender IP address information, recipient IP address information, IP-based mail routing information, and the like.

The content of the summary information may include the email's recipient, sender, subject, body length, whether there are attachments, the email data source type, and the like.

The parsing module 420 may further include a first initialization unit, a parsing unit and a filtering unit. The first initialization unit performs initialization according to the preset second configuration information. The parsing unit parses the header information and the source body of the mail to be processed and saves abnormal header information and/or source body to the malicious code sample database. The filtering unit filters the accessible files of the mail to be processed according to the set false negative rate, excludes normal files, and saves the files that cannot be excluded to the suspicious file database as suspicious files.

The detection module 430 may further include a second initialization unit, a first detection unit and a second detection unit. The second initialization unit performs initialization according to the preset second configuration information. The first detection unit examines the suspicious files according to the malicious code feature information stored in the malicious code feature database and saves the suspicious files that contain malicious code feature information to the malicious code sample database. For suspicious files that cannot be detected according to the malicious code feature information, the second detection unit performs manual detection according to a preset expert system and saves to the malicious code feature database the new feature information produced by suspicious files determined to be malicious code during the manual detection.

The workflow of the malicious code capture system of the present invention is the same as that of the malicious code capture method described above and is not repeated here.

The terms used for the malicious code capture system have the same meanings as the same terms in the description of the malicious code capture method, so they are not explained again.

The malicious code capture system of the present invention has the following beneficial effects:

The malicious code capture system of the present invention can be applied in related honeypot and honeynet systems, where it can widen the coverage of captured objects and improve the ability to capture malicious code.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A malicious code capture method, characterized by comprising:

Obtaining mail data from multiple email data sources;

Parsing the mail data, recording as suspicious files those files in the mail data that cannot be excluded according to a set false negative rate, and saving the suspicious files to a suspicious file database;

Detecting the suspicious files using a malicious code feature database and manual detection, and saving the suspicious files whose detection result is abnormal to a malicious code sample database; which specifically comprises:

Initializing according to preset second configuration information;

Detecting the suspicious files according to the malicious code feature information stored in the malicious code feature database, and saving the suspicious files that contain the malicious code feature information to the malicious code sample database;

For suspicious files that cannot be detected according to the malicious code feature information, performing manual detection according to a preset expert system, and saving to the malicious code feature database the new feature information produced by suspicious files determined to be malicious code during the manual detection.

2. The malicious code capture method according to claim 1, characterized by further comprising:

Obtaining a malicious code sample from the malicious code sample database, running the malicious code sample in a sandbox, recording the feature information of the malicious code sample and saving it to the malicious code feature database.

3. The malicious code capture method according to claim 1, characterized in that, before the obtaining of mail data from multiple email data sources, the method further comprises:

Using a selection and deployment algorithm for email terminal virtual honeypots to allocate and optimize the use of virtual honeypot resources, the selection and deployment algorithm being: the email network through which malicious code spreads is abstracted as a weighted directed graph model of a social network, composed of points and edges and having small-world model characteristics, wherein a point represents an email account, an edge represents the mail exchanged between email accounts, the weight of an edge represents the number of mails exchanged within a certain period, the in-degree of a point represents the number of senders for that point within a certain period, and the out-degree represents the number of recipients for that point within a certain period.

4. The malicious code capture method according to claim 1, characterized in that the obtaining of mail data from multiple email data sources comprises:

Obtaining email data source information according to preset first configuration information, and extracting the type of the email data source, the type of the email data source being an email terminal registered through automated application, a volunteer email account, or mail data with private information removed that is provided by a relevant institution;

If the email data source is an email terminal registered through automated application or a volunteer email account, then, when the polling period of the email data source expires, obtaining the mail to be processed from the email data source and writing the mail data of the mail to be processed into a massive mail original information database, the mail data comprising the header information of the source of the mail to be processed, summary information generated from that header information, the source body of the mail to be processed, and the accessible files of the mail to be processed;

If the email data source is mail data with private information removed that is provided by a relevant institution, then standardizing the data of the email data source, removing the private information of the mail to be processed, and writing the mail data of the mail to be processed into the massive mail original information database, the mail data comprising the header information of the source of the mail to be processed, summary information generated from that header information, the source body of the mail to be processed, and the accessible files of the mail to be processed.

5. The malicious code capture method according to claim 4, characterized in that the header information of the source of the mail to be processed comprises the language, encoding type, attachment type, sender IP address information, recipient IP address information, and IP-based mail routing information.

6. The malicious code capture method according to claim 4, characterized in that the content of the summary information comprises the email's recipient, sender, subject, body length, whether there are attachments, and the email data source type.

7. The malicious code capture method according to claim 4, characterized in that the parsing of the mail data, recording as suspicious files those files in the mail data that cannot be excluded according to the set false negative rate, and saving the suspicious files to the suspicious file database, comprises:

Parsing the header information and the source body of the mail to be processed, and saving abnormal header information and/or source body to the malicious code sample database.

8. a malicious code capture systems, is characterized in that, comprising:

Acquisition module, for obtaining mail data from multiple e-mail data source;

Parsing module, for resolving the mail data that described acquisition module obtains, being apocrypha by the file record cannot got rid of according to the rate of failing to report of setting in described mail data, and being saved in apocrypha database by this apocrypha;

Testing result, for utilizing malicious code property data base and manual detection to detect the apocrypha that described parsing module exports, is that abnormal apocrypha is saved in described malicious code sample database by detection module; Described detection module comprises:

Second initialization unit, for carrying out initialization according to the second configuration information preset;

First detecting unit, for detecting described apocrypha according to the malicious code characteristic information preserved in malicious code property data base, is saved in malicious code sample database by the apocrypha comprising described malicious code characteristic information;

Second detecting unit, for for the apocrypha that cannot detect according to described malicious code characteristic information, expert system according to presetting carries out manual detection, will be judged to be that the new feature information that the apocrypha of malicious code produces is saved in malicious code property data base in manual detection process.

9. The malicious code capture system according to claim 8, characterized by further comprising:

a sandbox module, configured to obtain a malicious code sample from the malicious code sample database, run the malicious code sample in a sandbox, record the feature information of the malicious code sample, and save it in the malicious code feature database.

10. The malicious code capture system according to claim 8, characterized by further comprising, connected to the acquisition module:

an algorithm selection module, configured to allocate and optimize the use of virtual honeypot resources by means of a selection and deployment algorithm for email-terminal virtual honeypots, the selection and deployment algorithm being: abstracting the email network through which malicious code propagates as a weighted directed graph model of a social network, formed of nodes and edges and having small-world characteristics, wherein a node represents an email account, an edge represents mail communication between email accounts, the weight of an edge represents the number of mails communicated within a certain time, the in-degree of a node represents the number of senders to that node within a certain time, and the out-degree represents the number of recipients of that node within a certain time.
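A minimal sketch of the graph model in claim 10 is given below, assuming the mail traffic of one time window is available as (sender, recipient) pairs. The function name `build_mail_graph` and the returned dictionaries are hypothetical, and the sketch covers only the model itself: the claim does not state how honeypot accounts are then selected from it.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_mail_graph(mails: Iterable[Tuple[str, str]]):
    """Build the weighted directed graph for one time window.

    Each mail is a (sender, recipient) pair; nodes are email accounts.
    """
    weights: Counter = Counter()                      # edge weight = number of mails
    senders_of: Dict[str, Set[str]] = defaultdict(set)
    recipients_of: Dict[str, Set[str]] = defaultdict(set)

    for sender, recipient in mails:
        weights[(sender, recipient)] += 1
        senders_of[recipient].add(sender)
        recipients_of[sender].add(recipient)

    nodes = set(senders_of) | set(recipients_of)
    # In-degree: distinct senders to the node; out-degree: distinct recipients.
    in_degree = {n: len(senders_of.get(n, ())) for n in nodes}
    out_degree = {n: len(recipients_of.get(n, ())) for n in nodes}
    return weights, in_degree, out_degree
```

Counting distinct senders and recipients, rather than mails, follows the claim's wording for in-degree and out-degree; the mail counts are carried by the edge weights.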

11. The malicious code capture system according to claim 8, characterized in that the acquisition module comprises:

an extraction unit, configured to obtain email data source information according to preset first configuration information and to extract the type of the email data source, the type of the email data source being an email terminal registered by automated application, a volunteer email account, or email data provided by a relevant institution;

a first acquisition unit, configured to, when the email data source is an email terminal registered by automated application or a volunteer email account, obtain the new mail in the email data source when the polling period of the email data source expires, and write the mail data of the pending mail into a mass mail raw information database, the mail data comprising the header information of the pending mail source code, summary information generated from the header information, the body of the pending mail source code, and the accessible files of the pending mail;

a second acquisition unit, configured to, when the email data source is email data provided by a relevant institution, standardize the data of the email data source, remove the private information of the pending mail, and write the mail data of the pending mail into the mass mail raw information database, the mail data comprising the header information of the pending mail source code, summary information generated from the header information, the body of the pending mail source code, and the accessible files of the pending mail.

12. The malicious code capture system according to claim 11, characterized in that the header information of the pending mail source code comprises language, encoding type, attachment type, sender IP address information, recipient IP address information, and IP-based mail routing information.

13. The malicious code capture system according to claim 11, characterized in that the summary information comprises the recipient, sender, subject, and body length of the email, whether the email has an attachment, and the email data source type.
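For illustration, the summary information listed in claim 13 could be derived from a raw mail source with Python's standard `email` package, roughly as sketched below. The function name `build_summary`, the dictionary keys, and the `source_type` labels are assumptions made for this sketch and do not come from the patent.

```python
from email import message_from_string, policy


def build_summary(raw_source: str, source_type: str) -> dict:
    """Generate summary information for one pending mail.

    source_type is a hypothetical label for the email data source type,
    for example "auto_registered", "volunteer", or "institution".
    """
    msg = message_from_string(raw_source, policy=policy.default)

    # Prefer the plain-text body and fall back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    body_text = body.get_content() if body is not None else ""

    return {
        "recipient": str(msg.get("To", "")),
        "sender": str(msg.get("From", "")),
        "subject": str(msg.get("Subject", "")),
        "body_length": len(body_text),
        "has_attachment": any(True for _ in msg.iter_attachments()),
        "source_type": source_type,
    }
```

The header fields of claim 12, such as encoding type or routing information, could be read from the same parsed message, but how the patented system extracts them is not described in the claims.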

14. The malicious code capture system according to claim 11, characterized in that the parsing module comprises:

a parsing unit, configured to parse the header information and the source code body of the pending mail source code, and to save abnormal header information and/or source code body in the malicious code sample database.