CN102970283B

CN102970283B - Document scanning system

Info

Publication number: CN102970283B
Application number: CN201210428845.1A
Authority: CN
Inventors: 于春功; 贺超
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Priority date: 2012-10-31
Filing date: 2012-10-31
Publication date: 2015-08-12
Anticipated expiration: 2032-10-31
Also published as: CN102970283A

Abstract

The embodiment of the invention discloses a file scanning system to solve the problem of low file scanning efficiency. The system includes a client and a server side. The client includes a file upload module. The server side includes a storage server, a file download server and a scanning server: the storage server includes a database; the file download server includes a file download module; and the scanning server includes a file scanning device. The file scanning device includes: a probability calculation module; a sorting module; an extraction module, adapted to obtain the number K of files to scan and to extract, from the sorted sample files to be scanned, the K sample files with the highest suspicious probability, K being a positive integer; and a scanning module, adapted to scan the K sample files and identify the suspicious sample files among them. The invention improves scanning efficiency, identifies as many suspicious sample files as possible, and improves the accuracy of scanning sample files.

Description

file scanning system

技术领域 technical field

本发明涉及网络安全技术领域，具体涉及一种文件扫描系统。The invention relates to the technical field of network security, in particular to a file scanning system.

背景技术 Background art

恶意程序是一个概括性的术语，指任何故意创建用来执行未经授权并通常是有害行为的软件程序。计算机病毒、后门程序、键盘记录器、密码盗取者、Word和Excel宏病毒、引导区病毒、脚本病毒(batch，windows shell，java等)、木马、犯罪软件、间谍软件和广告软件等等，都是一些可以称之为恶意程序的例子。Malicious program is an umbrella term for any software program intentionally created to perform unauthorized and often harmful acts. Computer viruses, backdoor programs, keyloggers, password stealers, Word and Excel macro viruses, boot sector viruses, script viruses (batch, windows shell, java, etc.), Trojan horses, crimeware, spyware and adware, etc., These are examples of what could be called malicious programs.

为了防止恶意程序对计算机的攻击，一般都需要在计算机上安装杀毒软件对系统中的文件进行扫描，以鉴别出恶意程序并进行查杀。In order to prevent malicious programs from attacking the computer, it is generally necessary to install anti-virus software on the computer to scan files in the system to identify and kill malicious programs.

为了快速地识别和查杀恶意程序，同时为了减轻客户端的资源消耗，目前的安全防护软件越来越多地使用云安全技术。云安全技术即把客户端的文件传给服务器端，在服务器端中存储了大量样本文件，服务器端通过将客户端上传的文件与其存储的样本文件进行比对，从而对客户端文件的安全性做出判定，然后客户端安全软件根据服务器端传回的信息对恶意程序进行报告和处理。To identify and remove malicious programs quickly while reducing resource consumption on the client, current security software increasingly uses cloud security technology. With cloud security technology, the client's files are transmitted to the server side, which stores a large number of sample files. The server side compares the files uploaded by the client with its stored sample files to judge whether the client files are safe, and the client security software then reports and handles malicious programs according to the information returned by the server side.

由于恶意程序的种类和数量不断地增加，服务器端中的样本文件也要不断地更新，因此客户端每天需要将数以万计的样本文件上传到服务器端，云安全中心利用定期升级的第三方杀毒软件(即除云安全中心之外的其他杀毒软件)每天对全部的样本文件进行扫描，以鉴别出其中的可疑样本文件。但是，第三方杀毒软件的扫描能力是有限的，随着样本文件数量的增多，这种方式显然会降低文件扫描效率。As the types and number of malicious programs keep increasing, the sample files on the server side must also be updated continuously. The client therefore needs to upload tens of thousands of sample files to the server side every day, and the cloud security center uses regularly upgraded third-party antivirus software (that is, antivirus software other than the cloud security center itself) to scan all the sample files every day and identify the suspicious sample files among them. However, the scanning capability of third-party antivirus software is limited, and as the number of sample files grows, this approach clearly reduces file scanning efficiency.

发明内容 Summary of the invention

鉴于上述问题，提出了本发明以便提供一种克服上述问题或者至少部分地解决上述问题的文件扫描系统。In view of the above problems, the present invention is proposed to provide a file scanning system that overcomes the above problems or at least partially solves the above problems.

依据本发明，提供了一种文件扫描系统，包括：客户端和服务器端，According to the present invention, a file scanning system is provided, including: a client and a server,

其中，wherein,

客户端包括：The client includes:

文件上传模块，适于将样本文件上传至存储服务器中；A file upload module, suitable for uploading sample files to a storage server;

服务器端包括：存储服务器、文件下载服务器和扫描服务器，The server side includes: storage server, file download server and scanning server,

所述存储服务器包括：The storage server includes:

数据库，适于存储所述文件上传模块上传的样本文件；A database adapted to store sample files uploaded by the file upload module;

所述文件下载服务器包括：The file download server includes:

文件下载模块，适于从所述数据库中下载样本文件并传输至扫描服务器中；A file download module, adapted to download sample files from the database and transmit them to the scanning server;

所述扫描服务器包括文件扫描装置，该文件扫描装置包括：The scanning server includes a file scanning device, and the file scanning device includes:

概率计算模块，适于针对待扫描样本文件，分别计算每个待扫描样本文件被鉴别为可疑的概率；A probability calculation module, adapted to calculate, for each sample file to be scanned, the probability that it is identified as suspicious;

排序模块，适于对所述待扫描样本文件按照其被鉴别为可疑的概率进行排序；A sorting module, adapted to sort the sample files to be scanned according to the probability of being identified as suspicious;

抽取模块，适于获取扫描文件的个数K，从排序后的待扫描样本文件中抽取可疑概率高的K个待扫描样本文件，K为正整数；The extraction module is suitable for obtaining the number K of scanned files, and extracts K sample files to be scanned with high suspicious probability from the sorted sample files to be scanned, and K is a positive integer;

扫描模块，适于对所述K个待扫描样本文件进行扫描，鉴别出其中的可疑样本文件。The scanning module is adapted to scan the K sample files to be scanned, and identify suspicious sample files among them.

本发明实施例中，该文件扫描装置还包括：In the embodiment of the present invention, the file scanning device further includes:

等级检测模块，适于在概率计算模块分别计算每个待扫描样本文件被鉴别为可疑的概率之前，检测全部样本文件的等级，所述样本文件的等级包括安全等级、未知等级、可疑/高度可疑等级、以及恶意等级；The level detection module is adapted to detect the levels of all sample files before the probability calculation module calculates the probability that each sample file to be scanned is identified as suspicious, and the levels of the sample files include security level, unknown level, suspicious/highly suspicious level, and malicious level;

获取模块，适于获取未知等级的样本文件，将获取到的未知等级的样本文件作为待扫描样本文件。The obtaining module is adapted to obtain sample files of unknown grades, and uses the obtained sample files of unknown grades as sample files to be scanned.

本发明实施例中，排序模块按照待扫描样本文件被鉴别为可疑的概率从大到小进行排序；In the embodiment of the present invention, the sorting module sorts the sample files to be scanned according to the probability that they are identified as suspicious from large to small;

所述K个待扫描样本文件为排序后的待扫描样本文件中的前K个待扫描样本文件。The K sample files to be scanned are the first K sample files to be scanned in the sorted sample files to be scanned.

本发明实施例中，概率计算模块包括：In the embodiment of the present invention, the probability calculation module includes:

时间点获取子模块，适于针对每个待扫描样本文件，获取该待扫描样本文件对应的本次扫描的时间点n₂以及上次扫描的时间点n₁；A time point acquisition submodule, adapted to obtain, for each sample file to be scanned, the time point n₂ of the current scan and the time point n₁ of the last scan corresponding to that sample file;

概率计算子模块，适于计算从时间点n₁开始到时间点n₂为止，所述待扫描样本文件在本次扫描中被鉴别为可疑的概率Pr(N≥n₁，N≤n₂|α，β)：A probability calculation submodule, adapted to calculate the probability Pr(N≥n₁, N≤n₂|α, β) that the sample file to be scanned is identified as suspicious in the current scan, over the period from time point n₁ to time point n₂:

其中，参数α和β为通过对待扫描样本文件数据进行最大似然估计得到的参数。Wherein, parameters α and β are parameters obtained by performing maximum likelihood estimation on the sample file data to be scanned.

建立模块，适于在概率计算模块分别计算每个待扫描样本文件被鉴别为可疑的概率之前，为每个待扫描样本文件建立一个信息库，所述信息库中包括该待扫描样本文件对应的上次扫描的时间点n₁。An establishment module, adapted to establish an information base for each sample file to be scanned before the probability calculation module calculates the probability that each sample file to be scanned is identified as suspicious, the information base including the time point n₁ of the last scan corresponding to that sample file.

本发明实施例中，概率计算子模块包括：In the embodiment of the present invention, the probability calculation submodule includes:

概率计算单元，适于计算每个待扫描样本文件前n-1次没有被鉴别为可疑，第n次被鉴别为可疑的概率Pr(N≥n|α，β)：The probability calculation unit is adapted to calculate the probability Pr(N≥n|α, β) that each sample file to be scanned is not identified as suspicious for the first n-1 times, and is identified as suspicious for the nth time:

$\Pr(N \ge n \mid \alpha, \beta) = \begin{cases} 1, & n = 1 \\ \frac{\beta + n - 2}{\alpha + \beta + n - 2} \Pr(N \ge n - 1 \mid \alpha, \beta), & n > 1 \end{cases};$

第一替换单元，适于将所述Pr(N≥n|α，β)中的n替换为n₁，计算Pr(N≥n₁|α，β)；A first replacement unit, adapted to replace n in Pr(N≥n|α, β) with n₁ and calculate Pr(N≥n₁|α, β);

第二替换单元，适于将所述Pr(N≥n|α，β)中的n替换为n₂+1，计算Pr(N≥n₂+1|α，β)；A second replacement unit, adapted to replace n in Pr(N≥n|α, β) with n₂+1 and calculate Pr(N≥n₂+1|α, β);
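The two substituted values can be combined into the interval probability by a standard identity for an integer-valued N. The line below is a reconstruction from the two replacement units, not a quotation of the source's own formula, which is not reproduced in this text:

$\Pr(N \ge n_1, N \le n_2 \mid \alpha, \beta) = \Pr(N \ge n_1 \mid \alpha, \beta) - \Pr(N \ge n_2 + 1 \mid \alpha, \beta)$

The identity holds because the event N ≥ n₁ splits into the disjoint events n₁ ≤ N ≤ n₂ and N ≥ n₂+1.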

根据本发明实施例的文件扫描系统，可以针对待扫描样本文件，分别计算每个待扫描样本文件被鉴别为可疑的概率，然后对所述待扫描样本文件按照其被鉴别为可疑的概率进行排序，并从排序后的待扫描样本文件中抽取可疑概率高的K个待扫描样本文件，最后对所述K个待扫描样本文件进行扫描，鉴别出其中的可疑样本文件。由此解决了现有技术中由于每天需要对全部的样本文件进行扫描而导致的文件扫描效率低的问题，取得了提高扫描效率的有益效果。并且，由于本发明通过抽取可疑概率高的K个待扫描样本文件进行扫描，因此能够尽可能多地鉴别出可疑样本文件，提高扫描样本文件的准确性。According to the file scanning system of the embodiment of the present invention, for the sample files to be scanned, the probability that each sample file to be scanned is identified as suspicious can be calculated respectively, and then the sample files to be scanned are sorted according to the probability of being identified as suspicious , and extract K sample files to be scanned with high suspicious probability from the sorted sample files to be scanned, and finally scan the K sample files to be scanned to identify suspicious sample files among them. This solves the problem of low file scanning efficiency in the prior art due to the need to scan all sample files every day, and achieves the beneficial effect of improving scanning efficiency. Moreover, since the present invention scans by extracting K sample files to be scanned with a high suspicious probability, it can identify as many suspicious sample files as possible and improve the accuracy of scanning sample files.

上述说明仅是本发明技术方案的概述，为了能够更清楚了解本发明的技术手段，而可依照说明书的内容予以实施，并且为了让本发明的上述和其它目的、特征和优点能够更明显易懂，以下特举本发明的具体实施方式。The above description is only an overview of the technical solution of the present invention. In order to better understand the technical means of the present invention, it can be implemented according to the contents of the description, and in order to make the above and other purposes, features and advantages of the present invention more obvious and understandable , the specific embodiments of the present invention are enumerated below.

附图说明 Description of drawings

通过阅读下文优选实施方式的详细描述，各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的，而并不认为是对本发明的限制。而且在整个附图中，用相同的参考符号表示相同的部件。在附图中：Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference numerals designate the same parts. In the drawings:

图1示出了根据本发明一个实施例的一种扫描文件的方法的流程图；Fig. 1 shows a flow chart of a method for scanning files according to an embodiment of the present invention;

图2示出了根据本发明一个实施例的一种扫描文件的方法的流程图；以及Fig. 2 shows a flow chart of a method for scanning files according to an embodiment of the present invention; and

图3示出了根据本发明一个实施例的一种文件扫描装置的结构框图；Fig. 3 shows a structural block diagram of a document scanning device according to an embodiment of the present invention;

图4示出了根据本发明一个实施例的一种文件扫描系统的结构框图。Fig. 4 shows a structural block diagram of a file scanning system according to an embodiment of the present invention.

具体实施方式 Detailed description

下面将参照附图更详细地描述本公开的示例性实施例。虽然附图中显示了本公开的示例性实施例，然而应当理解，可以以各种形式实现本公开而不应被这里阐述的实施例所限制。相反，提供这些实施例是为了能够更透彻地理解本公开，并且能够将本公开的范围完整的传达给本领域的技术人员。Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided for more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.

本发明实施例可以应用于计算机系统/服务器，其可与众多其它通用或专用计算系统环境或配置一起操作。适于与计算机系统/服务器一起使用的众所周知的计算系统、环境和/或配置的例子包括但不限于：个人计算机系统、服务器计算机系统、瘦客户机、厚客户机、手持或膝上设备、基于微处理器的系统、机顶盒、可编程消费电子产品、网络个人电脑、小型计算机系统、大型计算机系统和包括上述任何系统的分布式云计算技术环境，等等。Embodiments of the invention may be applied to computer systems/servers that are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments and/or configurations suitable for use with computer systems/servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, Microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing technology environments including any of the foregoing, among others.

计算机系统/服务器可以在由计算机系统执行的计算机系统可执行指令(诸如程序模块)的一般语境下描述。通常，程序模块可以包括例程、程序、目标程序、组件、逻辑、数据结构等等，它们执行特定的任务或者实现特定的抽象数据类型。计算机系统/服务器可以在分布式云计算环境中实施，分布式云计算环境中，任务是由通过通信网络链接的远程处理设备执行的。在分布式云计算环境中，程序模块可以位于包括存储设备的本地或远程计算系统存储介质上。Computer systems/servers may be described in the general context of computer system-executable instructions, such as program modules, being executed by the computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server can be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including storage devices.

参照图1，示出了根据本发明一个实施例的一种扫描文件的方法的流程图。Referring to FIG. 1 , it shows a flowchart of a method for scanning files according to an embodiment of the present invention.

由于恶意程序的种类和数量不断地增加，服务器端中的样本文件也要不断地更新，因此客户端每天需要将数以万计的样本文件上传到服务器端，服务器端利用定期升级的第三方杀毒软件每天对全部的样本文件进行扫描，以鉴别出其中的可疑样本文件。但是，第三方杀毒软件的扫描能力是有限的，随着样本文件数量的增多，这种方式显然会降低文件扫描效率。As the types and number of malicious programs keep increasing, the sample files on the server side must also be updated continuously. The client therefore needs to upload tens of thousands of sample files to the server side every day, and the server side uses regularly upgraded third-party antivirus software to scan all the sample files every day and identify the suspicious sample files among them. However, the scanning capability of third-party antivirus software is limited, and as the number of sample files grows, this approach clearly reduces file scanning efficiency.

因此，为了提高文件扫描效率，本发明实施例提出一种根据第三方杀毒软件的扫描能力(即能够扫描的文件个数的最大值)从样本文件中抽取部分满足条件的样本文件，只对抽取出的部分样本文件进行扫描的方法。Therefore, to improve file scanning efficiency, the embodiment of the present invention proposes a method that, according to the scanning capability of the third-party antivirus software (that is, the maximum number of files it can scan), extracts from the sample files a subset that meets the conditions and scans only the extracted sample files.

具体的，本实施例的扫描文件的方法包括以下步骤：Specifically, the method for scanning files in this embodiment includes the following steps:

步骤S101，针对待扫描样本文件，分别计算每个待扫描样本文件被鉴别为可疑的概率。Step S101, for the sample files to be scanned, respectively calculate the probability that each sample file to be scanned is identified as suspicious.

本实施例提出从待扫描样本文件中抽取部分样本文件进行扫描，因此首先需要确定具体抽取哪些样本文件进行扫描。为了尽可能多地鉴别出可疑样本文件，提高文件扫描的准确性，本实施例中提出依据待扫描样本文件被鉴别为可疑的概率进行样本文件的抽取。This embodiment proposes to extract some sample files from the sample files to be scanned for scanning. Therefore, it is first necessary to determine which sample files are specifically extracted for scanning. In order to identify as many suspicious sample files as possible and improve the accuracy of file scanning, it is proposed in this embodiment to extract sample files based on the probability that the sample files to be scanned are identified as suspicious.

因此，在该步骤S101中可以针对待扫描样本文件，分别计算每个待扫描样本文件被鉴别为可疑的概率，对于具体的计算过程，将在下面的实施例中详细介绍。Therefore, in this step S101 , for the sample files to be scanned, the probability that each sample file to be scanned is identified as suspicious can be calculated respectively, and the specific calculation process will be described in detail in the following embodiments.

在本发明实施例中，为了查杀不断增多的恶意程序，杀毒软件可以定期对样本文件进行扫描，相应的，在该步骤中定期进行概率的计算，并且，为了能够更加准确地确定出待扫描样本文件，每次计算时可以针对全部的待扫描样本文件进行计算，以使扫描更加全面。In the embodiment of the present invention, to detect and remove the ever-increasing number of malicious programs, the antivirus software may scan the sample files periodically; correspondingly, the probability calculation in this step is performed periodically. To determine the sample files to be scanned more accurately, each calculation may cover all the sample files to be scanned, so that the scan is more comprehensive.

在本实施例中，考虑到样本文件随时都有可能受到病毒感染，因此对于每一个样本文件来说，即使该样本文件前n-1次没有被鉴别为可疑，这只是说明之前在扫描时其没有感染病毒，但是在后续也可能感染病毒，因此在第n次扫描时也需要对其进行概率计算。所以，为了使扫描更加全面、准确，本实施例提出每次对全部的待扫描样本文件进行计算。In this embodiment, a sample file may be infected by a virus at any time. For each sample file, even if it was not identified as suspicious in the first n-1 scans, this only shows that it was not infected when those scans were made; it may still become infected later, so its probability also needs to be calculated at the nth scan. Therefore, to make the scan more comprehensive and accurate, this embodiment proposes calculating the probability for all the sample files to be scanned each time.

当然，在本实施例中，每次也可以针对部分的样本文件进行计算，本发明实施例对此并不加以限制。Of course, in this embodiment, the calculation may also be performed on some sample files each time, which is not limited in this embodiment of the present invention.

其中，对于计算的时间间隔，可以根据杀毒软件升级的时间间隔确定，例如，杀毒软件每隔时间t进行升级，那么可以设定每隔时间t针对全部的待扫描样本文件，分别计算每个待扫描样本文件被鉴别为可疑的概率。当然，所述计算的时间间隔还可以设定为其他的值，本实施例对此并不加以限制。The calculation interval may be determined from the interval at which the antivirus software is upgraded. For example, if the antivirus software is upgraded every time t, then every time t the probability that each sample file to be scanned is identified as suspicious may be calculated for all the sample files to be scanned. Of course, the calculation interval may also be set to other values, which is not limited in this embodiment.

步骤S102，对所述待扫描样本文件按照其被鉴别为可疑的概率进行排序。Step S102, sorting the sample files to be scanned according to the probability of being identified as suspicious.

步骤S103，获取扫描文件的个数K，从排序后的待扫描样本文件中抽取可疑概率高的K个待扫描样本文件，K为正整数。In step S103, the number K of scanned files is acquired, and K sample files to be scanned with a high suspicious probability are extracted from the sorted sample files to be scanned, where K is a positive integer.

步骤S104，对所述K个待扫描样本文件进行扫描，鉴别出其中的可疑样本文件。Step S104, scanning the K sample files to be scanned, and identifying suspicious sample files among them.

在步骤S101中计算出每个待扫描样本文件被鉴别为可疑的概率之后，首先按照计算出的概率对所述待扫描样本文件进行排序，然后依据第三方杀毒软件能够扫描的文件的个数K从排序后的待扫描样本文件中抽取可疑概率高的K个待扫描样本文件，最后第三方杀毒软件只需对抽取出的K个待扫描样本文件进行扫描，进一步鉴别出其中的可疑样本文件即可，而不需要再对全部的样本文件进行扫描。对于具体的过程，将在下面的实施例中详细介绍。After calculating the probability that each sample file to be scanned is identified as suspicious in step S101, the sample file to be scanned is first sorted according to the calculated probability, and then according to the number K of files that can be scanned by a third-party antivirus software Extract K sample files to be scanned with high suspicious probability from the sorted sample files to be scanned. Finally, the third-party antivirus software only needs to scan the extracted K sample files to be scanned, and further identify the suspicious sample files. Yes, instead of scanning all sample files. The specific process will be described in detail in the following embodiments.

其中，K的取值可以根据第三方杀毒软件的扫描能力而定，可以将第三方杀毒软件能够扫描的文件个数的最大值作为K的值，例如，如果杀毒软件一天能扫描1000个样本文件，那么K＝1000。Among them, the value of K can be determined according to the scanning capability of the third-party anti-virus software, and the maximum number of files that the third-party anti-virus software can scan can be used as the value of K. For example, if the anti-virus software can scan 1000 sample files a day , then K=1000.

当然，K也可以取其他的值，本发明实施例对此并不加以限制。Certainly, K may also take other values, which are not limited in this embodiment of the present invention.
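Steps S101 to S103 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_top_k`, the probability callback, and the example values are all hypothetical, and the actual scan in step S104 is left to whatever third-party antivirus software is in use.

```python
from typing import Callable, Iterable, List, Tuple


def select_top_k(
    file_ids: Iterable[str],
    suspicious_probability: Callable[[str], float],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the K files with the highest suspicious probability."""
    if k <= 0:
        raise ValueError("K must be a positive integer")
    # S101: compute the probability for every candidate file.
    scored = [(fid, suspicious_probability(fid)) for fid in file_ids]
    # S102: sort by probability, largest first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # S103: keep the first K entries (fewer if there are not K candidates).
    return scored[:k]


# Hypothetical probabilities, for illustration only.
example_probs = {"a": 0.10, "b": 0.45, "c": 0.30, "d": 0.05}
top = select_top_k(example_probs, lambda fid: example_probs[fid], k=2)
print(top)
```

Only the files in `top` would be handed to the scanner in step S104; the rest wait for a later round.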

本发明实施例解决了现有技术中由于每天需要对全部的样本文件进行扫描而导致的文件扫描效率低的问题，取得了提高扫描效率的有益效果。并且由于本发明通过抽取可疑概率高的K个待扫描样本文件进行扫描，因此能够尽可能多地鉴别出可疑样本文件，提高扫描的准确性。The embodiment of the present invention solves the problem of low file scanning efficiency in the prior art due to the need to scan all sample files every day, and achieves the beneficial effect of improving the scanning efficiency. And because the present invention scans by extracting K sample files to be scanned with a high suspicious probability, it can identify as many suspicious sample files as possible and improve the scanning accuracy.

参照图2，示出了根据本发明一个实施例的一种扫描文件的方法的流程图，所述方法包括：Referring to FIG. 2 , it shows a flow chart of a method for scanning files according to an embodiment of the present invention, the method comprising:

步骤S201，服务器接收客户端上传的全部样本文件。In step S201, the server receives all sample files uploaded by the client.

通过云安全技术进行病毒查杀的过程即把客户端的文件传给服务器端，在服务器端中存储了大量样本文件，服务器端通过将客户端上传的文件与其存储的样本文件进行比对，从而对客户端文件的安全性做出判定，然后客户端安全软件根据服务器端传回的信息对客户端文件进行报告和处理。The process of virus detection and killing through cloud security technology is to transmit the client’s files to the server, and store a large number of sample files in the server, and the server compares the files uploaded by the client with the stored sample files, so as to The security of the client file is judged, and then the client security software reports and processes the client file according to the information sent back from the server.

因此，首先需要确定出存储在云安全中心服务器的样本文件以及鉴别出这些样本文件是否可疑，然后才能将客户端上传的文件与服务器存储的样本文件进行比对，以判断客户端文件的安全性。Therefore, it is first necessary to determine the sample files stored in the cloud security center server and identify whether these sample files are suspicious, and then compare the files uploaded by the client with the sample files stored in the server to judge the security of the client files .

首先，客户端将全部的样本文件上传至服务器，然后由服务器进行后续的处理。需要说明的是，这里所述的客户端上传的样本文件并不是要与服务器存储的样本文件进行比对，而是要从这些样本文件中查找出需要存储在服务器中的样本文件，并鉴别这些样本文件是否可疑。First, the client uploads all sample files to the server, and then the server performs subsequent processing. It should be noted that the sample files uploaded by the client described here are not to be compared with the sample files stored in the server, but to find out the sample files that need to be stored in the server from these sample files, and identify these Whether the sample file is suspicious.

步骤S202，检测全部样本文件的等级。Step S202, detecting the grades of all sample files.

服务器在接收到客户端上传的全部样本文件之后，首先检测这些样本文件的等级。After receiving all the sample files uploaded by the client, the server first detects the grades of these sample files.

在本实施例中，所述样本文件的等级包括：安全等级、未知等级、可疑/高度可疑等级、以及恶意等级。对于等级的设置，可以设置等级为10-20时为安全等级，等级为30-40时为未知等级，等级为50-60时为可疑/高度可疑等级，等级大于或等于70时为恶意等级。当然，还可以设置所述等级为其他形式，本发明对此并不加以限制。In this embodiment, the levels of the sample files include: security level, unknown level, suspicious/highly suspicious level, and malicious level. For example, a level of 10-20 may be set as the security level, 30-40 as the unknown level, 50-60 as the suspicious/highly suspicious level, and a level greater than or equal to 70 as the malicious level. Of course, the levels may also be set in other forms, which is not limited in the present invention.

步骤S203，获取未知等级的样本文件，将获取到的未知等级的样本文件作为待扫描样本文件。Step S203, acquiring a sample file of an unknown level, and using the acquired sample file of an unknown level as a sample file to be scanned.

在本实施例中，设定只将可疑级别未知的样本文件作为待扫描样本文件。对于上述步骤S202中检测出的样本文件的等级来说，等级为安全等级的样本文件不是可疑样本文件，等级为可疑/高度可疑等级、以及恶意等级的样本文件，对于这些样本文件不需要再进行扫描；等级为未知等级的样本文件即为可疑级别未知的样本文件，因此还需要进一步对这些未知等级的样本文件进行扫描，以鉴别其是否为可疑样本文件。In this embodiment, only sample files whose suspicious level is unknown are used as sample files to be scanned. Of the levels detected in step S202, sample files at the security level are not suspicious sample files; these, together with sample files at the suspicious/highly suspicious level and the malicious level, do not need to be scanned again. Sample files at the unknown level are those whose suspicious level is unknown, so they need to be scanned further to identify whether they are suspicious sample files.
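Steps S202 and S203 can be illustrated with a short sketch that uses the example level ranges given above. The function names and the sample data are hypothetical, and returning `None` for values that fall between the stated ranges is an assumption, since the text does not say how such values are handled.

```python
from typing import Dict, List, Optional


def classify_level(level: int) -> Optional[str]:
    """Map a numeric level to a category using the example ranges in the text."""
    if 10 <= level <= 20:
        return "security"
    if 30 <= level <= 40:
        return "unknown"
    if 50 <= level <= 60:
        return "suspicious"
    if level >= 70:
        return "malicious"
    # Assumption: values outside the stated ranges are left unclassified.
    return None


def files_to_scan(levels: Dict[str, int]) -> List[str]:
    """Step S203: keep only the files whose level is unknown."""
    return [fid for fid, lvl in levels.items() if classify_level(lvl) == "unknown"]


# Hypothetical levels, for illustration only.
sample_levels = {"f1": 10, "f2": 30, "f3": 55, "f4": 70, "f5": 40}
to_scan = files_to_scan(sample_levels)
print(to_scan)
```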

步骤S204，针对待扫描样本文件，分别计算每个待扫描样本文件被鉴别为可疑的概率。Step S204, for the sample files to be scanned, respectively calculate the probability that each sample file to be scanned is identified as suspicious.

为了提高扫描效率，本申请实施例并不是对全部的待扫描样本文件进行扫描，而是要从待扫描样本文件中选择部分待扫描样本文件进行扫描。因此，在步骤S203中确定出待扫描样本文件之后，还需要进一步对这些待扫描样本文件进行分析，以确定出实际需要扫描的样本文件。In order to improve the scanning efficiency, the embodiment of the present application does not scan all the sample files to be scanned, but selects some sample files to be scanned from the sample files to be scanned for scanning. Therefore, after the sample files to be scanned are determined in step S203, the sample files to be scanned need to be further analyzed to determine the sample files that actually need to be scanned.

在本发明实施例中，依据待扫描样本文件被鉴别为可疑的概率抽取满足条件的样本文件，因此，在本步骤S204中，需要针对待扫描样本文件，分别计算每个待扫描样本文件被鉴别为可疑的概率。In the embodiment of the present invention, sample files that meet the conditions are extracted according to the probability that each sample file to be scanned is identified as suspicious. Therefore, in this step S204, that probability needs to be calculated for each sample file to be scanned.

与上述实施例一相似，为了查杀不断增多的恶意程序，杀毒软件可以定期对样本文件进行扫描，相应的，在该步骤中定期进行概率的计算，并且，为了能够更加准确地确定出待扫描样本文件，每次计算时可以针对全部的待扫描样本文件进行计算，以使扫描更加全面。当然，本发明实施例并不限定于该种方式，本领域技术人员根据实际经验采用其他方式也是可行的。Similar to the first embodiment above, to detect and remove the ever-increasing number of malicious programs, the antivirus software may scan the sample files periodically; correspondingly, the probability calculation in this step is performed periodically. To determine the sample files to be scanned more accurately, each calculation may cover all the sample files to be scanned, so that the scan is more comprehensive. Of course, the embodiments of the present invention are not limited to this manner, and those skilled in the art may adopt other manners based on practical experience.

具体的，可以通过以下子步骤计算每个待扫描样本文件被鉴别为可疑的概率：Specifically, the probability that each sample file to be scanned is identified as suspicious can be calculated through the following sub-steps:

子步骤a1，针对每个待扫描样本文件，获取该待扫描样本文件对应的本次扫描的时间点n₂以及上次扫描的时间点n₁。Sub-step a1: for each sample file to be scanned, obtain the time point n₂ of the current scan and the time point n₁ of the last scan corresponding to that sample file.

其中，时间点n₂即为该待扫描样本文件对应的本次扫描的时间，该时间点n₂可以通过直接读取当前时间获取。The time point n₂ is the time of the current scan for the sample file to be scanned, and it can be obtained by directly reading the current time.

时间点n₁为该待扫描样本文件对应的上次扫描的时间。在本实施例中，可以在本步骤S204分别计算每个待扫描样本文件被鉴别为可疑的概率之前，为每个待扫描样本文件建立一个信息库，在所述信息库中包括该待扫描样本文件对应的上次扫描的时间点n₁。信息库中各个待扫描样本文件的ID为主键，通过ID即可在信息库中查找到对应的待扫描样本文件，进一步获取到该扫描样本文件对应的上次扫描的时间点n₁。The time point n₁ is the time of the last scan for the sample file to be scanned. In this embodiment, before the probability that each sample file to be scanned is identified as suspicious is calculated in this step S204, an information base may be established for each sample file to be scanned, and the information base includes the time point n₁ of the last scan corresponding to that sample file. The ID of each sample file to be scanned is the primary key of the information base, so the corresponding sample file can be looked up by ID and the time point n₁ of its last scan can then be obtained.
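The information base described above can be sketched as a dictionary keyed by file ID. This is only an illustration under stated assumptions: the names `ScanRecord`, `info_base`, `get_window` and `mark_scanned` are hypothetical, time points are assumed to be integer scan-round indices so that they can be substituted for n in Pr(N≥n|α, β), and a file with no record is assumed to start at n₁ = 1.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ScanRecord:
    last_scan: int  # n1: round index of the last scan (assumed integer)


# Information base with the file ID as the primary key.
info_base: Dict[str, ScanRecord] = {}


def get_window(file_id: str, current_round: int) -> Tuple[int, int]:
    """Return (n1, n2) for a file without modifying the information base."""
    record = info_base.get(file_id)
    # Assumption: a file that has never been scanned starts at n1 = 1.
    n1 = record.last_scan if record is not None else 1
    return n1, current_round


def mark_scanned(file_id: str, scan_round: int) -> None:
    """Record that the file was actually scanned in the given round."""
    info_base[file_id] = ScanRecord(last_scan=scan_round)


window_before = get_window("sample-001", 5)
mark_scanned("sample-001", 5)
window_after = get_window("sample-001", 8)
```

Keeping the lookup separate from the update means that only files that were really scanned have their n₁ advanced, which is one possible design choice rather than something the text prescribes.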

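子步骤a2，通过以下公式计算从时间点n₁开始到时间点n₂为止，所述待扫描样本文件在本次扫描中被鉴别为可疑的概率Pr(N≥n₁，N≤n₂|α，β)：Sub-step a2: calculate, by the following formula, the probability Pr(N≥n₁, N≤n₂|α, β) that the sample file to be scanned is identified as suspicious in the current scan, over the period from time point n₁ to time point n₂: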

该子步骤a2具体可以包括：The sub-step a2 may specifically include:

(1)通过以下公式计算每个待扫描样本文件前n-1次没有被鉴别为可疑，第n次被鉴别为可疑的概率Pr(N≥n|α，β)：(1) Calculate the probability Pr(N≥n|α, β) that each sample file to be scanned is not identified as suspicious for the first n-1 times, and is identified as suspicious for the nth time by the following formula:

$\Pr(N \ge n \mid \alpha, \beta) = \begin{cases} 1, & n = 1 \\ \frac{\beta + n - 2}{\alpha + \beta + n - 2} \Pr(N \ge n - 1 \mid \alpha, \beta), & n > 1 \end{cases}$
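The recurrence can be evaluated with a simple loop. The sketch below is illustrative: the function names are hypothetical, α and β are assumed to be positive and already estimated, and `pr_between` uses the identity Pr(n₁ ≤ N ≤ n₂) = Pr(N ≥ n₁) − Pr(N ≥ n₂+1) for an integer-valued N.

```python
def pr_at_least(n: int, alpha: float, beta: float) -> float:
    """Pr(N >= n | alpha, beta), computed iteratively from the recurrence."""
    if n < 1:
        raise ValueError("n must be >= 1")
    p = 1.0  # base case: n = 1
    for m in range(2, n + 1):
        # One step of the recurrence for n = m.
        p *= (beta + m - 2) / (alpha + beta + m - 2)
    return p


def pr_between(n1: int, n2: int, alpha: float, beta: float) -> float:
    """Pr(n1 <= N <= n2 | alpha, beta) as a difference of two tail values."""
    if n2 < n1:
        raise ValueError("n2 must be >= n1")
    return pr_at_least(n1, alpha, beta) - pr_at_least(n2 + 1, alpha, beta)


# With alpha = beta = 1 the tail probability reduces to 1 / n.
print(pr_at_least(3, 1.0, 1.0))
print(pr_between(1, 2, 1.0, 1.0))
```

The loop avoids deep recursion when n is large, which matters if a file has gone many rounds without being flagged.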

在本实施例中，考虑到样本文件随时都有可能受到病毒感染，因此在第n次扫描时，对于前n-1次没有被鉴别为可疑的待扫描文件也需要进行计算(因为即使这些待扫描样本文件前n-1次没有被鉴别为可疑，只是说明之前在扫描时其没有感染病毒，但是在后续也可能感染病毒，因此在第n次扫描时也需要对其进行概率计算)。In this embodiment, a sample file may be infected by a virus at any time, so at the nth scan the calculation is also needed for files that were not identified as suspicious in the first n-1 scans (even if a sample file was not identified as suspicious in the first n-1 scans, this only shows that it was not infected when those scans were made; it may still become infected later, so its probability also needs to be calculated at the nth scan).

下面，具体分析如何计算待扫描样本文件前n-1次没有被鉴别为可疑，第n次被鉴别为可疑的概率Pr(N≥n|α，β)。Next, specifically analyze how to calculate the probability Pr(N≥n|α, β) that the sample file to be scanned is not identified as suspicious for the first n-1 times, but is identified as suspicious for the nth time.

假设每隔时间t对每个待扫描样本文件被鉴别为可疑的概率进行计算，样本文件被鉴别为可疑是一个随机事件，例如以θ概率表示样本被鉴别为可疑，则待扫描样本文件前n-1次没有被鉴别为可疑，第n次被鉴别为可疑的概率为：Assume that the probability that each sample file to be scanned is identified as suspicious is calculated every time t, and that identification of a sample file as suspicious is a random event. If, for example, θ denotes the probability that a sample is identified as suspicious, then the probability that the sample file is not identified as suspicious in the first n-1 scans and is identified as suspicious at the nth scan is:

$\Pr(N = n \mid \theta) = (1 - \theta)^{n-1}\theta$

上述概率Pr(N≥n|θ)服从几何分布，即The above probability Pr(N≥n|θ) obeys the geometric distribution, namely

$\Pr(N \ge n \mid \theta) = (1 - \theta)^{n-1}$
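As a worked step, the tail probability follows from summing the point probabilities as a geometric series, assuming 0 < θ ≤ 1:

$\Pr(N \ge n \mid \theta) = \sum_{m=n}^{\infty} (1 - \theta)^{m-1}\theta = \theta (1 - \theta)^{n-1} \sum_{j=0}^{\infty} (1 - \theta)^{j} = \theta (1 - \theta)^{n-1} \cdot \frac{1}{\theta} = (1 - \theta)^{n-1}$

In words, N ≥ n means exactly that none of the first n-1 scans identified the file as suspicious.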

对于不同的样本文件，参数θ的值是不一样的，假设参数θ服从参数为α和β的贝塔分布，即For different sample files, the value of the parameter θ is different, assuming that the parameter θ obeys the beta distribution with parameters α and β, that is

$Pr PR ((θ θ | | α α,, β β)) = = \frac{{θ θ}^{α α - - 11} {((11 - - θ θ))}^{β β - - 11}}{B B ((α α,, β β))}$

其中， $B (α, β) = {&Integral;}_{0}^{1} t^{α - 1} (1 - t^{β-1}) dt = \frac{Γ (α) Γ (β)}{Γ (α + β)},$ B(α，β)是贝塔函数，Γ(x)是伽马函数，满足Γ(x+1)＝xΓ(x)的性质。in, $B (α, β) = {&Integral;}_{0}^{1} t^{α - 1} (1 - t^{β-1}) dt = \frac{Γ (α) Γ (β)}{Γ (α + β)},$ B(α, β) is a beta function, Γ(x) is a gamma function, which satisfies the property of Γ(x+1)=xΓ(x).
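
The agreement between the integral form and the gamma-function form of B(α, β) can be checked numerically. The following is a minimal Python sketch and is not part of the patent: the function names `beta_by_integration` and `beta_by_gamma` are hypothetical, and the midpoint rule used here assumes α ≥ 1 and β ≥ 1 so that the integrand stays bounded on [0, 1].

```python
import math


def beta_by_integration(alpha: float, beta: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of t^(alpha-1) * (1-t)^(beta-1) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h  # midpoint of the i-th sub-interval
        total += t ** (alpha - 1) * (1 - t) ** (beta - 1)
    return total * h


def beta_by_gamma(alpha: float, beta: float) -> float:
    """B(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)."""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
```

For α = 2 and β = 3, both functions give approximately 1/12.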

Therefore, it follows that

$\Pr(N \ge n \mid \alpha, \beta) = \int_0^1 \Pr(N \ge n \mid \theta)\,\Pr(\theta \mid \alpha, \beta)\,d\theta$

$= \int_0^1 (1 - \theta)^{n - 1} \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}\,d\theta$

$= \int_0^1 \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta + n - 2}}{B(\alpha, \beta)}\,d\theta$

$= \frac{B(\alpha, \beta + n - 1)}{B(\alpha, \beta)} \int_0^1 \frac{\theta^{\alpha - 1} (1 - \theta)^{(\beta + n - 1) - 1}}{B(\alpha, \beta + n - 1)}\,d\theta$

$= \frac{B(\alpha, \beta + n - 1)}{B(\alpha, \beta)}$
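
As a quick sanity check on this closed form, consider the special case α = 1. This case is chosen here purely for illustration; the patent does not single it out. Since B(1, b) = Γ(1)Γ(b)/Γ(b+1) = 1/b, the ratio of beta functions collapses to a simple expression:

```latex
\Pr(N \ge n \mid \alpha = 1, \beta)
  = \frac{B(1, \beta + n - 1)}{B(1, \beta)}
  = \frac{\beta}{\beta + n - 1},
\qquad \text{and for } \beta = 1: \quad
\Pr(N \ge n \mid 1, 1) = \frac{1}{n}.
```

The value is 1 at n = 1 and decreases towards 0 as n grows, as expected of the probability that a file has survived n-1 scans without being identified as suspicious.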

Further, expanding this result gives:

$\Pr(N \ge n \mid \alpha, \beta) = \frac{B(\alpha, \beta + n - 1)}{B(\alpha, \beta)}$

$= \frac{\Gamma(\alpha)\,\Gamma(\beta + n - 1)}{\Gamma(\alpha + \beta + n - 1)} \cdot \frac{1}{B(\alpha, \beta)}$

$= \frac{\beta + n - 2}{\alpha + \beta + n - 2} \cdot \frac{\Gamma(\alpha)\,\Gamma(\beta + n - 2)}{\Gamma(\alpha + \beta + n - 2)} \cdot \frac{1}{B(\alpha, \beta)}$

$= \frac{\beta + n - 2}{\alpha + \beta + n - 2} \cdot \frac{B(\alpha, \beta + n - 2)}{B(\alpha, \beta)}$

$= \frac{\beta + n - 2}{\alpha + \beta + n - 2} \Pr(N \ge n - 1 \mid \alpha, \beta)$
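
The recursion obtained in the last line translates directly into a loop. The Python sketch below is illustrative only, since the patent does not prescribe an implementation; `survival_recursive`, `survival_closed_form` and `log_beta` are hypothetical names. It also evaluates the closed form B(α, β+n-1)/B(α, β) through `math.lgamma`, so that the two expressions can be compared.

```python
import math


def survival_recursive(n: int, alpha: float, beta: float) -> float:
    """Pr(N >= n | alpha, beta) computed with the recursion, starting from Pr(N >= 1) = 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    prob = 1.0  # base case n = 1
    for k in range(2, n + 1):
        # One recursion step: multiply by (beta + k - 2) / (alpha + beta + k - 2).
        prob *= (beta + k - 2) / (alpha + beta + k - 2)
    return prob


def log_beta(a: float, b: float) -> float:
    """log B(a, b), computed from log-gamma values to avoid overflow."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def survival_closed_form(n: int, alpha: float, beta: float) -> float:
    """Pr(N >= n | alpha, beta) = B(alpha, beta + n - 1) / B(alpha, beta)."""
    return math.exp(log_beta(alpha, beta + n - 1) - log_beta(alpha, beta))
```

The loop needs only n-1 multiplications and no special functions, which may matter when the probability has to be recomputed for every sample file at every scan.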

Therefore, the probability Pr(N≥n|α, β) that the sample file to be scanned is not identified as suspicious in the first n-1 scans and is identified as suspicious in the nth scan is finally obtained as:

$\Pr(N \ge n \mid \alpha, \beta) = \begin{cases} 1, & n = 1 \\ \frac{\beta + n - 2}{\alpha + \beta + n - 2} \Pr(N \ge n - 1 \mid \alpha, \beta), & n > 1 \end{cases}$

(2) Replace n in Pr(N≥n|α, β) with n₁ and calculate Pr(N≥n₁|α, β);

(3) Replace n in Pr(N≥n|α, β) with n₂+1 and calculate Pr(N≥n₂+1|α, β);

The probability that the sample file to be scanned is identified as suspicious in this scan is:

$\Pr(N \ge n_1, N \le n_2 \mid \alpha, \beta) = \Pr(N \ge n_1 \mid \alpha, \beta) - \Pr(N \ge n_2 + 1 \mid \alpha, \beta)$
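
Steps (2) and (3) feed a single subtraction: as stated in claim 1, the probability for the current scan is Pr(N≥n₁|α, β) - Pr(N≥n₂+1|α, β). A minimal Python sketch follows, with hypothetical function names and under the assumption that the time points are integers with 1 ≤ n₁ ≤ n₂:

```python
def survival(n: int, alpha: float, beta: float) -> float:
    """Pr(N >= n | alpha, beta) from the recursion, with Pr(N >= 1) = 1."""
    prob = 1.0
    for k in range(2, n + 1):
        prob *= (beta + k - 2) / (alpha + beta + k - 2)
    return prob


def suspicious_probability(n1: int, n2: int, alpha: float, beta: float) -> float:
    """Pr(N >= n1, N <= n2 | alpha, beta) = Pr(N >= n1) - Pr(N >= n2 + 1)."""
    if not 1 <= n1 <= n2:
        raise ValueError("expected 1 <= n1 <= n2")
    return survival(n1, alpha, beta) - survival(n2 + 1, alpha, beta)
```

With α = β = 1, n₁ = 2 and n₂ = 3, the two terms are 1/2 and 1/4, so the result is 0.25.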

Step S205: sorting the sample files to be scanned according to their probability of being identified as suspicious.

After the probability that each sample file to be scanned is identified as suspicious has been calculated in step S204, the sample files to be scanned are sorted by that probability.

Preferably, this embodiment sorts the sample files to be scanned in descending order of their probability of being identified as suspicious, so that the files ranked first are those with the highest suspicious probability.

Step S206: obtaining the number K of files to scan, and extracting the K sample files with the highest suspicious probability from the sorted sample files to be scanned, where K is a positive integer. In this embodiment, the sample files to scan can be selected according to the scanning capability of the third-party antivirus software, that is, the maximum number of files that the software can scan. That maximum can therefore be used as the value of K, and the K sample files with the highest suspicious probability are then extracted for scanning.

Specifically, if the sample files to be scanned were sorted in descending order of suspicious probability in step S205, step S206 simply extracts the first K files from the sorted sample files to be scanned.
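
Steps S205 and S206 amount to a descending sort followed by taking the first K entries. The following Python sketch assumes the probabilities are held in a dictionary keyed by a sample identifier; the function name `extract_top_k` and that data layout are illustrative assumptions, not part of the patent.

```python
from typing import Dict, List


def extract_top_k(probabilities: Dict[str, float], k: int) -> List[str]:
    """Return the ids of the K samples with the highest suspicious probability, highest first."""
    if k <= 0:
        raise ValueError("K must be a positive integer")
    # Step S205: sort in descending order of suspicious probability.
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    # Step S206: take the first K entries (all of them if fewer than K exist).
    return ranked[:k]
```

When K is much smaller than the number of samples, `heapq.nlargest(k, probabilities, key=probabilities.get)` from the standard library would avoid sorting the full set; the full sort is kept here because it mirrors the two steps as described.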

It should be noted that, for steps S204 to S206 above, once the server has determined the sample files to be scanned, the server itself may calculate the probability that each sample file to be scanned is identified as suspicious, sort the files by that probability, and extract the K sample files with the highest suspicious probability from the sorted files. Alternatively, the server may pass the determined sample files to be scanned to the local client, which performs the probability calculation, sorting and extraction and then uploads the K extracted sample files to the server. This embodiment does not limit this.

Step S207: scanning the K sample files to be scanned and identifying the suspicious sample files among them.

After the K sample files to be scanned are determined, the cloud security center uses third-party antivirus software to scan them, identifies the suspicious sample files among them, and stores these on the server side. During subsequent virus removal, they are compared against client files in order to judge whether the client files are safe.

The specific scanning and identification processes can be handled by those skilled in the art according to practical experience; this embodiment of the present invention does not limit them.

This embodiment of the present invention describes in detail how to determine the sample files that actually need to be scanned. Based on the scanning capability K of the third-party antivirus software, the K sample files to be scanned with the highest suspicious probability are determined, and only these K files need to be scanned, which improves file scanning efficiency. Moreover, because the embodiment scans the K sample files with the highest suspicious probability, it can identify as many suspicious sample files as possible and improve the accuracy of sample file scanning.

It should be noted that the foregoing method embodiments are described as a series of combined actions for simplicity of description. Those skilled in the art should understand, however, that the present application is not limited by the described order of actions, because under the present application some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions involved are not necessarily required by the present application.

Referring to FIG. 3, a structural block diagram of a file scanning device according to an embodiment of the present invention is shown. The device includes: a receiving module 301, a level detection module 302, an acquisition module 303, an establishment module 304, a probability calculation module 305, a sorting module 306, an extraction module 307 and a scanning module 308.

Wherein,

The receiving module 301 is adapted to receive all sample files uploaded by the client;

The level detection module 302 is adapted to detect the levels of all sample files before the probability calculation module calculates the probability that each sample file to be scanned is identified as suspicious;

The levels of the sample files include a safe level, an unknown level, a suspicious/highly suspicious level, and a malicious level.

The acquisition module 303 is adapted to acquire the sample files of unknown level and use them as the sample files to be scanned.

The establishment module 304 is adapted to establish an information base for each sample file to be scanned before the probability calculation module calculates the probability that each sample file to be scanned is identified as suspicious;

The information base includes the time point n₁ of the last scan of the corresponding sample file to be scanned.

The probability calculation module 305 is adapted to calculate, for the sample files to be scanned, the probability that each sample file to be scanned is identified as suspicious;

Specifically, the probability calculation module may periodically calculate the probability that each sample file to be scanned is identified as suspicious, and each calculation may cover all of the sample files to be scanned; this embodiment does not limit this.

The probability calculation module 305 may specifically include the following submodules:

The probability calculation submodule is adapted to calculate, by the following formula, the probability Pr(N≥n₁, N≤n₂|α, β) that the sample file to be scanned is identified as suspicious in this scan, from time point n₁ to time point n₂:

$\Pr(N \ge n_1, N \le n_2 \mid \alpha, \beta) = \Pr(N \ge n_1 \mid \alpha, \beta) - \Pr(N \ge n_2 + 1 \mid \alpha, \beta)$

The probability calculation submodule may specifically include the following units:

The probability calculation unit is adapted to calculate, by the following formula, the probability Pr(N≥n|α, β) that each sample file to be scanned is not identified as suspicious in the first n-1 scans and is identified as suspicious in the nth scan:

$\Pr(N \ge n \mid \alpha, \beta) = \begin{cases} 1, & n = 1 \\ \frac{\beta + n - 2}{\alpha + \beta + n - 2} \Pr(N \ge n - 1 \mid \alpha, \beta), & n > 1 \end{cases}$

The sorting module 306 is adapted to sort the sample files to be scanned according to their probability of being identified as suspicious;

Preferably, in this embodiment, the sorting module 306 sorts the sample files to be scanned in descending order of their probability of being identified as suspicious.

The extraction module 307 is adapted to obtain the number K of files to scan and to extract the K sample files with the highest suspicious probability from the sorted sample files to be scanned, where K is a positive integer;

In this embodiment, the number K of files to scan can be determined according to the scanning capability of the third-party antivirus software, that is, the maximum number of files that the third-party antivirus software can scan can be used as the value of K.

If the sorting module 306 sorts the sample files to be scanned in descending order of suspicious probability, the extraction module 307 simply extracts the first K files from the sorted sample files to be scanned.

The scanning module 308 is adapted to scan the K sample files to be scanned and identify the suspicious sample files among them.
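
The preprocessing performed by modules 302 to 304 (level detection, selection of unknown-level samples, and creation of an information base) can be sketched as follows. This is a minimal illustration under stated assumptions: the `Level` enum, the `SampleInfo` record and the function `build_info_bases` are hypothetical names, the levels are assumed to have been detected already, and the initial value given to n₁ is an arbitrary choice, since the text only says that the information base contains the time point of the last scan.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Level(Enum):
    SAFE = "safe"
    UNKNOWN = "unknown"
    SUSPICIOUS = "suspicious/highly suspicious"
    MALICIOUS = "malicious"


@dataclass
class SampleInfo:
    """Hypothetical information base for one sample; only n1 is named in the text."""
    sample_id: str
    last_scan_time: int  # n1, the time point of the last scan


def build_info_bases(levels: Dict[str, Level], initial_time: int = 1) -> List[SampleInfo]:
    """Keep the unknown-level samples and create one information base for each of them."""
    return [
        SampleInfo(sample_id=sample_id, last_scan_time=initial_time)
        for sample_id, level in levels.items()
        if level is Level.UNKNOWN  # only unknown-level samples become samples to be scanned
    ]
```

Samples already rated safe, suspicious or malicious are dropped at this point, so the probability calculation only runs over the samples whose status is still undecided.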

Finally, it should be noted that the receiving module 301, level detection module 302, acquisition module 303, establishment module 304, probability calculation module 305, sorting module 306 and extraction module 307 may be functional modules in the server. Because scanning is performed by third-party antivirus software, the scanning module 308 may be a functional module of that third-party antivirus software.

In addition, the probability calculation module 305, sorting module 306 and extraction module 307 may instead be functional modules in the local client. That is, after the server determines the sample files to be scanned, it passes them to the local client, whose probability calculation module 305, sorting module 306 and extraction module 307 determine the K sample files to be scanned and pass these K files back to the server. This embodiment of the present invention does not limit this.

The device for scanning files according to the embodiment of the present invention calculates, for the sample files to be scanned, the probability that each one is identified as suspicious, sorts the sample files to be scanned by that probability, extracts the K sample files with the highest suspicious probability from the sorted files, and finally scans those K files to identify the suspicious sample files among them. This solves the prior-art problem of low file scanning efficiency caused by having to scan all sample files every day, and achieves the beneficial effect of improved scanning efficiency. Moreover, because the embodiment scans the K sample files with the highest suspicious probability, it can identify as many suspicious sample files as possible and improve the accuracy of sample file scanning.
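
The flow summarized above (calculate, sort, extract, scan) can be put together in one short Python sketch. Everything here is an illustrative assumption rather than the patented implementation: the names `survival` and `scan_round`, the use of a single shared pair (α, β) for all samples, the `scanner` callback standing in for the third-party antivirus software, and the choice to update n₁ to the current time point after a file is scanned.

```python
from typing import Callable, Dict, List


def survival(n: int, alpha: float, beta: float) -> float:
    """Pr(N >= n | alpha, beta) from the recursion, with Pr(N >= 1) = 1."""
    prob = 1.0
    for k in range(2, n + 1):
        prob *= (beta + k - 2) / (alpha + beta + k - 2)
    return prob


def scan_round(
    last_scan: Dict[str, int],
    now: int,
    k: int,
    alpha: float,
    beta: float,
    scanner: Callable[[str], bool],
) -> List[str]:
    """Run one round and return the ids that the scanner flags as suspicious."""
    # Probability calculation (module 305): Pr(N >= n1) - Pr(N >= n2 + 1) with n2 = now.
    scores = {
        sample_id: survival(n1, alpha, beta) - survival(now + 1, alpha, beta)
        for sample_id, n1 in last_scan.items()
    }
    # Sorting (module 306) and extraction of the first K (module 307).
    selected = sorted(scores, key=scores.get, reverse=True)[:k]
    flagged = []
    for sample_id in selected:  # scanning (module 308)
        if scanner(sample_id):
            flagged.append(sample_id)
        last_scan[sample_id] = now  # assumption: record this scan as the new n1
    return flagged
```

With one shared pair (α, β), the score depends only on n₁, so the files scanned longest ago rank first; per-file parameters would change that ordering.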

Because the above device embodiment is basically similar to the method embodiments, its description is relatively brief; for related details, refer to the description of the method embodiments shown in FIG. 1 and FIG. 2.

Based on the above device embodiment, an embodiment of the present invention further provides a file scanning system, which includes a client and a server side,

Wherein,

The client includes:

A file upload module 401, adapted to upload sample files to the storage server;

The server side includes: a storage server 402, a file download server 403 and a scanning server 404,

The storage server 402 includes:

A database 4021, adapted to store the sample files uploaded by the file upload module;

The file download server 403 includes:

A file download module 4031, adapted to download sample files from the database of the storage server and transmit them to the scanning server;

The scanning server 404 includes the file scanning device 4041 described in the above embodiment; for details, refer to the related description of that embodiment.

In addition, it should be noted that the scanning server 404 may further include an input interface 4042 and an output interface 4043. The file download module 4031 transmits the downloaded sample files through the input interface 4042 to the file scanning device 4041 of the scanning server; the file scanning device 4041 processes the sample files and outputs the processing result through the output interface 4043. The specific processing is not discussed in detail here.
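
The data flow between the components of the system (client, storage server, file download server, scanning server) can be sketched with a few small Python classes. This is a structural illustration only: the class and method names are hypothetical, the database is replaced by an in-memory dictionary, and the file scanning device is represented by a caller-supplied function.

```python
from typing import Callable, Dict, List


class StorageServer:
    """Storage server 402; a dictionary stands in for database 4021."""

    def __init__(self) -> None:
        self.database: Dict[str, bytes] = {}

    def store(self, name: str, content: bytes) -> None:
        self.database[name] = content


class Client:
    """Client holding the file upload module 401."""

    def upload(self, storage: StorageServer, name: str, content: bytes) -> None:
        storage.store(name, content)


class ScanningServer:
    """Scanning server 404: input interface, file scanning device, output interface."""

    def __init__(self, device: Callable[[Dict[str, bytes]], List[str]]) -> None:
        self.device = device  # stands in for file scanning device 4041

    def process(self, samples: Dict[str, bytes]) -> List[str]:
        # Samples arrive via the input interface; the result leaves via the output interface.
        return self.device(samples)


class FileDownloadServer:
    """File download server 403 holding the file download module 4031."""

    def transfer(self, storage: StorageServer, scanner: ScanningServer) -> List[str]:
        samples = dict(storage.database)  # download a copy of the stored samples
        return scanner.process(samples)
```

A usage example would upload a few files through `Client.upload`, then call `FileDownloadServer.transfer` with a `ScanningServer` whose device function implements the probability calculation, sorting, extraction and scanning described earlier.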

In the file scanning system proposed in this embodiment, the scanning server can select, from the sample files to be scanned, the subset that satisfies the conditions and scan only those, which improves scanning efficiency. Moreover, because the present invention scans the K sample files with the highest suspicious probability, it can identify as many suspicious sample files as possible and improve the accuracy of sample file scanning.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.

Those skilled in the art will readily appreciate that any combination of the above embodiments is feasible, so any combination of the above embodiments is an implementation of the present application; owing to space limitations, however, this specification does not describe them one by one.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the above description, the structure required to construct such a system is apparent. Furthermore, the present invention is not directed at any particular programming language. It should be understood that the content of the present invention described herein can be implemented in various programming languages, and that the above description of a specific language is intended to disclose the best mode of the present invention.

Numerous specific details are set forth in the description provided herein. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline this disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of the embodiments may be combined into one module, unit or component, and may furthermore be divided into a plurality of sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.

Furthermore, those skilled in the art will understand that, although some embodiments described herein include some features included in other embodiments but not others, combinations of features from different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the file scanning system according to the embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet site, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.

Claims

1. A file scanning system, comprising: a client and a server, wherein,

The client includes:

A file upload module, adapted to upload sample files to a storage server;

The server includes: a storage server, a file download server and a scanning server,

The storage server includes:

A database adapted to store sample files uploaded by the file upload module;

The file download server includes:

A file download module, adapted to download sample files from the database of the storage server and transmit them to the scanning server;

The scanning server includes a file scanning device, and the file scanning device includes:

The probability calculation module is adapted to calculate, for the sample files to be scanned, the probability that each sample file to be scanned is identified as suspicious; the probability calculation module includes: a time point acquisition submodule, adapted to obtain, for each sample file to be scanned, the time point n₂ of this scan and the time point n₁ of the last scan corresponding to the sample file to be scanned; and a probability calculation submodule, adapted to calculate the probability that the sample file to be scanned is identified as suspicious in this scan, from time point n₁ to time point n₂, as Pr(N≥n₁, N≤n₂|α, β) = Pr(N≥n₁|α, β) - Pr(N≥n₂+1|α, β), where the parameters α and β are obtained by performing maximum likelihood estimation on the data of the sample files to be scanned;

A sorting module, adapted to sort the sample files to be scanned according to their probability of being identified as suspicious;

The extraction module is adapted to obtain the number K of files to scan and to extract the K sample files with the highest suspicious probability from the sorted sample files to be scanned, where K is a positive integer;

The scanning module is adapted to scan the K sample files to be scanned, and identify suspicious sample files among them.

2. The system according to claim 1, wherein the file scanning device further comprises:

The level detection module is adapted to detect the levels of all sample files before the probability calculation module calculates the probability that each sample file to be scanned is identified as suspicious, the levels of the sample files including a safe level, an unknown level, a suspicious/highly suspicious level, and a malicious level;

The acquisition module is adapted to acquire the sample files of unknown level and use them as the sample files to be scanned.

3. The system of claim 1, wherein,

The sorting module sorts the sample files to be scanned in descending order of their probability of being identified as suspicious;

The K sample files to be scanned are the first K sample files to be scanned in the sorted sample files to be scanned.

4. The system according to claim 1, wherein the file scanning device further comprises:

The establishment module is adapted to establish an information base for each sample file to be scanned before the probability calculation module calculates the probability that each sample file to be scanned is identified as suspicious, the information base including the time point n₁ of the last scan of the corresponding sample file to be scanned.

5. The system according to claim 1, wherein the probability calculation submodule comprises:

The probability calculation unit is adapted to calculate the probability Pr(N≥n|α, β) that each sample file to be scanned is not identified as suspicious in the first n-1 scans and is identified as suspicious in the nth scan:

$\Pr(N \ge n \mid \alpha, \beta) = \begin{cases} 1, & n = 1 \\ \frac{\beta + n - 2}{\alpha + \beta + n - 2} \Pr(N \ge n - 1 \mid \alpha, \beta), & n > 1 \end{cases}$;

The first replacement unit is adapted to replace n in Pr(N≥n|α, β) with n₁ and calculate Pr(N≥n₁|α, β);

The second replacement unit is adapted to replace n in Pr(N≥n|α, β) with n₂+1 and calculate Pr(N≥n₂+1|α, β);

A difference calculation unit, adapted to calculate the difference between Pr(N≥n₁|α, β) and Pr(N≥n₂+1|α, β) to obtain the probability Pr(N≥n₁, N≤n₂|α, β).