CN116821903A - Detection rule determination and malicious binary file detection method, device and medium - Google Patents
- Publication number
- CN116821903A (application number CN202310597956.3A)
- Authority
- CN
- China
- Prior art keywords
- virus
- sample
- malicious
- file
- binary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Storage Device Security (AREA)
Abstract
The embodiments of the present application provide a method, device and medium for determining detection rules and detecting malicious binary files. In the embodiments, a first decision tree with a malicious binary file detection function and a second decision tree with a virus type detection function for malicious binary files are trained, and detection rules such as a malicious binary file detection rule and virus type detection rules are generated automatically by traversing the trees. Because the sample data of the sample binary files used to train the decision trees fuses the virus detection results of multiple antivirus engines, the generated detection rules can perform association analysis on the virus name strings output by the antivirus engines, both to determine whether a binary file to be detected is a malicious binary file and to identify its virus type. Compared with a single antivirus engine, fusing the detection capabilities of multiple antivirus engines improves the overall virus detection capability and overcomes the false positives and false negatives of any single engine.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, and a medium for determining a detection rule and detecting a malicious binary file.
Background
At present, many antivirus engines with malicious binary file detection functions are on the market. However, binary files span relatively many virus types, including, for example, but not limited to: backdoor programs, high-risk programs, Trojan programs, hacking tools, infectious viruses, ransomware, mining programs, and proxy tools. Because of the relatively large number of virus types, it is difficult for a single antivirus engine to cover the detection of binary files of most virus types.
Disclosure of Invention
Aspects of the present application provide a method, a device and a medium for determining detection rules and detecting malicious binary files, which are used to improve the virus detection capability for malicious binary files.
An embodiment of the present application provides a detection rule determination method, comprising: performing model training with a first sample file set to obtain a first decision tree; performing model training with a second sample file set corresponding to each of a plurality of virus types to obtain a second decision tree corresponding to each virus type; determining a malicious binary file detection rule according to the first decision tree; and determining, according to the second decision tree corresponding to each virus type, the virus type detection rule corresponding to that type. The sample data of a sample binary file in the first sample file set includes: a spliced virus name string, and a malicious result tag indicating whether the file is a malicious binary file. The spliced virus name string is obtained by splicing a plurality of virus name strings, each produced by one of a plurality of antivirus engines performing virus detection on the corresponding sample binary file. The sample data of a sample binary file in the second sample file set includes: the spliced virus name string, and a virus type tag indicating the virus type.
An embodiment of the present application also provides a malicious binary file detection method, comprising: acquiring a binary file to be detected; performing virus detection on the binary file to be detected with a plurality of antivirus engines to obtain a plurality of virus name strings; splicing the plurality of virus name strings to obtain a target virus name string; detecting the target virus name string with the malicious binary file detection rule to determine whether the binary file to be detected is a malicious binary file; and, if it is a malicious binary file, detecting the target virus name string with the virus type detection rule corresponding to each of the plurality of virus types to determine the virus type of the binary file to be detected.
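The detection flow above can be sketched in Python. The rule patterns and engine outputs below are hypothetical stand-ins; in the embodiments the rules are generated automatically from the trained decision trees rather than written by hand:

```python
import re

def detect(virus_names, malicious_rules, type_rules):
    """Sketch of the detection flow: splice the engine outputs into one
    target virus name string, apply the malicious-file rule first, then
    the per-virus-type rules."""
    target = "+".join(virus_names)  # spliced target virus name string
    if not any(re.search(p, target) for p in malicious_rules):
        return {"malicious": False, "virus_type": None}
    for vtype, patterns in type_rules.items():
        if any(re.search(p, target) for p in patterns):
            return {"malicious": True, "virus_type": vtype}
    return {"malicious": True, "virus_type": "unknown"}

result = detect(
    ["Win32/Rozena.AUW", "Trojan:Win32/Swrort.A"],
    malicious_rules=[r"Trojan", r"Backdoor"],
    type_rules={"trojan": [r"Trojan"], "backdoor": [r"Backdoor"]},
)
print(result)  # {'malicious': True, 'virus_type': 'trojan'}
```

A file matching no malicious-rule pattern is reported as non-malicious without any type check, mirroring the two-stage order of the method.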
An embodiment of the present application also provides an electronic device, comprising a memory and a processor. The memory stores a computer program; the processor, coupled to the memory, executes the computer program to perform the steps of the detection rule determination method or the malicious binary file detection method.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement steps in a detection rule determination method or a malicious binary file detection method.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is an application scenario diagram provided in an embodiment of the present application;
FIG. 2 is a flowchart of a method for determining a detection rule according to an embodiment of the present application;
FIG. 3 is a process diagram of an exemplary detection rule generation phase provided by an embodiment of the present application;
FIG. 4 is a flowchart of a malicious binary file detection method according to an embodiment of the present application;
FIG. 5 is another application scenario diagram provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of a detection rule determining apparatus according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a malicious binary file detecting device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent three cases: A alone, both A and B, and B alone, where A and B may be singular or plural. In the text of the present application, the character "/" generally indicates that the associated objects before and after it are in an "or" relationship. In addition, in the embodiments of the present application, "first", "second", "third", etc. only distinguish different objects and have no other special meaning.
First, some words related to the embodiments of the present application will be described:
Antivirus engine: refers to an engine having malicious binary file detection functions, for example, detecting whether a binary file is a malicious binary file and detecting the virus type of a malicious binary file. Virus types include, for example, but are not limited to: backdoor programs, high-risk programs, Trojan programs, hacking tools, infectious viruses, ransomware, mining programs, and proxy tools.
Multiple engine platform: refers to a platform composed of a plurality of antivirus engines.
Binary file: a computer file format in which data is stored in binary form. Binary files on Windows systems refer to Portable Executable (PE) format files; binary files on Linux and Unix systems refer to Executable and Linkable Format (ELF) files. Windows is an operating system built around a graphical user interface. Linux is a freely usable and freely distributed Unix-like operating system that supports multiple users, multiple tasks, multiple threads and multiple CPUs (Central Processing Units). Unix is a powerful multi-user, multi-tasking operating system that supports multiple processor architectures.
Malicious result tag: refers to a tag that indicates whether a binary file to be detected is a malicious binary file.
Virus type tag: refers to a tag that is capable of indicating the type of virus to which the binary file to be detected belongs.
Virus name string: refers to a string (String) representing a virus name, consisting of digits, letters, underscores, and similar characters.
Spliced virus name string: obtained by splicing a plurality of virus name strings together.
Knowledge graph: the knowledge graph includes at least two entities (which may also be referred to as nodes), and whether there is an association relationship between different entities.
Classification accuracy (accuracy): the number of samples divided into pairs divided by the number of all samples, in general, the higher the accuracy, the better the classifier.
Decision Tree: a decision analysis method that, knowing the probability of occurrence of various situations, constructs a decision tree to determine the probability that the expected value of the net present value is greater than or equal to zero, evaluate project risk and judge project feasibility. A decision tree is a tree-like structure; because its decision branches are drawn in a pattern much like the branches of a tree, it is called a decision tree. A decision tree consists of a root node, internal nodes and leaf nodes. Each decision tree has exactly one root node; each internal node represents a test on an attribute, each branch represents a test output, and each leaf node represents a class. A decision tree is generally generated by starting from the root node, selecting an attribute and a split point for that attribute, and splitting the node according to the split point. By repeatedly selecting features and corresponding split points, the decision tree generates child nodes; when the values in a node all belong to a single category (or the variance is small), that node is not split further.
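The idea of generating detection rules by tree traversal, i.e. enumerating root-to-leaf paths and keeping those ending in a target class, can be sketched with a hand-built tree. The node layout and token names here are illustrative only, not the trained trees of the embodiments:

```python
# Each internal node tests whether a token appears in the spliced virus
# name string; each leaf carries a class label.
tree = {
    "token": "Trojan",
    "yes": {"label": "malicious"},
    "no": {
        "token": "Backdoor",
        "yes": {"label": "malicious"},
        "no": {"label": "benign"},
    },
}

def extract_rules(node, path=()):
    """Traverse root-to-leaf paths and emit one rule (a tuple of
    (token, must_be_present) conditions) per 'malicious' leaf."""
    if "label" in node:
        return [path] if node["label"] == "malicious" else []
    rules = []
    rules += extract_rules(node["yes"], path + ((node["token"], True),))
    rules += extract_rules(node["no"], path + ((node["token"], False),))
    return rules

rules = extract_rules(tree)
print(rules)
# [(('Trojan', True),), (('Trojan', False), ('Backdoor', True))]
```

Each emitted tuple reads as one conjunctive detection rule, e.g. "contains 'Backdoor' but not 'Trojan'"; a rule set is the disjunction of all such paths.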
In practical applications, different antivirus engines cover different virus types and their detection capabilities are uneven. Moreover, binary files span relatively many virus types, including, for example, but not limited to: backdoor programs, high-risk programs, Trojan programs, hacking tools, infectious viruses, ransomware, mining programs, and proxy tools. Because of the relatively large number of virus types, it is difficult for a single antivirus engine to cover the detection of binary files of most virus types.
For this reason, the embodiments of the present application provide a method, device and medium for determining detection rules and detecting malicious binary files. In these embodiments, a first decision tree with a malicious binary file detection function and a second decision tree with a virus type detection function for malicious binary files are trained, and detection rules such as a malicious binary file detection rule and virus type detection rules are generated automatically by traversing the trees. Because the sample data of the sample binary files used to train the decision trees fuses the virus detection results of multiple antivirus engines, the generated detection rules can perform association analysis on the virus name strings output by the antivirus engines, both to determine whether a binary file to be detected is a malicious binary file and to identify its virus type. Compared with a single antivirus engine, fusing the detection capabilities of multiple antivirus engines allows binary files of more virus types to be detected; joint decision by multiple engines exploits the detection strengths of each engine to improve the overall virus detection capability, lets their strengths and weaknesses complement each other for a reinforcing effect, and overcomes the false positives and false negatives of any single engine.
Fig. 1 is an application scenario diagram provided in an embodiment of the present application. Referring to fig. 1, a user may submit a binary file to be detected to a multi-engine platform; the multi-engine platform performs virus detection on the binary file with a plurality of different types of antivirus engines and outputs the virus detection results. The multi-engine platform includes, for example: antivirus engine 1, antivirus engine 2, antivirus engine 3, antivirus engine 4, … antivirus engine n, where n is an integer greater than 1. If the binary file to be detected is a malicious binary file, the virus detection result output by an antivirus engine includes a virus name string and a virus type; example virus name strings are Win32/Rozena.AUW, Trojan:Win32/Swrort.A and DeepScan:Generic.Exploit.Shellcode.1.319DAA74. Of course, if the binary file to be detected is not a malicious binary file, the virus detection result output by the antivirus engine indicates that it is a normal binary file.
In practical application, the multi-engine platform can also learn in advance virus detection results obtained by detecting massive malicious binary files, and automatically generate various detection rules such as malicious binary file detection rules, virus type detection rules and the like. For an introduction to the detection rules see below.
Referring to (1) in fig. 1, the cloud server is preloaded with detection rules such as the malicious binary file detection rule and the virus type detection rules. When the multi-engine platform requests the cloud server to detect the binary file to be detected based on the virus name strings output by the antivirus engines, the cloud server first splices the virus name strings to obtain a spliced virus name string. Then, referring to (2) in fig. 1, the cloud server detects the spliced virus name string with the preloaded detection rules. Specifically, the cloud server first applies the malicious binary file detection rule to the spliced virus name string to determine whether the binary file to be detected is a malicious binary file. Then, if it is, the cloud server applies the virus type detection rules to the spliced virus name string to determine the virus type of the binary file to be detected. At this point, through the interaction between the multi-engine platform and the cloud server, the virus detection task for the binary file is complete.
Compared with a single antivirus engine, fusing the detection capabilities of multiple antivirus engines allows binary files of more virus types to be detected; joint decision by multiple engines exploits the detection strengths of each engine to improve the overall virus detection capability, lets their strengths and weaknesses complement each other for a reinforcing effect, and overcomes the false positives and false negatives of any single engine. In addition, detecting with automatically constructed detection rules improves detection efficiency, reduces resource overhead and avoids the cost of manually writing rules. Automatically constructed detection rules also solve the problem that traditional model-based detection cannot repair false positives: the rules can be quickly modified for false positive or false negative cases and quickly brought online, and they have good interpretability, overcoming the slow iteration and lack of interpretability of traditional model-based detection.
It should be noted that, the application scenario shown in fig. 1 is only an exemplary application scenario, and the embodiment of the present application is not limited to the application scenario. The embodiment of the present application does not limit the devices included in fig. 1, nor does it limit the positional relationship between the devices in fig. 1.
In addition, the antivirus engine may be hardware or software. When the antivirus engine is hardware, the antivirus engine is, for example, a mobile phone, a tablet computer, a desktop computer, a wearable intelligent device, an intelligent home device and the like. When the antivirus engine is software, it may be installed in the above-listed hardware device, where the antivirus engine is, for example, a plurality of software modules or a single software module, etc., embodiments of the present application are not limited. The cloud server may be hardware or software. When the cloud server is hardware, the cloud server is a single server or a distributed server cluster formed by a plurality of servers. When the cloud server is software, the cloud server may be a plurality of software modules or a single software module, and the embodiment of the application is not limited.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Fig. 2 is a flowchart of a detection rule determination method according to an embodiment of the present application. The method may be executed by a detection rule determining apparatus, which may be composed of software and/or hardware and may generally be configured in various electronic devices such as a cloud server.
Referring to fig. 2, the method may include the steps of:
201. Perform model training with the first sample file set to obtain a first decision tree.
202. Perform model training with the second sample file set corresponding to each of the plurality of virus types to obtain the second decision tree corresponding to each virus type.
203. Determine the malicious binary file detection rule according to the first decision tree.
204. Determine the virus type detection rule corresponding to each virus type according to the corresponding second decision tree.
In this embodiment, a decision tree having a malicious binary file detection function is trained, and a decision tree having a virus type detection function for malicious binary files is also trained. For ease of understanding and distinction, the former is referred to as the first decision tree and the latter as the second decision tree. There may be multiple second decision trees: different virus types correspond to different second decision trees, and each second decision tree judges whether the virus type of the binary file to be detected is the virus type that this tree can identify.
In this embodiment, the binary files used in the model training stage are referred to as sample binary files. Referring to fig. 2, in the detection rule generation stage, sample data of a massive number of sample binary files needs to be collected. Each sample binary file has already been subjected to virus detection by a plurality of antivirus engines. First, the virus detection results of the sample binary file output by the antivirus engines are collected; the result output by any antivirus engine includes, for example, but is not limited to, a virus name string and a virus type. For any sample binary file, suppose the virus name strings output by the n antivirus engines are denoted s1, s2, s3, …, sn.
Then, the sample binary files are labeled manually or automatically, for example with the virus type and a malicious score. The malicious score reflects the degree of security risk posed by the sample binary file and takes a value in, for example, [0,100]; the higher the malicious score, the higher the security risk.
When labeling the virus type of a sample binary file, the virus types output by the individual antivirus engines can be combined. For example, among the virus types output by the engines, the most frequently occurring virus type is selected as the final virus type of the sample binary file. Alternatively, the virus type with the highest malicious score is selected; or one virus type is chosen as needed from among those occurring most frequently; or the virus type output by the antivirus engine with the best detection performance is selected; or, combined with expert experience, any one of the output virus types is chosen as the final virus type. The labeling strategy is not limited.
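The first labeling strategy above (pick the most frequent virus type across engines) can be sketched as:

```python
from collections import Counter

def label_virus_type(engine_types):
    """Label with the virus type reported most often across the engines.
    Ties are broken by first occurrence here; the text leaves
    tie-breaking open (choose as needed)."""
    return Counter(engine_types).most_common(1)[0][0]

print(label_virus_type(["trojan", "trojan", "backdoor"]))  # trojan
```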
Then, after the virus type and malicious score of a sample binary file are labeled, the sample data is collated. The collated sample data includes, for example, but is not limited to: a file identifier, the virus name strings output by each of the plurality of antivirus engines, the labeled virus type and the labeled malicious score. The file identifier uniquely identifies the sample binary file and includes, for example, but is not limited to: a file name, a random number, a timestamp or the file MD5 (Message-Digest Algorithm 5) hash.
In practical applications, the sample data of a sample binary file may be encapsulated in a data format defined as needed. For example, the data format is: [file MD5: s1,s2,s3,…,sn|virus type|malicious score], where | is the field separator. Under this format, the collated sample data of a sample binary file includes the file MD5, the virus name strings of the sample binary file output by the n antivirus engines, the labeled virus type and the labeled malicious score.
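Packing sample data into a record of this form might look like the following; the MD5 value and field contents are made up for illustration:

```python
def format_record(file_md5, virus_names, virus_type, malicious_score):
    """Pack sample data as [file MD5: s1,s2,...,sn|virus type|malicious score],
    using '|' as the field separator as described above."""
    return f"[{file_md5}: {','.join(virus_names)}|{virus_type}|{malicious_score}]"

rec = format_record("0123abcd", ["s1", "s2", "s3"], "trojan", 90)
print(rec)  # [0123abcd: s1,s2,s3|trojan|90]
```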
In this embodiment, sample data of massive sample binary files is collected to construct an initial sample file set. The sample data of a sample binary file in the initial sample file set includes one or more of: the file identifier, the spliced virus name string, the virus type and the malicious score. The spliced virus name string is obtained by splicing the multiple virus name strings of the sample binary file; denote it S, where S = s1+s2+s3+…+sn. In practical applications, the order of the antivirus engines participating in the splicing can be fixed, and the virus name strings are spliced in that order.
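Splicing in a fixed engine order can be sketched as follows; the engine names and the use of "+" as a separator are assumptions taken from the S = s1+s2+…+sn notation above:

```python
ENGINE_ORDER = ["engine1", "engine2", "engine3"]  # hypothetical fixed order

def splice(results):
    """Concatenate virus name strings in the configured engine order,
    skipping engines that returned no detection."""
    return "+".join(results[e] for e in ENGINE_ORDER if results.get(e))

s = splice({"engine1": "Win32/Rozena.AUW", "engine3": "Trojan:Win32/Swrort.A"})
print(s)  # Win32/Rozena.AUW+Trojan:Win32/Swrort.A
```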
In this embodiment, the sample file set for training the first decision tree may be obtained from the initial sample file set. For ease of distinction and understanding, it is referred to as the first sample file set. The sample data of a sample binary file in the first sample file set includes at least the spliced virus name string and a malicious result tag indicating whether the file is a malicious binary file, but is not limited thereto. Since the first decision tree is trained with the goal of determining whether a binary file to be detected is malicious, the sample binary files whose malicious result tags indicate malicious binary files are taken as positive samples (also called black samples), and those whose tags indicate non-malicious binary files are taken as negative samples (also called white samples).
Further, optionally, to improve the discrimination accuracy of the first decision tree for malicious binary files, a score threshold may be set flexibly as needed, for example 80 points. Sample binary files in the initial sample file set whose malicious scores are greater than or equal to the score threshold are assigned malicious result tags indicating malicious binary files; those whose malicious scores are less than the score threshold are assigned malicious result tags indicating non-malicious binary files.
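This partition by malicious score can be sketched as follows, using the threshold of 80 from the example above; the sample dictionaries are illustrative:

```python
SCORE_THRESHOLD = 80  # score threshold from the example above

def split_by_score(samples):
    """Split the initial sample set: score >= threshold -> malicious tag
    (black sample); score < threshold -> non-malicious tag (white sample)."""
    black = [s for s in samples if s["score"] >= SCORE_THRESHOLD]
    white = [s for s in samples if s["score"] < SCORE_THRESHOLD]
    return black, white

black, white = split_by_score([{"md5": "a", "score": 90}, {"md5": "b", "score": 60}])
print(len(black), len(white))  # 1 1
```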
In this embodiment, the sample file sets for training the second decision trees may also be obtained from the initial sample file set. For ease of understanding and distinction, such a set is referred to as a second sample file set. The sample data of a sample binary file in a second sample file set includes at least the spliced virus name string and a virus type tag indicating the virus type, but is not limited thereto.
Note that each virus type requires its own corresponding second sample file set; that is, for multiple virus types, multiple second sample file sets need to be prepared. Taking backdoor programs, high-risk programs and Trojan programs as example virus types: the sample binary files in the second sample file set for backdoor programs carry backdoor program tags; those in the set for high-risk programs carry high-risk program tags; and those in the set for Trojan programs carry Trojan program tags.
It should be noted that, when training the second decision tree corresponding to any target virus type among the plurality of virus types with the goal of identifying the virus type to which the binary file to be detected belongs, the sample binary files in the second sample file set corresponding to the target virus type are used as positive samples, and the sample binary files in the second sample file sets corresponding to the other virus types are used as negative samples. Other virus types refer to the virus types other than the target virus type among the plurality of virus types. For example, when training the second decision tree corresponding to the backdoor program, the sample binary files of the backdoor program are positive samples and the sample binary files of other virus types are negative samples, where other virus types include, for example, but are not limited to: high-risk programs, Trojan horse programs, hacking tools, file-infecting viruses, ransomware, mining programs, proxy tools, and the like.
Further optionally, in order to improve the efficiency of sample construction, the sample binary files in the first sample file set whose malicious result labels indicate malicious binary files may be classified by virus type, so as to obtain the second sample file set of each of the plurality of virus types.
In practical application, the spliced virus name strings S corresponding to different malicious binary files may be identical, and if the malicious scores corresponding to those files differ, an ambiguity problem arises. For example, if the same spliced virus name string S corresponds to two malicious scores G1 and G2, with G1 = 90 and G2 = 60, then sample binary files sharing the same string S may receive contradictory malicious result tags: relative to the score threshold, the samples with G2 = 60 would be labeled non-malicious while the samples with G1 = 90 would be labeled malicious. Labeling the same spliced virus name string differently in this way is unreasonable and needs to be corrected.
Based on this, further optionally, referring to fig. 3, ambiguous data correction may also be performed before the first sample file set is acquired from the initial sample file set. As an example, it is determined whether the same spliced virus name string in the initial sample file set has a plurality of malicious scores; if so, the number of samples corresponding to each of the plurality of malicious scores is acquired; and the malicious scores of the sample binary files having the same spliced virus name string in the initial sample file set are corrected according to the sample counts corresponding to the malicious scores. For example, the malicious score with the largest number of samples is determined as the final malicious score of the sample binary files having the same spliced virus name string. For another example, one malicious score is selected from among the malicious scores tied for the largest sample count as the final malicious score, which is not limited herein.
For example, suppose the same spliced virus name string S corresponds to two malicious scores G1 = 90 and G2 = 60. If the number of sample binary files with G1 = 90 in the initial sample file set is 10 and the number with G2 = 60 is 30, then the malicious score of all samples with the spliced virus name string S is corrected to G2 = 60.
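The majority-vote correction described above can be sketched as follows; this is a minimal illustration under assumed data shapes, and the function and variable names are hypothetical rather than taken from the patent:

```python
from collections import Counter

def correct_ambiguous_scores(samples):
    """Resolve conflicting malicious scores for identical spliced virus
    name strings by majority vote over sample counts.

    `samples` is a list of (spliced_name_string, malicious_score) pairs;
    the corrected score for each string is the score held by the most samples.
    """
    counts = {}  # string -> Counter mapping score -> number of samples
    for name, score in samples:
        counts.setdefault(name, Counter())[score] += 1
    corrected = {name: c.most_common(1)[0][0] for name, c in counts.items()}
    return [(name, corrected[name]) for name, _ in samples]

# The string S appears with score 90 on 10 samples and score 60 on 30 samples,
# so every occurrence is corrected to 60, the majority score.
data = [("S", 90)] * 10 + [("S", 60)] * 30
fixed = correct_ambiguous_scores(data)
```

The same majority-vote logic applies unchanged when correcting ambiguous virus types instead of scores.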
In practical application, the spliced virus name strings S corresponding to different malicious binary files may likewise be identical, and if the virus types corresponding to those files differ, an ambiguity problem arises. For example, the same spliced virus name string S may correspond to two virus types, namely a mining program and a ransomware program. In that case, the sample binary files sharing the same string S may be divided into second sample file sets of different virus types, for example, the mining-program samples with string S divided into the sample file set of mining programs and the ransomware samples with string S divided into the sample file set of ransomware programs, which is unreasonable and needs to be corrected.
Based on the above, before the second sample file set is acquired, it can also be determined whether the same spliced virus name string in the initial sample file set has a plurality of virus types; if so, the number of samples corresponding to each of the plurality of virus types is acquired; and the virus types of the sample binary files having the same spliced virus name string in the initial sample file set are corrected according to the sample counts corresponding to the virus types. For example, the virus type with the largest number of samples is determined as the final virus type of the sample binary files having the same spliced virus name string. For another example, one virus type is selected from among the virus types tied for the largest sample count as the final virus type, which is not limited herein.
Referring to fig. 3, model training is performed using the constructed first sample file set to obtain the first decision tree. As an optional implementation, when training the first decision tree, a first word segmentation word stock is constructed from the plurality of spliced virus name strings acquired from the first sample file set; for any sample binary file in the first sample file set, word segmentation is performed on the spliced virus name string of the sample binary file; the word segmentation result of the sample binary file is vectorized using the first word segmentation word stock to obtain the word vector sequence of the sample binary file; and model training is performed using the word vector sequences and malicious result labels of the plurality of sample binary files to obtain the first decision tree.
Specifically, a large number of spliced virus name strings are extracted from the first sample file set, and word segmentation is performed on each of the virus name strings to obtain a plurality of segmented words (also referred to as virus name terms); a word stock is then constructed from the segmented words to obtain the first word segmentation word stock. Further optionally, in order to improve the word segmentation effect, segmentation may be performed based on an N-Gram algorithm (also referred to as an N-Gram model), a sliding-window tokenization algorithm at the word level.
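The punctuation-based segmentation and word-level N-Gram windowing can be sketched as follows; the helper names and the split pattern are illustrative assumptions, not prescribed by the patent:

```python
import re

def tokenize(name_string):
    """Split a spliced virus name string into terms on punctuation."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", name_string) if t]

def ngrams(tokens, n=2):
    """Word-level N-Gram: slide a window of n tokens over the sequence."""
    if len(tokens) < n:
        return [tuple(tokens)]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toks = tokenize("Win32/Rozena.AUW")   # ['Win32', 'Rozena', 'AUW']
pairs = ngrams(toks, 2)               # [('Win32', 'Rozena'), ('Rozena', 'AUW')]
```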
The first word segmentation word stock is used to vectorize the spliced virus name string of a sample binary file, and the length of the vector equals the length of the first word segmentation word stock. During vectorization, if the spliced virus name string contains a word of the first word segmentation word stock, the element at the corresponding position of the vector is 1; otherwise, it is 0. For example, if the spliced virus name string S is Win32/Rozena.AUW, its segmented words are Win32, Rozena and AUW; against the first word segmentation word stock (Win32, Rozena, AUW, TTT, TTT1, TTT2), the word vector sequence corresponding to S is [1,1,1,0,0,0]. For more description of vectorizing strings with a word stock, reference may be made to the One-Hot encoding mode, which is not repeated herein.
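The vectorization step can be reproduced on the example from the text; `one_hot` is a hypothetical helper name for this one-hot-style encoding sketch:

```python
def one_hot(tokens, vocab):
    """One-hot style vector over the word stock: 1 if the vocabulary word
    appears among the string's segmented words, else 0; the vector length
    equals the vocabulary length."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

# The word stock (Win32, Rozena, AUW, TTT, TTT1, TTT2) applied to the
# segmented words of "Win32/Rozena.AUW" yields [1, 1, 1, 0, 0, 0].
vocab = ["Win32", "Rozena", "AUW", "TTT", "TTT1", "TTT2"]
vec = one_hot(["Win32", "Rozena", "AUW"], vocab)
```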
During model training, the word vector sequences of the plurality of sample binary files and their malicious result labels are used to obtain the first decision tree. Each sample binary file has a word vector sequence, which can also be understood as a feature sequence. A value of 1 in the word vector sequence represents that the corresponding word of the word segmentation word stock is present in the string, i.e., the feature (word) exists; a value of 0 represents that the corresponding word is absent, i.e., the feature does not exist. The word vector sequences of all sample binary files are fed to a decision tree model, and a decision tree is trained according to decision tree decision logic; for the decision logic, reference is made to the above, and for more description, reference may be made to the prior art. For example, by analyzing the word vector sequence of each sample binary file, the splitting index (such as information entropy or the Gini index) of each feature over the whole sample set is calculated, the preferred feature is selected as the dividing node of a tree branch, and this cycle continues until the whole decision tree is constructed.
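The node-selection step, computing a splitting index for each feature and choosing the best feature as the dividing node, can be sketched with the Gini index; this is a simplified single-split illustration under assumed binary features and labels, not the full tree-construction procedure:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p * p - (1 - p) * (1 - p)

def best_split(X, y):
    """Pick the feature whose presence/absence split minimizes the
    weighted Gini impurity: the dividing-node selection step."""
    n = len(y)
    best_f, best_score = None, float("inf")
    for f in range(len(X[0])):
        left = [y[i] for i in range(n) if X[i][f] == 1]   # feature present
        right = [y[i] for i in range(n) if X[i][f] == 0]  # feature absent
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_f, best_score = f, score
    return best_f

# Feature 0 perfectly separates malicious (1) from non-malicious (0) samples,
# so it is chosen as the dividing node.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
```

In practice a library implementation such as a standard decision tree learner would replace this sketch; the point is only which feature the splitting index prefers.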
In this embodiment, the first decision tree may be regarded as a binary classification model, i.e., one that distinguishes whether the binary file to be detected is a malicious binary file. Referring to fig. 3, the root node and internal nodes of the first decision tree correspond to virus name terms, and the leaf nodes correspond to malicious labels indicating malicious binary files or non-malicious labels indicating non-malicious binary files, thereby realizing binary classification of binary files.
Referring to fig. 3, model training is further performed by using the second sample file set corresponding to each mined virus type, so as to obtain a second decision tree corresponding to each virus type.
In practical application, a second word segmentation word stock is constructed from a plurality of spliced virus name strings acquired from the second sample file sets of all or some of the virus types: word segmentation is performed on each of the virus name strings to obtain a plurality of segmented words, and a word stock is constructed from the segmented words to obtain the second word segmentation word stock. For the construction of the second word segmentation word stock, reference may be made to the construction of the first word segmentation word stock, which is not limited thereto.
When training the second decision tree corresponding to any target virus type among the plurality of virus types, the sample binary files in the second sample file set corresponding to the target virus type are used as positive samples, and the sample binary files in the second sample file sets corresponding to the other virus types are used as negative samples. For any sample binary file among the positive or negative samples, word segmentation is performed on its spliced virus name string; the segmentation result is vectorized using the second word segmentation word stock to obtain the word vector sequence of the sample binary file; and model training is performed using the word vector sequences of the plurality of sample binary files and their virus type labels to obtain the second decision tree. It is noted that the trained second decision tree of the target virus type is a binary classification model that distinguishes whether the virus type of a malicious binary file is the target virus type or another virus type (i.e., not the target virus type). Since each of the plurality of virus types has its corresponding second decision tree, the plurality of second decision trees together form a multi-class model. Referring to the example of fig. 3, the second decision tree of the backdoor program classifies whether the virus type of the malicious binary file is a backdoor program: if the virus type of the malicious binary file is not a backdoor program, it outputs the identification result of other programs; if the virus type of the malicious binary file is a backdoor program, it outputs the identification result of the backdoor program.
The second decision tree of the Trojan horse program classifies whether the virus type of the malicious binary file is the Trojan horse program, and if the virus type of the malicious binary file is not the Trojan horse program, the second decision tree of the Trojan horse program outputs the identification results of other programs. If the virus type of the malicious binary file is Trojan horse program, a second decision tree of the Trojan horse program outputs an identification result of the Trojan horse program.
In this embodiment, after obtaining a first decision tree capable of distinguishing whether the binary file to be detected is a malicious binary file and obtaining a plurality of second decision trees for distinguishing the virus type to which the binary file to be detected belongs, referring to fig. 3, an automatic rule generation link may be entered.
Specifically, a decision tree is a set of decision rules: each path from the root node to a leaf node corresponds to one decision rule, so the number of decision rules equals the number of leaf nodes, and each decision rule is a detection rule for deciding whether the decision result corresponding to its leaf node holds.
In this embodiment, the paths from the root node to each leaf node in the first decision tree are traversed to obtain decision rules corresponding to each path, and the decision rules corresponding to each path are combined to obtain a malicious binary file detection rule for judging whether the binary file to be detected is a malicious binary file.
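The traversal can be sketched over a toy tree; the nested-dict tree representation and the function name are illustrative assumptions, not the patent's data format:

```python
def collect_rules(node, path=()):
    """Depth-first traversal from the root: each root-to-leaf path yields
    one decision rule, so the number of rules equals the number of leaves.
    A rule is the sequence of (virus name term, hit?) tests along the path."""
    if "leaf" in node:  # leaf node: emit the accumulated path and its verdict
        return [(path, node["leaf"])]
    rules = []
    rules += collect_rules(node["hit"], path + ((node["term"], True),))
    rules += collect_rules(node["miss"], path + ((node["term"], False),))
    return rules

# Toy first decision tree: the root tests the term "Rozena",
# the miss branch then tests the term "Trojan".
tree = {
    "term": "Rozena",
    "hit": {"leaf": "malicious"},
    "miss": {"term": "Trojan",
             "hit": {"leaf": "malicious"},
             "miss": {"leaf": "benign"}},
}
rules = collect_rules(tree)  # 3 leaves -> 3 decision rules
```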
Further optionally, in order to obtain a higher-quality malicious binary file detection rule, when determining the malicious binary file detection rule according to the first decision tree, it may be judged, for each leaf node in the first decision tree, whether the classification accuracy of the leaf node meets a specified condition (the first decision tree further comprises a root node); if the specified condition is met, the regular expression corresponding to the leaf node is determined according to the sub-regular expressions of the branches on the path from the root node to the leaf node; and the regular expressions corresponding to at least one leaf node in the first decision tree are combined to obtain the malicious binary file detection rule.
Specifically, the classification accuracy of a leaf node reflects how well all samples (including positive and negative samples) involved in training the first decision tree are correctly distinguished under the malicious result label corresponding to that leaf node. The specified condition is flexibly set according to actual application requirements; for example, the specified condition may be a classification accuracy of 100%. If all positive samples and all negative samples are correctly distinguished (i.e., correctly classified) under the malicious result label corresponding to the leaf node, the classification accuracy of the leaf node is 100% and meets the specified condition; if some positive or negative samples are classified incorrectly under that label, the classification accuracy is below 100% and does not meet the specified condition.
In this embodiment, if the classification accuracy of the leaf node does not meet the specified condition, the decision rule corresponding to the leaf node is abandoned. If the classification accuracy of the leaf node meets the specified condition, the decision rule corresponding to the leaf node can be selected as a part of the malicious binary file detection rule.
In this embodiment, if the classification accuracy of the leaf node meets the specified condition, determining a regular expression corresponding to the leaf node according to sub-regular expressions of a plurality of branches on a path from the root node to the leaf node; and combining regular expressions corresponding to at least one leaf node in the first decision tree to obtain a malicious binary file detection rule.
In this embodiment, a sub-regular expression may be understood as an expression contained within a regular expression. The sub-regular expression indicates whether the virus name term corresponding to the target node connected to a branch is hit, where the target node is the higher-level node of the two nodes connected by the branch. Assuming the virus name term is denoted as feature, the sub-regular expression for hitting the term may take the lookahead form (?=(.*feature.*)), and the sub-regular expression for missing the term may take the negative lookahead form (?!(.*feature.*)). Starting from the root node, the sub-regular expressions of the branches on the path from the root node to a leaf node are combined in order to obtain the regular expression corresponding to that leaf node. For example, a regular expression corresponding to one leaf node is (?!(.*SSHDoor.*))(?!(.*RedXor.*))(?!(.*Ladvix.*)): it matches a spliced virus name string only when the string contains none of the three strings SSHDoor, RedXor and Ladvix; otherwise, the matching cannot succeed.
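As a sketch, assuming the sub-regular expressions use standard lookahead syntax ((?=...) for hitting a term, (?!...) for missing a term), the following shows how such a combined rule matches a spliced virus name string; the rule and the sample strings are hypothetical:

```python
import re

# Hypothetical leaf-node rule in lookahead form: the spliced string must
# contain the term "Rozena" and must not contain the term "SSHDoor".
rule = re.compile(r"(?=(.*Rozena.*))(?!(.*SSHDoor.*))")

# Lookaheads are zero-width, so a successful match consumes no characters;
# re.match anchors the test at the start of the string.
hit = bool(rule.match("Kaspersky:Win32/Rozena.AUW"))      # Rozena, no SSHDoor
miss = bool(rule.match("Linux/SSHDoor.AB;Win32/Rozena"))  # SSHDoor present
```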
In practical application, the classification accuracy of one or more leaf nodes in the first decision tree may meet the specified condition, so the regular expressions of these leaf nodes can be combined into the malicious binary file detection rule, which expands the application range of the rule and enables virus detection on binary files of more virus types. It will be appreciated that each regular expression in the combined malicious binary file detection rule remains an independent decision rule.
When combining the regular expressions of multiple leaf nodes, a separator may be employed; for example, the regular expressions corresponding to different leaf nodes may be joined by an alternation symbol such as "|", so that a spliced virus name string hitting any one of the combined regular expressions hits the malicious binary file detection rule.
In this embodiment, for a second decision tree of any target virus type of the plurality of virus types, a corresponding virus type detection rule of the target virus type is determined according to the second decision tree.
In this embodiment, the paths from the root node to each leaf node in the second decision tree are traversed to obtain the decision rule corresponding to each path, and the decision rules corresponding to the paths are combined to obtain the virus type detection rule for judging whether a malicious binary file belongs to the target virus type.
Further optionally, in order to obtain a higher-quality virus type detection rule, when determining the virus type detection rule according to the second decision tree, it may be judged, for each leaf node in the second decision tree, whether the classification accuracy of the leaf node meets a specified condition (the second decision tree further comprises a root node); if the specified condition is met, the regular expression corresponding to the leaf node is determined according to the sub-regular expressions of the branches on the path from the root node to the leaf node; and the regular expressions corresponding to at least one leaf node in the second decision tree are combined to obtain the virus type detection rule of the target virus type. The operation principle of traversing the second decision tree to obtain the virus type detection rule is the same as that of traversing the first decision tree to obtain the malicious binary file detection rule, and will not be repeated herein.
Further optionally, modification of malicious binary detection rules or virus type detection rules is also supported. For example, after a malicious binary file detection rule or a virus type detection rule is online, an online operation result is analyzed, whether a false report or a missing report of the virus detection result exists or not is judged, and when the false report or the missing report exists, the detection rule is quickly modified, and the online operation is quickly restarted.
According to the detection rule determining method provided by the embodiment of the application, a first decision tree with a malicious binary file detection function and second decision trees with a virus type detection function for malicious binary files are trained, and multiple decision rules, such as the malicious binary file detection rule and the virus type detection rules, are automatically generated based on tree traversal. Because the sample data of the sample binary files participating in decision tree training fuses the virus detection results of a plurality of antivirus engines, the generated detection rules can perform association analysis on the virus name strings output by the antivirus engines to determine whether a binary file to be detected is a malicious binary file and to identify its virus type. Compared with a single anti-virus engine, the method fuses the detection capabilities of multiple anti-virus engines and can detect binary files of more virus types; through the joint decision of multiple anti-virus engines, the detection advantages of individual engines improve the overall virus detection capability, their strengths and weaknesses complement each other to achieve an enhancement effect, and the false positives or false negatives of any single anti-virus engine are overcome.
In addition, performing detection with the automatically constructed detection rules improves detection efficiency, reduces resource overhead, and avoids the cost of manually writing rules. The automatically constructed detection rules solve the problem that the traditional model-based detection mode cannot repair false positives; the detection rules can also be quickly modified for false positive or false negative cases and quickly brought back online, and they have good interpretability, overcoming the defects of slow iteration and poor interpretability of the traditional model-based detection mode.
In practical application, after various decision rules such as malicious binary file detection rules and virus type detection rules are obtained, the decision rules can be used for detecting the malicious binary file. Therefore, the embodiment of the application also provides a malicious binary file detection method based on the detection rule.
Fig. 4 is a flowchart of a malicious binary file detection method according to an embodiment of the present application. The method may be executed by a malicious binary file detection device, which may be implemented in software and/or hardware and may generally be configured in various electronic devices such as cloud servers. Referring to fig. 4, the method may include the following steps:
401. Obtain the binary file to be detected.
402. Perform virus detection on the binary file to be detected using a plurality of antivirus engines respectively, to obtain a plurality of virus name strings.
403. Splice the plurality of virus name strings to obtain a target virus name string.
404. Detect the target virus name string using the malicious binary file detection rule, to determine whether the binary file to be detected is a malicious binary file.
405. If the binary file to be detected is a malicious binary file, perform virus type detection on the target virus name string using the virus type detection rules corresponding to the plurality of virus types respectively, to determine the virus type to which the binary file to be detected belongs.
The determination manner of the malicious binary file detection rule and the virus type detection rule corresponding to each of the plurality of virus types may be referred to the related description of the foregoing embodiment, and will not be repeated herein.
In this embodiment, the binary file to be detected refers to a binary file that needs to undergo virus detection. First, virus detection is performed on the binary file to be detected using a plurality of antivirus engines respectively, the virus detection results output by the antivirus engines are collected, and a plurality of virus name strings are extracted from the virus detection results. Next, the extracted virus name strings are spliced to obtain the spliced virus name string corresponding to the binary file to be detected, referred to as the target virus name string. Then, since the malicious binary file detection rule is a regular expression, the target virus name string is regex-matched against the malicious binary file detection rule to judge whether the binary file to be detected is a malicious binary file. If the binary file to be detected is not a malicious binary file, prompt information that it is not a malicious binary file is output. If the binary file to be detected is a malicious binary file, the target virus name string is regex-matched against the virus type detection rule of each virus type, and the final virus type of the binary file to be detected is determined according to the matching results. Taking backdoor programs and Trojan horse programs as example virus types: if the matching result of the backdoor program rule is a miss, the virus type of the file to be detected is another program (i.e., not a backdoor program); if the matching result of the Trojan horse program rule is a hit, the virus type of the file to be detected is a Trojan horse program. Of course, if none of the virus type detection rules is matched, that is, all matching results indicate other programs, the default virus type of "suspicious program" can be output.
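The two-stage matching flow (malicious binary file rule first, then per-type rules with a "suspicious program" fallback) can be sketched as follows; the rules, names and API shape are illustrative assumptions:

```python
import re

def detect(spliced_name, malicious_rule, type_rules):
    """Stage 1: regex-match the malicious binary file detection rule.
    Stage 2: for malicious files, regex-match each virus type rule;
    if none matches, fall back to the default 'suspicious program' type."""
    if not re.match(malicious_rule, spliced_name):
        return ("not malicious", None)
    for vtype, pattern in type_rules.items():
        if re.match(pattern, spliced_name):
            return ("malicious", vtype)
    return ("malicious", "suspicious program")

# Hypothetical rules in lookahead form.
type_rules = {
    "backdoor": r"(?=(.*SSHDoor.*))",
    "trojan": r"(?=(.*Rozena.*))",
}
verdict = detect("Win32/Rozena.AUW", r"(?=(.*Rozena.*))", type_rules)
clean = detect("Win64/CleanApp", r"(?=(.*Rozena.*))", type_rules)
```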
Compared with a single anti-virus engine, the malicious binary file detection method provided by the embodiment of the application fuses the detection capabilities of multiple anti-virus engines, can cover binary files of more virus types, and, through the joint decision of multiple anti-virus engines, uses the detection advantages of individual engines to improve the overall virus detection capability; the engines' strengths and weaknesses complement each other to achieve an enhancement effect and overcome the false positives or false negatives of any single anti-virus engine. In addition, performing detection with the automatically constructed detection rules improves detection efficiency, reduces resource overhead, and reduces the cost of manually writing rules. The automatically constructed detection rules solve the problem that the traditional model-based detection mode cannot repair false positives; the detection rules can also be quickly modified for false positive or false negative cases and quickly brought back online, and they have good interpretability, overcoming the defects of slow iteration and poor interpretability of the traditional model-based detection mode.
On average, 4 new pieces of malware appear every second worldwide, and rapidly mutating malware greatly affects daily life. To enhance the ability to detect malware variants, sufficient knowledge of this malware is necessary. Accurately labeling suspicious malware in advance therefore helps security enterprises archive malware of different families and assists security analysts in further analyzing malicious behaviors and repairing vulnerabilities.
In practical application, the detection result information formats of various antivirus engines are different, and each antivirus engine defines different virus names and types, so that malicious family information of malicious software cannot be accurately identified.
For this reason, in some optional embodiments, a virus name knowledge graph may also be constructed based on the virus detection results of the plurality of antivirus engines on the malicious binary files corresponding to the massive malware. Thus, the malicious family information of the malicious software can be accurately identified by utilizing the virus name knowledge graph.
Based on the above, further optionally, the file identifiers and the virus name strings of the plurality of malicious binary files are analyzed to obtain an analysis result, where the analysis result includes a plurality of file identifier entities, a plurality of virus name word entities, an association relationship between the file identifier entities and the virus name word entities, and an association relationship between the virus name word entities; and constructing a virus name knowledge graph according to the analysis result.
In practical application, a large number of virus detection results output by a plurality of antivirus engines can be collected, and the large number of virus detection results are analyzed to obtain file identifications and virus name strings of a plurality of malicious binary files.
In this embodiment, the entities in the virus name knowledge graph include virus name word entities and file identification entities. In practical application, virus name word entities and file identification entities can be extracted from massive virus detection results based on NLP (Natural Language Processing) technology.
When performing entity extraction, the core of the extraction logic is to retain useful information while keeping the number of extracted entities appropriate; too many entities hinder knowledge graph construction and search. One or more of the following policies may be employed during entity extraction:
(1) Words (i.e., tokens) are segmented based on punctuation.
(2) Tokens are unified to lowercase.
(3) Meaningless tokens are excluded based on length screening or algorithmic screening.
(4) Tokens representing specific information, e.g., numbers and hash values, are filtered out.
(5) A token ending with three or more digits is considered to contain sample-specific information, and such tokens are rejected.
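The extraction policies above can be sketched in a single pass; the length threshold and filter patterns are illustrative assumptions rather than values specified by the patent:

```python
import re

def extract_tokens(name_string):
    """Apply the listed policies: split on punctuation, lowercase the
    tokens, drop short tokens, drop pure numbers and hash-like values,
    and drop tokens ending with three or more digits (sample-specific)."""
    tokens = [t.lower() for t in re.split(r"[^0-9A-Za-z]+", name_string) if t]
    kept = []
    for t in tokens:
        if len(t) < 3:                             # length screening
            continue
        if re.fullmatch(r"[0-9a-f]{32,}|\d+", t):  # hash values / numbers
            continue
        if re.search(r"\d{3,}$", t):               # ends with 3+ digits
            continue
        kept.append(t)
    return kept

# "12345" is a pure number and is rejected; "win32" ends with only two
# digits and is kept.
toks = extract_tokens("Trojan.Win32/Rozena.12345")
```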
In this embodiment, the relationship type of the association between two virus name word entities may be determined based on the conditional probabilities between them. Denoting any two virus name word entities as A and B, the association between virus name word entities can be defined as the following 5 categories:
(1) A = B, i.e., entity A and entity B are equivalent: when the conditional probability P(B|A) > 0.9 and the conditional probability P(A|B) > 0.9, the association between entity A and entity B is A = B.
(2) A ≠ B, i.e., entity A and entity B are unrelated: when P(B|A) < 0.1 and P(A|B) < 0.1, the association between entity A and entity B is A ≠ B.
(3) B ⊂ A, i.e., B is a subset of A: when P(B|A) < 0.9 and P(A|B) > 0.9, the association between entity A and entity B is B ⊂ A.
(4) A ⊂ B, i.e., A is a subset of B: when P(B|A) > 0.9 and P(A|B) < 0.9, the association between entity A and entity B is A ⊂ B.
(5) Weak dependency, i.e., none of the above four conditions holds.
Notably, P (b|a) refers to the probability that entity B occurs in the case where entity a has already occurred. P (a|b) refers to the probability that entity a appears in the case where entity B has already appeared.
In this embodiment, the collected massive virus name strings are analyzed: each virus name word entity is extracted from them, the number of occurrences of each entity is counted, and the number of times each pair of entities co-occurs in the same virus name string is counted.
As an example, P(B|A) = C_{A∩B} / C_A, where C_{A∩B} denotes the number of times entity A and entity B occur in the same virus name string, and C_A denotes the number of occurrences of entity A; that is, the ratio of C_{A∩B} to C_A is taken as the conditional probability P(B|A).
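The counting and relation-classification steps above can be sketched as follows. The thresholds 0.9 and 0.1 follow the text; the function names and the toy co-occurrence data are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# Sketch: estimate P(B|A) from co-occurrence counts, then map the two
# conditional probabilities to the five relation types described above.
def build_counts(token_lists):
    single, pair = Counter(), Counter()
    for tokens in token_lists:
        uniq = set(tokens)
        single.update(uniq)
        pair.update(frozenset(p) for p in combinations(sorted(uniq), 2))
    return single, pair

def relation(a, b, single, pair):
    c_ab = pair[frozenset((a, b))]
    p_b_given_a = c_ab / single[a]   # P(B|A) = C_{A∩B} / C_A
    p_a_given_b = c_ab / single[b]   # P(A|B) = C_{A∩B} / C_B
    if p_b_given_a > 0.9 and p_a_given_b > 0.9:
        return "A = B"
    if p_b_given_a < 0.1 and p_a_given_b < 0.1:
        return "A != B"
    if p_b_given_a < 0.9 and p_a_given_b > 0.9:
        return "B subset of A"
    if p_b_given_a > 0.9 and p_a_given_b < 0.9:
        return "A subset of B"
    return "weak dependency"

# Toy data: 'nitol' always co-occurs with 'trojan', but not vice versa.
names = [["trojan", "nitol"]] * 8 + [["trojan", "zbot"]] * 2 + [["adware"]]
single, pair = build_counts(names)
print(relation("nitol", "trojan", single, pair))
```

On this toy data P(trojan|nitol) = 1.0 and P(nitol|trojan) = 0.8, so the sketch reports that `nitol` is a subset of `trojan`, matching category (4) above.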
In this embodiment, the association between a file identification entity and a virus name word entity is an attribution relation. The attribution relation indicates that the virus name word entity corresponds to a word appearing in the virus name string associated with that file identification entity.
In this embodiment, after the virus name knowledge graph is constructed, the virus family information can be inferred by using the virus name knowledge graph. Specifically, for a binary file to be detected, firstly, virus detection is performed on the binary file to be detected by using a plurality of antivirus engines respectively, so as to obtain a plurality of virus name character strings. Next, a plurality of virus name word entities are extracted from the plurality of virus name character strings. And then, inputting a plurality of virus name word entities into a virus name knowledge graph for reasoning to obtain family names of the binary files to be detected.
When the virus name knowledge graph performs family name reasoning, the priority scores of the plurality of virus name word entities corresponding to the binary file to be detected are determined first, and then one or more virus name word entities are selected according to the priority scores to serve as the family name of the binary file to be detected. For example, the virus name word entity with the highest priority score is selected as the family name. Alternatively, the entities are ranked in descending order of priority score, and several top-ranked entities are selected as family names of the binary file to be detected; this is not limited.
In practical application, the priority score of each virus name word entity may be determined based on one or more of the weight information, importance, and generality of the entity, which is not limited.
In this embodiment, the weight information of a virus name word entity characterizes the frequency with which the entity appears in the plurality of virus name strings of the binary file to be detected, and may be represented by the number of times the entity is extracted from those strings. Assuming that the i-th virus name word entity is denoted token_i (i is a positive integer) and its weight information is denoted Wi, then Wi = log(b × Tni), where Tni is the number of times token_i is extracted from the virus name strings of the binary file to be detected; b is an empirical value set as needed (e.g., b = 1.8), and the smaller b is, the greater the influence of the count Tni on the weight information; log denotes the logarithm.
In this embodiment, the importance of a virus name word entity characterizes the range over which the entity appears, and may be computed from the number of occurrences of the entity in the virus name knowledge graph and the number of files containing it. Assuming the importance of the i-th virus name word entity token_i is denoted Ii, then Ii = Tci / Tfi, where Tci is the number of occurrences of token_i in the virus name knowledge graph and Tfi is the number of files containing token_i in the graph. It can be appreciated that Tfi can be determined from the associations between file identification entities and virus name word entities in the graph.
In this embodiment, the generality of a virus name word entity characterizes the probability of the entity being selected as a family name. Assuming the generality of the i-th virus name word entity token_i is denoted Gi, then Gi = Tsi / (N × C), where Tsi is the number of subsets of token_i in the virus name knowledge graph, N is the total number of virus name word entities in the graph, and C is a constant set as needed; that is, the ratio of Tsi to (N × C) is taken as Gi.
The manner of determining the priority score of each virus name word entity is not limited in this embodiment. For example, assuming the priority score of the i-th virus name word entity is denoted Tpi, then Tpi = Wi + Ii − Gi, i.e., Gi is subtracted from the sum of Wi and Ii to obtain Tpi.
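Putting the three factors together, the score Tpi = Wi + Ii − Gi can be written as a one-line helper. This is a direct transcription of the formulas above; the parameter names and the default values of b and C are illustrative assumptions.

```python
import math

# Sketch of the priority score Tpi = Wi + Ii - Gi described above.
# tn_i: count of token_i in the strings of the file under analysis
# tc_i: occurrences of token_i in the knowledge graph
# tf_i: files containing token_i in the knowledge graph
# ts_i: subsets of token_i in the knowledge graph
# n_total: total number of word entities; b, c: empirical constants
def priority_score(tn_i, tc_i, tf_i, ts_i, n_total, b=1.8, c=0.001):
    w_i = math.log(b * tn_i)        # weight: Wi = log(b * Tni)
    i_i = tc_i / tf_i               # importance: Ii = Tci / Tfi
    g_i = ts_i / (n_total * c)      # generality: Gi = Tsi / (N * C)
    return w_i + i_i - g_i

print(priority_score(tn_i=2, tc_i=100, tf_i=100, ts_i=0, n_total=1000))
```

A token extracted twice, with importance 1.0 and no subsets, scores log(3.6) + 1 ≈ 2.28; a very generic token with many subsets would be pulled down by the Gi term.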
In this embodiment, a method for constructing a virus name knowledge graph of malicious files is provided, and family name information is inferred using unsupervised graph probabilities. Because the approach is unsupervised, no labeled dataset needs to be provided in advance. Standardized family names can be generated automatically from a cold start, without expert knowledge (virus types, virus families, virus name collections, virus formats, etc.).
In this embodiment, the virus detection results provided by the plurality of antivirus engines are used as input data to generate a virus name knowledge graph representing the relationship between entities. This method enables the family name of the virus to be extracted without any prior knowledge.
In this embodiment, the relationships between virus name word entities are determined by the virus name knowledge graph, without relying on any specific or common word list, so the family name of each malicious binary file can be better represented by virus name words.
In this embodiment, without using an alias table or a dataset containing prior knowledge, a more common family name is found through the usage frequency or usage range of each token across the plurality of antivirus engines, thereby realizing fully automatic labeling and reducing operation and maintenance costs.
For a better understanding of family name recognition, the following description is provided in connection with the scene graph shown in FIG. 5.
Referring to fig. 5, in the virus name knowledge graph construction stage, massive binary files are input into a multi-engine platform to perform virus detection, multi-engine result information output by the multi-engine platform is extracted, the multi-engine result information comprises virus detection results corresponding to the binary files, and the virus detection results comprise virus name character strings and file identifications of the binary files. And extracting the relation between the entities from the multi-engine result information to obtain a plurality of file identification entities, a plurality of virus name word entities, the association relation between the file identification entities and the virus name word entities and the association relation between the virus name word entities. And constructing a virus name knowledge graph based on the extracted entity and the relation between the entities.
For a specific example, first, malicious files mal_1 and mal_2 are input to a multi-engine platform (comprising antivirus engine A and antivirus engine B). For malicious file mal_1, the detection result of antivirus engine A is a_s_1 = "PUP/win32.Downloadguide.r25632" and the detection result of antivirus engine B is b_s_1 = "Application.Bundler.Downloadguide.KF". For malicious file mal_2, the detection result of antivirus engine A is a_s_2 = "Trojan/win32.Nitol.v305521" and the detection result of antivirus engine B is b_s_2 = "Generic.Servstar.A.b3cd2481".
Then, entity extraction is performed. For the malicious file mal_1, the virus name word entities in the detection result of antivirus engine A are [win32, downloadguide], and those in the detection result of antivirus engine B are [application, bundler, downloadguide]. For the malicious file mal_2, the virus name word entities in the detection result of antivirus engine A are [trojan, win32, nitol], and those in the detection result of antivirus engine B are [generic, servstar]. In addition, the file identification entities need to be mined.
Then, the relationships between the entities are extracted. Namely, the relation between the virus name word entities and the relation between the file identification entity and the virus name word entities are extracted.
And finally, constructing a virus name knowledge graph based on the relationship between the entities.
Referring to fig. 5, in the family name recognition stage, a binary file to be detected is input into a multi-engine platform to perform virus detection, multi-engine result information output by the multi-engine platform is extracted, and a word entity for a virus name corresponding to the binary file to be detected is extracted from the multi-engine result information. And inputting the extracted virus name word entity into a virus name knowledge graph to perform family name recognition, so as to obtain the family name corresponding to the binary file to be detected.
For example, after the malicious file mal_3 whose family is to be identified is input to the multi-engine platform, the detection result of antivirus engine A is a_s_3 = "PUP/win32.Downloadguide.r78sd9we" and the detection result of antivirus engine B is b_s_3 = "Application.Bundler.Downloadguide.PTE".
Family names are then identified by reasoning over the virus name knowledge graph. Specifically, the priority score of each virus name word entity is calculated; for example, the priority score of win32 is -5, that of downloadguide is 2, that of application is -1, and that of bundler is -1. Finally, the token with the highest score, downloadguide, is taken as the family name.
Fig. 6 is a schematic structural diagram of a detection rule determining apparatus according to an embodiment of the present application. Referring to fig. 6, the apparatus may include:
the training module 61 is configured to perform model training by using the first sample file set to obtain a first decision tree; performing model training by using a second sample file set corresponding to each of the plurality of virus types to obtain a second decision tree corresponding to each of the plurality of virus types;
a rule determining module 62, configured to determine a malicious binary file detection rule according to the first decision tree; respectively determining virus type detection rules corresponding to the virus types according to second decision trees corresponding to the virus types;
wherein the sample data of the sample binary file in the first sample file set includes: splicing virus name character strings and malicious result labels for indicating whether the virus name character strings are malicious binary files, wherein the splicing virus name character strings are obtained by splicing a plurality of virus name character strings, and the plurality of virus name character strings are obtained by respectively carrying out virus detection on corresponding sample binary files by utilizing a plurality of antivirus engines; sample data of a sample binary file in the second sample file set includes: the virus name string and the virus type tag indicating the virus type are spliced.
Further optionally, the apparatus further includes an acquisition module, configured to: acquire an initial sample file set, wherein sample data of a sample binary file in the initial sample file set includes a spliced virus name string, a malicious score, and a virus type; obtain the first sample file set from the initial sample file set, wherein sample binary files whose malicious score is greater than or equal to a score threshold carry malicious result tags indicating malicious binary files, and sample binary files whose malicious score is less than the score threshold carry malicious result tags indicating non-malicious binary files; and classify the sample binary files whose malicious result tags indicate malicious binary files in the first sample file set, to obtain a second sample file set for each of the plurality of virus types.
Further optionally, before the first set of sample files is acquired from the initial set of sample files, the acquiring module is further configured to: if the same spliced virus name character string in the initial sample file set has a plurality of malicious scores, acquiring the sample number corresponding to each of the plurality of malicious scores; and correcting the malicious scores of the sample binary files with the same spliced virus name character strings in the initial sample file set according to the sample numbers corresponding to the malicious scores.
Further optionally, the acquiring module is further configured to: if the same spliced virus name character string in the initial sample file set has a plurality of virus types, acquiring the sample number corresponding to each of the plurality of virus types; and correcting the virus types of the sample binary files with the same spliced virus name character strings in the initial sample file set according to the sample numbers corresponding to the virus types.
Further optionally, the training module 61 is specifically configured to: constructing a first word segmentation word library according to a plurality of spliced virus name character strings acquired from a first sample file set; aiming at any sample binary file in the first sample file set, word segmentation processing is carried out on spliced virus name character strings of the sample binary file; vectorizing word segmentation results of the sample binary file by using the first word segmentation word stock to obtain a word vector sequence of the sample binary file; and performing model training by using word vector sequences of a plurality of sample binary files and malicious result labels to obtain a first decision tree.
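The training flow just described — building a word library from spliced strings, segmenting each string, vectorizing the segmentation result, and training a decision tree on the vectors and malicious-result labels — can be sketched roughly as follows. The use of scikit-learn and all sample strings and labels are illustrative assumptions, not the patent's implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy first-sample-file-set: spliced virus name strings with malicious labels.
concat_names = [
    "PUP/Win32.DownloadGuide Application.Bundler.DownloadGuide",
    "Trojan/Win32.Nitol Generic.ServStar",
    "clean clean",
    "clean benign",
]
labels = [1, 1, 0, 0]  # 1 = malicious binary file, 0 = not malicious

# Word library + word segmentation + vectorization in one step: the
# vectorizer's vocabulary plays the role of the first segmentation word library.
vectorizer = CountVectorizer(token_pattern=r"[A-Za-z]+", lowercase=True)
X = vectorizer.fit_transform(concat_names)   # word-vector representation

# First decision tree trained on word vectors and malicious-result labels.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
print(tree.predict(vectorizer.transform(["Trojan/Win32.Nitol x"]))[0])
```

On this toy data the tree learns a trivial split, so an unseen string containing trojan-family tokens is classified as malicious.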
Further optionally, the rule determining module 62 is specifically configured to: according to a first decision tree, determining a malicious binary file detection rule, including: judging whether the classification accuracy of the leaf nodes meets the specified condition or not according to the leaf nodes in the first decision tree, wherein the first decision tree also comprises a root node; if the specified condition is met, determining a regular expression corresponding to the leaf node according to the sub-regular expressions of a plurality of branches on the path from the root node to the leaf node; and combining regular expressions corresponding to at least one leaf node in the first decision tree to obtain a malicious binary file detection rule.
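The rule-extraction step above — concatenating the sub-regular-expressions along each root-to-leaf path, then combining the per-leaf expressions — might look like the following sketch. All patterns, names, and the lookahead encoding of branch conditions are hypothetical illustrations, not rules from the patent.

```python
import re

# Each root-to-leaf path contributes one sub-expression per branch:
# a lookahead requiring a token, or a negative lookahead forbidding one.
leaf_paths = [
    ["(?=.*trojan)", "(?=.*nitol)"],         # leaf 1: trojan AND nitol present
    ["(?=.*downloadguide)", "(?!.*clean)"],  # leaf 2: downloadguide, no 'clean'
]

# Regular expression of a leaf = concatenation of its branch sub-expressions;
# detection rule = OR-combination over the qualifying leaves.
leaf_exprs = ["".join(path) for path in leaf_paths]
rule = "|".join(f"(?:{expr}.*)" for expr in leaf_exprs)

print(bool(re.match(rule, "trojan/win32.nitol generic.servstar")))
```

A spliced virus name string then matches the combined rule exactly when it satisfies all branch conditions of at least one high-accuracy leaf.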
The apparatus shown in fig. 6 may perform the method shown in fig. 2, and its implementation principle and technical effects will not be described again. The specific manner in which the various modules and units perform the operations in the apparatus shown in fig. 6 in the above embodiments has been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Fig. 7 is a schematic structural diagram of a malicious binary file detection device according to an embodiment of the present application. Referring to fig. 7, the apparatus may include:
an acquisition module 71, configured to acquire a binary file to be detected;
a first detecting module 72, configured to perform virus detection on binary files to be detected by using a plurality of antivirus engines, so as to obtain a plurality of virus name strings;
the splicing module 73 is configured to splice the multiple virus name strings to obtain a target virus name string;
a second detection module 74, configured to detect the target virus name string by using a malicious binary file detection rule, so as to determine whether the binary file to be detected is a malicious binary file;
and a third detection module 75, configured to, in a case where the binary file to be detected is a malicious binary file, perform virus type detection on the target virus name string by using virus type detection rules corresponding to the multiple virus types, so as to determine the virus type of the binary file to be detected.
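The pipeline formed by modules 71–75 (acquire, multi-engine detect, splice, rule-match, type-match) can be sketched end to end as follows; the mock engine results and mock rules are assumptions for illustration only.

```python
import re

# Sketch of the detection flow: splice engine results into a target string,
# apply the malicious-file rule, then the per-type rules in order.
def detect(engine_results, malicious_rule, type_rules):
    target = " ".join(engine_results).lower()          # splicing module 73
    if not re.search(malicious_rule, target):          # second detection, 74
        return ("benign", None)
    for vtype, rule in type_rules.items():             # third detection, 75
        if re.search(rule, target):
            return ("malicious", vtype)
    return ("malicious", "unknown")

verdict = detect(
    ["Trojan/Win32.Nitol.v305521", "Generic.ServStar.A.b3cd2481"],
    r"trojan|nitol|servstar",                # mock malicious-file rule
    {"trojan": r"trojan", "worm": r"worm"},  # mock per-type rules
)
print(verdict)
```

Type detection runs only after the file is judged malicious, mirroring the two-stage design of modules 74 and 75.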
The determination manner of the malicious binary file detection rule and the virus type detection rule corresponding to each of the plurality of virus types can be seen from the foregoing.
Further optionally, the apparatus further includes: a family name recognition module, configured to extract a plurality of virus name word entities from the plurality of virus name strings, and to input the plurality of virus name word entities into the virus name knowledge graph for family name recognition.
Further optionally, the apparatus further includes: the map construction module is used for analyzing the file identifications and the virus name character strings of the plurality of malicious binary files to obtain analysis results, wherein the analysis results comprise a plurality of file identification entities, a plurality of virus name word entities, association relations between the file identification entities and the virus name word entities and association relations between the virus name word entities; and constructing a virus name knowledge graph according to the analysis result.
The apparatus shown in fig. 7 may perform the method shown in fig. 4, and its implementation principle and technical effects will not be described again. The specific manner in which the various modules and units perform operations in the apparatus shown in fig. 7 has been described in detail in the method embodiments and will not be repeated here.
It should be noted that, the execution subjects of each step of the method provided in the above embodiment may be the same device, or the method may also be executed by different devices. For example, the execution subject of steps 201 to 204 may be device a; for another example, the execution subject of steps 201 and 202 may be device a and the execution subject of steps 203 and 204 may be device B; etc.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations appearing in a specific order are included, but it should be clearly understood that the operations may be performed out of the order in which they appear herein or performed in parallel, the sequence numbers of the operations such as 201, 202, etc. are merely used to distinguish between the various operations, and the sequence numbers themselves do not represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device includes: a memory 81 and a processor 82;
memory 81 is used to store computer programs and may be configured to store various other data to support operations on the computing platform. Examples of such data include instructions for any application or method operating on a computing platform, contact data, phonebook data, messages, pictures, videos, and the like.
The Memory 81 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random-Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
A processor 82 coupled to the memory 81 for executing the computer program in the memory 81 for: a detection rule determining method or a malicious binary file detecting method.
Further, as shown in fig. 8, the electronic device further includes: a communication component 83, a display 84, a power component 85, an audio component 86, and other components. Only some components are schematically shown in fig. 8, which does not mean that the electronic device includes only those components. In addition, the components within the dashed box in fig. 8 are optional rather than mandatory, depending on the product form of the electronic device. The electronic device in this embodiment may be implemented as a terminal device such as a desktop computer, a notebook computer, a smart phone, or an IoT (Internet of Things) device, or as a server device such as a conventional server, a cloud server, or a server array. If the electronic device of this embodiment is implemented as a terminal device such as a desktop computer, notebook computer, or smart phone, it may include the components within the dashed box in fig. 8; if implemented as a server device such as a conventional server, cloud server, or server array, the components within the dashed box in fig. 8 may be omitted.
The detailed implementation process of each action performed by the processor may refer to the related description in the foregoing method embodiment or the apparatus embodiment, and will not be repeated herein.
Accordingly, the present application also provides a computer readable storage medium storing a computer program, where the computer program is executed to implement the steps executable by the electronic device in the above method embodiments.
Accordingly, embodiments of the present application also provide a computer program product comprising a computer program/instructions which, when executed by a processor, cause the processor to carry out the steps of the above-described method embodiments that are executable by an electronic device.
The communication component is configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device where the communication component is located may access a wireless network based on a communication standard, such as a mobile communication network of WiFi (Wireless Fidelity ), 2G (2 generation,2 generation), 3G (3 generation ), 4G (4 generation,4 generation)/LTE (long Term Evolution ), 5G (5 generation,5 generation), or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a near field communication (Near Field Communication, NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on radio frequency identification (Radio Frequency Identification, RFID) technology, infrared data association (The Infrared Data Association, irDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
The display includes a screen, which may include a liquid crystal display (Liquid Crystal Display, LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or sliding action, but also the duration and pressure associated with the touch or sliding operation.
The power supply component provides power for various components of equipment where the power supply component is located. The power components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the devices in which the power components are located.
The audio component described above may be configured to output and/or input an audio signal. For example, the audio component includes a Microphone (MIC) configured to receive external audio signals when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may be further stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (Central Processing Unit, CPUs), input/output interfaces, network interfaces, and memory.
The Memory may include non-volatile Memory in a computer readable medium, random access Memory (Random Access Memory, RAM) and/or non-volatile Memory, etc., such as Read Only Memory (ROM) or flash RAM. Memory is an example of computer-readable media.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, Phase Change RAM (PRAM), Static Random-Access Memory (SRAM), Dynamic Random-Access Memory (DRAM), other types of Random-Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium operable to store information accessible by the computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises an element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.
Claims (11)
1. A detection rule determining method, comprising:
performing model training by using a first sample file set to obtain a first decision tree;
performing model training by using a second sample file set corresponding to each of the plurality of virus types to obtain a second decision tree corresponding to each of the plurality of virus types;
determining a malicious binary file detection rule according to the first decision tree;
respectively determining virus type detection rules corresponding to the virus types according to second decision trees corresponding to the virus types;
wherein sample data of a sample binary file in the first sample file set includes: a spliced virus name character string and a malicious result label indicating whether the sample binary file is a malicious binary file, wherein the spliced virus name character string is obtained by splicing a plurality of virus name character strings, and the plurality of virus name character strings are obtained by respectively performing virus detection on the corresponding sample binary file by using a plurality of antivirus engines; and sample data of a sample binary file in the second sample file set includes: the spliced virus name character string and a virus type tag indicating the virus type.
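The sample construction of claim 1 can be sketched as follows. The separator character and the example engine outputs are illustrative assumptions; the claim does not fix a concrete splicing format:

```python
def build_spliced_name(engine_results):
    """Concatenate the virus name strings reported by multiple
    antivirus engines into a single spliced string.

    engine_results: list of virus name strings, one per engine
    (an empty string when an engine reports nothing).
    """
    # A fixed engine order keeps the spliced string reproducible;
    # the "|" separator is an assumption, not specified by the claim.
    return "|".join(engine_results)

# One training sample for the first sample file set: the spliced
# string plus a malicious result label (here 1 = malicious).
sample = {
    "spliced_name": build_spliced_name(
        ["Trojan.GenericKD.312", "Win32/Agent.ABC", "HEUR:Trojan.Win32.Generic"]
    ),
    "is_malicious": 1,
}
```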
2. The method of claim 1, further comprising, before performing model training by using the first sample file set to obtain the first decision tree:
obtaining an initial sample file set, wherein sample data of a sample binary file in the initial sample file set comprises: a spliced virus name character string, a malicious score and a virus type;
obtaining a first sample file set from the initial sample file set, wherein sample binary files with malicious scores greater than or equal to a score threshold in the first sample file set have malicious result tags indicating a malicious binary file, and sample binary files with malicious scores less than the score threshold in the first sample file set have malicious result tags indicating not a malicious binary file;
and classifying the sample binary files with the malicious result labels indicating the malicious binary files in the first sample file set to obtain a second sample file set of each of the plurality of virus types.
3. The method of claim 2, further comprising, prior to obtaining the first set of sample files from the initial set of sample files:
if the same spliced virus name character string in the initial sample file set has a plurality of malicious scores, acquiring the number of samples corresponding to each of the plurality of malicious scores;
and correcting the malicious scores of the sample binary files having the same spliced virus name character string in the initial sample file set according to the numbers of samples corresponding to the malicious scores.
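One plausible reading of claim 3 is a majority vote: when the same spliced string carries conflicting malicious scores, keep the score backed by the most samples. A minimal sketch under that assumption (the tie-breaking rule, insertion order, is also an assumption):

```python
from collections import Counter

def correct_scores(samples):
    """samples: list of (spliced_name, malicious_score) pairs.
    Returns a mapping from each spliced name to its corrected score:
    the score observed in the largest number of samples."""
    by_name = {}
    for name, score in samples:
        by_name.setdefault(name, Counter())[score] += 1
    # most_common(1) picks the score with the highest sample count.
    return {name: counts.most_common(1)[0][0] for name, counts in by_name.items()}

corrected = correct_scores([("a|b", 90), ("a|b", 90), ("a|b", 10), ("c|d", 5)])
# "a|b" keeps score 90 (2 samples) over score 10 (1 sample)
```

The same scheme applies to the virus-type correction of claim 4, with virus type labels in place of scores.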
4. The method of claim 3, further comprising:
if the same spliced virus name character string in the initial sample file set has a plurality of virus types, acquiring the number of samples corresponding to each of the plurality of virus types;
and correcting the virus types of the sample binary files having the same spliced virus name character string in the initial sample file set according to the numbers of samples corresponding to the virus types.
5. The method of any of claims 1 to 4, wherein model training using the first set of sample files results in a first decision tree comprising:
constructing a first word segmentation word stock according to a plurality of spliced virus name character strings acquired from the first sample file set;
for each sample binary file in the first sample file set, performing word segmentation on the spliced virus name character string of the sample binary file;
vectorizing the word segmentation result of the sample binary file by using the first word segmentation word stock to obtain a word vector sequence of the sample binary file;
and performing model training by using word vector sequences of the plurality of sample binary files and malicious result labels thereof to obtain the first decision tree.
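The word segmentation and vectorization steps of claim 5 can be sketched in pure Python. The delimiter set and lowercasing are assumptions; the resulting count vectors, together with the malicious result labels, would then feed a standard decision tree learner (e.g. CART):

```python
import re

def tokenize(spliced_name):
    # Split a spliced virus name string on common delimiters
    # ("|", "/", ".", ":", "-") to obtain virus name words.
    return [t.lower() for t in re.split(r"[|/.:\-]", spliced_name) if t]

def build_lexicon(spliced_names):
    # The first word-segmentation lexicon: every token seen in the
    # training set, mapped to a fixed vector index.
    vocab = sorted({tok for name in spliced_names for tok in tokenize(name)})
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(spliced_name, lexicon):
    # Word-count vector over the lexicon; unknown tokens are dropped.
    vec = [0] * len(lexicon)
    for tok in tokenize(spliced_name):
        if tok in lexicon:
            vec[lexicon[tok]] += 1
    return vec

lexicon = build_lexicon(["Trojan.Gen|Win32/Agent", "Clean"])
features = vectorize("Trojan.Gen", lexicon)
```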
6. The method of any one of claims 1 to 5, wherein determining the malicious binary file detection rule according to the first decision tree comprises:
for each leaf node in the first decision tree, judging whether the classification accuracy of the leaf node meets a specified condition, wherein the first decision tree further comprises a root node;
if the specified condition is met, determining a regular expression corresponding to the leaf node according to sub-regular expressions of a plurality of branches on a path from the root node to the leaf node;
and combining regular expressions corresponding to at least one leaf node in the first decision tree to obtain the malicious binary file detection rule.
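One way to realize claim 6 is to turn each branch test on a root-to-leaf path into a lookahead sub-expression and concatenate them. This is a sketch under the assumption that each branch tests token presence or absence; the exact sub-expression form is not fixed by the claim:

```python
import re

def path_to_regex(branch_conditions):
    """branch_conditions: list of (token, required) pairs describing
    the branch tests on the path from the root to one leaf.
    required=True  -> the token must appear in the spliced string;
    required=False -> it must not appear.
    Each condition becomes a lookahead sub-expression; their
    concatenation is the regular expression for that leaf."""
    parts = []
    for token, required in branch_conditions:
        if required:
            parts.append(r"(?=.*\b{}\b)".format(re.escape(token)))
        else:
            parts.append(r"(?!.*\b{}\b)".format(re.escape(token)))
    return "^" + "".join(parts)

# Leaf reached via "contains 'trojan', does not contain 'clean'":
rule = path_to_regex([("trojan", True), ("clean", False)])
```

Regular expressions from several qualifying leaves can then be combined as alternatives, e.g. `(?:rule1|rule2)`, to form one detection rule.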
7. A malicious binary file detection method, comprising:
acquiring a binary file to be detected;
respectively carrying out virus detection on the binary file to be detected by using a plurality of antivirus engines to obtain a plurality of virus name character strings;
splicing the plurality of virus name character strings to obtain a target virus name character string;
detecting the target virus name character string by using a malicious binary file detection rule to determine whether the binary file to be detected is a malicious binary file;
under the condition that the binary file to be detected is a malicious binary file, performing virus type detection on the target virus name character string by using virus type detection rules corresponding to a plurality of virus types respectively, so as to determine the virus type of the binary file to be detected;
wherein the malicious binary file detection rule and the virus type detection rule corresponding to each of the plurality of virus types are determined according to the method of any one of claims 1-6.
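The two-stage detection flow of claim 7 can be sketched as follows. The rules shown here are hand-written placeholders; in the described method they would be derived from the trained decision trees, not authored manually:

```python
import re

# Hypothetical rules standing in for the learned ones.
MALICIOUS_RULE = r"^(?=.*\btrojan\b)"
TYPE_RULES = {
    "trojan": r"^(?=.*\btrojan\b)",
    "worm": r"^(?=.*\bworm\b)",
}

def detect(engine_names):
    """engine_names: virus name strings from several antivirus engines.
    Returns (is_malicious, virus_type_or_None)."""
    target = "|".join(engine_names).lower()      # spliced target string
    if not re.search(MALICIOUS_RULE, target):    # malicious-file rule first
        return False, None
    for vtype, rule in TYPE_RULES.items():       # then per-type rules
        if re.search(rule, target):
            return True, vtype
    return True, None
```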
8. The method as recited in claim 7, further comprising:
extracting a plurality of virus name word entities from the plurality of virus name character strings;
and inputting the plurality of virus name word entities into a virus name knowledge graph to identify a family name.
9. The method according to claim 8, wherein the virus name knowledge graph is constructed in the following manner:
analyzing file identifications and virus name character strings of a plurality of malicious binary files to obtain analysis results, wherein the analysis results comprise a plurality of file identification entities, a plurality of virus name word entities, association relations between the file identification entities and the virus name word entities and association relations between the virus name word entities;
and constructing the virus name knowledge graph according to the analysis result.
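The entities and relations of claim 9 can be held in a simple adjacency structure. This sketch assumes the analysis step yields, per file, a file identification and its virus name word entities, and that word-word association means co-occurrence within one file's names:

```python
from collections import defaultdict

def build_virus_name_graph(records):
    """records: list of (file_id, virus_name_words) pairs from the
    analysis step. Builds two relation sets: file-id <-> word
    associations and word <-> word co-occurrence associations."""
    file_to_words = defaultdict(set)
    word_links = defaultdict(set)
    for file_id, words in records:
        file_to_words[file_id].update(words)
        for w in words:
            # Link each word entity to the other words seen
            # alongside it for the same malicious file.
            word_links[w].update(x for x in words if x != w)
    return file_to_words, word_links
```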
10. An electronic device, comprising: a memory and a processor; the memory is used for storing a computer program; the processor is coupled to the memory for executing the computer program for performing the steps in the method of any of claims 1-9.
11. A computer readable storage medium storing a computer program, which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310597956.3A CN116821903A (en) | 2023-05-24 | 2023-05-24 | Detection rule determination and malicious binary file detection method, device and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116821903A true CN116821903A (en) | 2023-09-29 |
Family
ID=88119400
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310597956.3A Pending CN116821903A (en) | 2023-05-24 | 2023-05-24 | Detection rule determination and malicious binary file detection method, device and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116821903A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117034275A (en) * | 2023-10-10 | 2023-11-10 | 北京安天网络安全技术有限公司 | Malicious file detection method, device and medium based on Yara engine |
CN117034275B (en) * | 2023-10-10 | 2023-12-22 | 北京安天网络安全技术有限公司 | Malicious file detection method, device and medium based on Yara engine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||