CN116781396B

CN116781396B - Method, apparatus, device and storage medium for attack behavior detection

Info

Publication number: CN116781396B
Application number: CN202310900309.5A
Authority: CN
Inventors: 马骏; 陈越; 方欣
Original assignee: Beijing Volcano Engine Technology Co Ltd
Current assignee: Beijing Volcano Engine Technology Co Ltd
Priority date: 2023-07-20
Filing date: 2023-07-20
Publication date: 2024-12-17
Anticipated expiration: 2043-07-20
Also published as: US20250030707A1; CN116781396A; WO2025016384A1

Abstract

According to embodiments of the present disclosure, a method, apparatus, device, and storage medium for attack behavior detection are provided. The method comprises: obtaining attack behavior data and non-attack behavior data, wherein the attack behavior data is collected from at least one honeypot host and the non-attack behavior data is collected from at least one user host; screening keywords from the attack behavior data according to keywords in the non-attack behavior data and keywords in the attack behavior data to generate at least one attack behavior detection rule, wherein each attack behavior detection rule comprises at least one keyword characterizing an attack behavior; and performing attack behavior detection on a target user host based on the at least one attack behavior detection rule. In this way, unknown attack behaviors can be detected automatically and efficiently, improving both the detection capability and the detection accuracy for attack behaviors.

Description

Method, apparatus, device and storage medium for attack behavior detection

Technical Field

Example embodiments of the present disclosure relate generally to the field of network security, and in particular, relate to a method, apparatus, device, and computer-readable storage medium for attack behavior detection.

Background

A honeypot, such as a Secure Shell (SSH) high-interaction honeypot, is a network security tool for trapping attacks by intruders (e.g., hackers). The honeypot may simulate a real server or computer that accepts remote logins over the SSH protocol, luring intruders into logging in and performing operations so that data such as the intruders' behavioral activity (e.g., the operation commands they execute) can be collected and recorded. Using the behavior data collected by honeypots generally requires manual analysis of the intruders' attack techniques and relies on expert knowledge to design different security policies, so intrusion prevention is inefficient. A solution capable of detecting intrusion behavior in a timely and accurate manner is therefore desired.

Disclosure of Invention

In a first aspect of the disclosure, a method for attack behavior detection is provided. The method comprises: obtaining attack behavior data and non-attack behavior data, wherein the attack behavior data is collected from at least one honeypot host and the non-attack behavior data is collected from at least one user host; screening keywords from the attack behavior data according to keywords in the non-attack behavior data and keywords in the attack behavior data to generate at least one attack behavior detection rule, wherein each attack behavior detection rule comprises at least one keyword characterizing an attack behavior; and performing attack behavior detection on a target user host based on the at least one attack behavior detection rule.

In a second aspect of the disclosure, an apparatus for attack behavior detection is provided. The apparatus comprises: a data acquisition module configured to acquire attack behavior data and non-attack behavior data, the attack behavior data being collected from at least one honeypot host and the non-attack behavior data being collected from at least one user host; a rule generation module configured to screen keywords from the attack behavior data according to keywords in the non-attack behavior data and keywords in the attack behavior data to generate at least one attack behavior detection rule, wherein each attack behavior detection rule comprises at least one keyword characterizing an attack behavior; and a behavior detection module configured to perform attack behavior detection on a target user host based on the at least one attack behavior detection rule.

In a third aspect of the present disclosure, an electronic device is provided. The device includes at least one processing unit, and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the device to perform the method of the first aspect.

In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The computer readable storage medium has stored thereon a computer program executable by a processor to implement the method of the first aspect.

It should be understood that what is described in this section is not intended to identify key or essential features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference numerals denote like or similar elements, in which:

FIG. 1 illustrates a schematic diagram of an example network environment in which embodiments of the present disclosure may be implemented;

FIG. 2 illustrates a schematic diagram of an attack behavior detection system according to some embodiments of the present disclosure;

FIG. 3 illustrates a schematic diagram of an initialization subsystem processing non-attack behavior data, according to some embodiments of the present disclosure;

FIG. 4 illustrates a schematic diagram of an instruction aggregation module generating a user instruction sequence set in accordance with some embodiments of the present disclosure;

FIG. 5 illustrates a schematic diagram of a keyword extraction module processing a user instruction sequence set in accordance with some embodiments of the present disclosure;

FIG. 6 illustrates a schematic diagram of an attack behavior determination subsystem according to some embodiments of the present disclosure;

FIG. 7 illustrates a schematic diagram of a keyword extraction module processing a honeypot instruction sequence set in accordance with some embodiments of the present disclosure;

FIG. 8 illustrates a schematic diagram of compressing and merging keyword sets by a keyword cluster compression module in accordance with some embodiments of the present disclosure;

FIG. 9 illustrates a flow diagram for generating detection rules according to some embodiments of the present disclosure;

FIG. 10 illustrates a flow diagram of detection rule publication and adjustment according to some embodiments of the present disclosure;

FIG. 11 illustrates a flowchart of a method for attack behavior detection according to some embodiments of the present disclosure;

FIG. 12 illustrates a schematic block diagram of an apparatus for attack behavior detection according to some embodiments of the present disclosure;

Fig. 13 illustrates a block diagram of an electronic device in which one or more embodiments of the disclosure may be implemented.

Detailed Description

It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, usage scenarios, etc. of the personal information involved in the present disclosure, and the user's authorization should be obtained.

For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly inform the user that the requested operation will require the user's personal information to be obtained and used. Based on the prompt information, the user can then decide autonomously whether to provide personal information to the software or hardware, such as an electronic device, an application program, a server, or a storage medium, that performs the operations of the technical solution of the present disclosure.

As an optional but non-limiting implementation, in response to receiving an active request from a user, the prompt information may be sent to the user by means of, for example, a pop-up window, in which the prompt information may be presented as text. In addition, the pop-up window may carry a selection control with which the user chooses "agree" or "disagree" to providing personal information to the electronic device.

It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.

It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been illustrated in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather, these embodiments are provided so that this disclosure will be more thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.

It should be noted that any section/subsection headings provided herein are not limiting. Various embodiments are described throughout this document, and any type of embodiment may be included under any section/subsection. Furthermore, the embodiments described in any section/subsection may be combined in any manner with any other embodiment described in the same section/subsection and/or in a different section/subsection.

In describing embodiments of the present disclosure, the term "comprising" and the like should be understood as open-ended, i.e., "including, but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions are also possible below.

In the field of network security, there are at least the following ways to use the attack behavior data collected by the honeypot (or the honeypot host):

Contributing intelligence: collecting indicators of compromise (IOCs), e.g., file hashes, C2 domain names, IP addresses, and file names, as threat intelligence.

Collecting malicious samples: collecting malicious samples dropped by intruders, such as Trojans, cryptocurrency miners, and backdoors.

Depicting the intrusion situation: sensing possible attack behaviors and trends in cyberspace through the honeypot data.

Collecting attack techniques: helping a security team better understand the strategies and means of intruders through the collected attack techniques, so that corresponding security measures can be adopted and network security defenses strengthened.

All of the above uses of honeypot behavior data require manual intervention to analyze the intruders' attack techniques and then rely on expert knowledge to design different security policies. However, for honeypots, and SSH high-interaction honeypots in particular, the collected operation behaviors are mainly Shell instructions executed by intruders. These instructions are generally complex and lack a uniform format, which makes standardized and automated processing of the collected operation behaviors difficult.

A Host Intrusion Detection System (HIDS) may collect various data, including system operation data and user behavior data, and perform intrusion detection according to pre-configured detection rules. A detection rule may raise an alarm on exactly matching data in a blacklist manner, may raise an alarm on behaviors that conform to a certain feature pattern in a pattern-matching manner, or may implement more complex detection logic to achieve a higher detection rate and a lower false alarm rate. HIDS detection rules for malicious commands are mainly developed manually by security professionals for different intrusion modes or intrusion behaviors, so that the rules can match and detect the Shell instruction or Shell instruction sequence corresponding to each intrusion behavior.

Unlike IOCs, Shell instructions are complex and diverse, and their parameters and formats allow great freedom. As a result, current detection schemes, such as a Network Intrusion Detection System (NIDS) and an Industrial Intrusion Detection System (IIDS), can be bypassed by instructions that an intruder has carefully crafted. Moreover, a pattern-matching detection scheme that has not been rigorously designed and tested is prone to a large number of false positives.

It is therefore desirable to be able to provide improved solutions for enabling attack detection.

In embodiments of the present disclosure, an improved scheme for attack behavior detection is presented. The scheme automatically identifies attack patterns from the attack behavior data collected by honeypots, using the non-attack behavior data collected from user hosts (which can characterize non-attack behavior) as a reference, and generates corresponding detection rules for subsequent attack behavior detection. This can improve the efficiency of recognizing attack behaviors, as well as the detection capability and detection accuracy.

In some embodiments, the deployed honeypot hosts can continuously collect attack behavior data. Through the above process, the detection rules can be updated automatically and in a timely manner so that more unknown or novel attack behaviors are identified, which improves network security.

Fig. 1 illustrates a schematic diagram of an example network environment 100 in which embodiments of the present disclosure may be implemented. In this example network environment 100, a honeypot network 102 is deployed for trapping intruders' attacks. For example, attracted by the honeypot network 102, hacker 112 may use hacker host 110 to access the honeypot network 102. In this way, attack behavior data may be collected on the honeypot network 102. Additionally, HIDS 104 is deployed in the network environment 100 for collecting non-attack behavior data generated during normal user operation. In embodiments of the present disclosure, the attack behavior data and the non-attack behavior data are used to generate attack behavior detection rules suitable for performing attack behavior detection. The generated attack behavior detection rules may be used to detect attack behaviors on the user hosts in HIDS 104 or HIDS 106.

In the example network environment 100, the honeypot network 102 may include one or more honeypot hosts 120-1, 120-2, ..., 120-N (collectively, honeypot hosts 120 for ease of discussion), the HIDS 104 may include one or more user hosts 130-1, 130-2, ..., 130-M (collectively, user hosts 130 for ease of discussion), and the HIDS 106 may include one or more target user hosts 140-1, 140-2, ..., 140-K (collectively, target user hosts 140 for ease of discussion). In fig. 1, N, M and K can be any values. In some embodiments, HIDS 106 and HIDS 104 may be the same system, or may be separate systems.

The hacker host 110, honeypot hosts 120, user hosts 130, and target user hosts 140 may be servers or any type of computing-capable device, including terminal devices. A terminal device may be any type of mobile terminal, fixed terminal, or portable terminal, including a mobile handset, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, media computer, multimedia tablet, Personal Communication System (PCS) device, personal navigation device, Personal Digital Assistant (PDA), audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, electronic book device, game device, or any combination of the preceding, including accessories and peripherals for these devices.

It should be understood that the structure and function of environment 100 are described for illustrative purposes only and are not meant to suggest any limitation as to the scope of the disclosure.

Some example embodiments of the present disclosure are described below with reference to the accompanying drawings. Fig. 2 illustrates a schematic diagram of an attack behavior detection system 200 according to some embodiments of the present disclosure. For ease of discussion, the attack behavior detection system 200 will be described with reference to the environment 100 of FIG. 1.

As shown in fig. 2, the rule generation process for attack behavior detection can be divided into two phases. In the first phase, non-attack behavior data 205 is acquired from the HIDS 104, where the non-attack behavior data 205 may be collected from at least one user host in the HIDS 104. The non-attack behavior data 205 is input to the initialization subsystem 220 for use as reference behavior data. In the second phase, attack behavior data 210-1, 210-2, ..., 210-J (collectively, attack behavior data 210) is acquired, where the attack behavior data may be collected from at least one honeypot host. The attack behavior data 210 from the honeypot network 102 is provided to an attack behavior determination subsystem 230 for use in generating attack behavior detection rules. The different attack behavior data 210-1, 210-2, ..., 210-J may include data collected from the honeypot network 102 over different time periods.

Using the non-attack behavior data 205 as a reference, the attack behavior determination subsystem 230 may determine at least one attack behavior detection rule from the attack behavior data 210, where each attack behavior detection rule may correspond to a class of attack behaviors. In some embodiments, the attack behavior determination subsystem 230 outputs the attack behavior detection rules based on the input attack behavior data 210 and the output of the initialization subsystem 220. Specifically, the attack behavior determination subsystem 230 screens keywords from the attack behavior data 210 according to the keywords in the non-attack behavior data 205 and the keywords in the attack behavior data 210, and thereby generates at least one attack behavior detection rule, where each rule includes at least one keyword characterizing an attack behavior. Here, the non-attack behavior data 205 is considered to characterize normal, non-attack behavior and can therefore serve as reference data. When generating the attack behavior detection rules, keywords that are distinguishable from the non-attack behavior data and that characterize attack behaviors can be identified from the attack behavior data 210 by referring to the non-attack behavior data 205.

The determined attack detection rules are provided to the behavior detection subsystem 240 for use in performing detection of an attack on the target user host 140. The behavior detection subsystem 240 performs attack behavior detection on the target user host 140 based on at least one attack behavior detection rule.
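As a rough illustration of how a keyword-based rule might be applied to a target user host, the following Python sketch treats each rule as a set of keywords and reports a rule as matched when every one of its keywords occurs among the tokens of a session. The all-keywords-must-match semantics, the function name `match_rules`, and the rule names are assumptions made for this sketch only; the text above does not specify how the behavior detection subsystem 240 evaluates a rule.

```python
def match_rules(session_tokens, rules):
    """Return the names of rules whose keywords all occur in the session tokens.

    Assumption: a rule fires only when every keyword is present. The actual
    matching logic of the behavior detection subsystem is not specified here.
    """
    observed = set(session_tokens)
    return [
        name
        for name, keywords in rules.items()
        if keywords and keywords <= observed
    ]


# Hypothetical rules built from keywords that characterize attack behavior.
rules = {
    "rule-1": frozenset({"meow", "Selfrep", "arm5"}),
    "rule-2": frozenset({"chattr", "iae", "root"}),
}

# Tokens observed in one session on a target user host (illustrative values).
session_tokens = ["cd", "tmp", "meow", "Selfrep", "arm5"]
hits = match_rules(session_tokens, rules)
print(hits)
```

A stricter or looser policy (for example, a minimum fraction of matched keywords) could be substituted without changing the overall structure.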

The attack behavior data 210 may include instruction sequences (e.g., Shell instruction sequences) collected by one or more honeypot hosts 120 in the honeypot network 102, and may also be referred to as honeypot behavior data. Because it is collected on honeypot hosts, the attack behavior data 210 represents data in which abnormal behavior may exist, such as data resulting from malicious behavior by a malicious user (e.g., hacker 112). Table 1 below shows exemplary attack behavior data 210.

TABLE 1 exemplary attack behavior data

The non-attack behavior data 205 may represent instruction sequences (e.g., Shell instruction sequences) collected by one or more user hosts 130 in the HIDS 104. It may also be referred to as normal behavior data, i.e., data resulting from the normal (non-malicious) behavior of normal (non-malicious) users.

In the first stage, the non-attack behavior data 205 from HIDS 104 is input to the initialization subsystem 220, which performs data preprocessing and statistics for use by the attack behavior determination subsystem 230. The first-stage processing of the initialization subsystem 220 is described below in connection with figs. 3-5.

Fig. 3 illustrates the structure of the initialization subsystem 220 and its processing of example non-attack behavior data 205, according to some embodiments of the present disclosure. As shown in fig. 3, the initialization subsystem 220 includes an instruction aggregation module 320 and a keyword extraction module 330.

The non-attack behavior data 205 includes user instruction sequences collected from at least one user host 130 of the HIDS 104. In some embodiments, multiple user instruction sequence sets may be collected separately in multiple sessions on the at least one user host, e.g., one session for each user instruction sequence set. In some embodiments, a user instruction sequence may include Shell instructions, and may also include other instructions. As shown in FIG. 3, example non-attack behavior data 315 includes: host ID 1, session ID 100, Shell instruction "ls"; host ID 1, session ID 100, Shell instruction "pwd"; and host ID 2, session ID 200, Shell instruction "history". Behavior data with the same host ID (also referred to as a host identifier) may be taken as generated by the same user host, and behavior data with the same session ID (also referred to as a session identifier) may be taken as generated by the same session.

The instruction aggregation module 320 is configured to aggregate the non-attack behavior data 205. For example, the user instruction sequences in the non-attack behavior data 205 may be aggregated by host identifier (ID) and session ID to obtain user instruction sequence sets 325 (Shell instruction sequence sets in this example) grouped by host ID and session ID. The user instructions in each user instruction sequence set share the same host ID and session ID, indicating that they all come from the same host and the same session. The example shown in fig. 3 includes user instruction sequence set 1: {"ls", "pwd", ...}, in which the Shell instructions "ls" and "pwd" both come from the same host and the same session (host ID 1 and session ID 100 in this example); the remaining user instruction sequence sets are similar.

Fig. 4 gives an example of how the instruction aggregation module 320 aggregates the example non-attack behavior data 205. In the example of FIG. 4, the example non-attack behavior data 205 in the data stream includes instruction sequences 410, 415, 420, and 425 from respective sessions of respective user hosts, namely: host ID 1, session ID 100, Shell instruction "ls"; host ID 2, session ID 112, Shell instruction "rm test.sh"; host ID 3, session ID 1321, Shell instruction "history -c"; and host ID 1, session ID 100, Shell instruction "pwd". The example non-attack behavior data 205 is input into the instruction aggregation module 320, which produces multiple user instruction sequence sets, where the instructions contained in each set come from the same user host and the same session. For example, in FIG. 4, the instructions "ls" and "pwd" from host ID 1, session ID 100 are included in user instruction sequence set 1 430; the instruction "rm test.sh" from host ID 2, session ID 112 is included in user instruction sequence set 2 435; the instruction "history -c" from host ID 3, session ID 1321 is included in user instruction sequence set 3 440; and so on.
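The aggregation step described for the instruction aggregation module 320 can be sketched in Python as a grouping of records by the pair (host ID, session ID). This is a minimal sketch under stated assumptions: the record layout as `(host_id, session_id, instruction)` tuples and the function name `aggregate_by_session` are hypothetical, and the spacing inside the instruction strings is assumed.

```python
from collections import defaultdict


def aggregate_by_session(records):
    """Group instructions by (host_id, session_id), keeping arrival order."""
    groups = defaultdict(list)
    for host_id, session_id, instruction in records:
        groups[(host_id, session_id)].append(instruction)
    return dict(groups)


# Records modeled on the Fig. 4 example data stream.
records = [
    (1, 100, "ls"),
    (2, 112, "rm test.sh"),
    (3, 1321, "history -c"),
    (1, 100, "pwd"),
]

sequence_sets = aggregate_by_session(records)
for key, instructions in sequence_sets.items():
    print(key, instructions)
```

Keying on the pair rather than on the session ID alone keeps sessions from different hosts apart even when their session IDs happen to coincide.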

In some embodiments, each user instruction sequence set may be considered a behavior description document, and multiple behavior description documents corresponding to all non-attack behavior data 205 may be obtained. In some embodiments, the behavior description document (i.e., the user instruction sequence set 430, 435, 440, etc.) corresponding to the non-attack behavior data 205 is input to the keyword extraction module 330. Fig. 5 illustrates an example structure of the keyword extraction module 330 and processing of an example set of user instruction sequences in accordance with some embodiments of the present disclosure. As shown in fig. 5, the keyword extraction module 330 may include a word segmentation module 510 and a word frequency statistics module 530.

The word segmentation module 510 may be configured to perform word segmentation (e.g., English word segmentation) on the plurality of user instruction sequence sets in the non-attack behavior data 205. Word segmentation refers to dividing a text sequence into individual words or phrases. As shown in FIG. 5, the instructions "ls" and "pwd" in user instruction sequence set 1 430 are segmented by the word segmentation module 510 into character strings 515 "ls" and "pwd"; the instruction "rm test.sh" in user instruction sequence set 2 435 is segmented into character strings 520 "rm", "test" and "sh"; and the instruction "history -c" in user instruction sequence set 3 440 is segmented into character strings 525 "history" and "-c". After word segmentation, the resulting character strings can be cleaned, for example by removing all IP addresses and pure numbers (i.e., strings composed entirely of digits), so that irrelevant content does not interfere with subsequent keyword extraction. Because keywords such as file names may in some cases combine letters and digits, such combinations can be retained during cleaning to avoid incorrect removal. After the word segmentation, or after the cleaning, a reference keyword set including a plurality of reference keywords (e.g., character strings) may be obtained.
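The segmentation and cleaning steps can be illustrated with the following Python sketch. The splitting rule (whitespace and "/" first, then "."), the IPv4-only address pattern, and the function name `segment_and_clean` are assumptions chosen so that the Fig. 5 examples come out as described; the source does not define the actual tokenizer.

```python
import re

# Assumed patterns: dotted-quad IPv4 addresses and digit-only strings.
IPV4_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")
NUMBER_RE = re.compile(r"\d+")


def segment_and_clean(instruction):
    """Split an instruction into tokens, dropping IP addresses and pure numbers."""
    tokens = []
    for chunk in re.split(r"[\s/]+", instruction):
        # Drop empty chunks and whole IP addresses before splitting on dots.
        if not chunk or IPV4_RE.fullmatch(chunk):
            continue
        for part in chunk.split("."):
            # Keep mixed letter/digit strings such as "log2"; drop pure numbers.
            if part and not NUMBER_RE.fullmatch(part):
                tokens.append(part)
    return tokens


print(segment_and_clean("rm test.sh"))
print(segment_and_clean("history -c"))
print(segment_and_clean("ping 10.0.0.5 -c 4"))
print(segment_and_clean("cat log2.txt"))
```

Checking for IP addresses before splitting on "." matters here: splitting first would break an address into four pure numbers and hide the fact that it was an address.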

The word frequency statistics module 530 is configured to compute statistics on the reference keywords. In particular, the word frequency statistics module 530 may count the total number of behavior description documents (i.e., user instruction sequence sets) and the number of behavior description documents containing each reference keyword. In the example shown in fig. 5, the total number 535 of behavior description documents is 18888, the number 540 of behavior description documents containing "ls" is 3200, the number 545 of behavior description documents containing "pwd" is 1800, and the number 550 of behavior description documents containing "rm" is 2800.
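The two statistics described above, the total number of behavior description documents and the number of documents containing each reference keyword, can be sketched as follows. The function name `document_statistics` and the toy documents are hypothetical; each document is assumed to be the token list of one user instruction sequence set.

```python
from collections import Counter


def document_statistics(documents):
    """Return (total document count, per-keyword document frequency)."""
    total = 0
    doc_freq = Counter()
    for tokens in documents:
        total += 1
        # A keyword counts once per document, however often it repeats.
        doc_freq.update(set(tokens))
    return total, doc_freq


# Toy behavior description documents (already segmented and cleaned).
docs = [
    ["ls", "pwd", "ls"],
    ["rm", "test", "sh"],
    ["history", "-c"],
    ["ls"],
]

total, doc_freq = document_statistics(docs)
print(total, doc_freq["ls"], doc_freq["rm"])
```

Converting each token list to a set before counting is what makes this a document count rather than a raw occurrence count.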

Note that the specific instruction sequences and their statistics shown in fig. 3-5 are examples given for illustrative purposes only and are not meant to limit the scope of embodiments of the present disclosure in any way.

Through the above steps, the first-stage initialization of the keyword extraction module is completed: the keyword extraction module constructs the reference keyword set, counts the number of user instruction sequence sets in the non-attack behavior data 205, and counts the number of user instruction sequence sets containing each reference keyword. The statistical information obtained from the non-attack behavior data 205 may be used for subsequently screening representative keywords from the attack behavior data 210.

In the second phase, the attack behavior data 210 collected from the honeypot network 102 may be converted into rules for attack behavior detection. The processing performed in the second phase is described below with reference to figs. 6 to 8.

Fig. 6 illustrates an example structure of the attack behavior determination subsystem 230 and a schematic diagram of its processing of example attack behavior data 210, according to some embodiments of the present disclosure. The attack behavior determination subsystem 230 may convert the attack behavior data 210 into attack behavior detection rules 655 for attack behavior detection. As shown in fig. 6, the attack behavior determination subsystem 230 includes an instruction aggregation module 620, a keyword extraction module 630, a keyword cluster compression module 640, a detection rule generation module 650, and a detection rule publishing and adjustment module 660.

For convenience of the following discussion, it is assumed that the attack behavior data 210 includes: host ID 1, session ID 365, Shell instruction "chattr -iae /root/.ssh"; host ID 2, session ID 365, Shell instruction "/meow meow.Selfrep.arm5"; and so on.

The instruction aggregation module 620 is configured to aggregate the attack behavior data 210 by host ID and session ID, resulting in honeypot instruction sequence sets (e.g., honeypot instruction sequence set 625) grouped by host ID and session ID. The instruction aggregation module 620 has the same functionality as the instruction aggregation module 320 used in the first stage. In the example of FIG. 6, the aggregation operation yields honeypot instruction sequence set 1: {"/meow meow.Selfrep.arm5", Shell instruction 2, Shell instruction 3, ...}.

In some embodiments, multiple honeypot instruction sequence sets are determined from the attack behavior data 210, and these sets may be collected separately in multiple sessions on at least one honeypot host. For example, each honeypot instruction sequence set may include the instruction sequences from one session on one honeypot host. A plurality of candidate keyword sets are extracted from the plurality of honeypot instruction sequence sets of the attack behavior data 210, respectively. In some embodiments, all character strings of each honeypot instruction sequence set (e.g., the Shell instructions included in the set) may be treated as one behavior description document, so that all behavior description documents corresponding to the attack behavior data 210 are obtained. The behavior description documents corresponding to the attack behavior data 210 may be input to the keyword extraction module 630 to obtain the candidate keyword sets 635. By referring to the keywords in the non-attack behavior data 205, the keyword cluster compression module 640 may determine the frequency of occurrence of each keyword of the plurality of candidate keyword sets in the non-attack behavior data and the attack behavior data, and determine at least one target keyword set 645 from the plurality of candidate keyword sets based on the frequency of occurrence of each keyword. Each target keyword set 645 includes at least one keyword that can be used to characterize an attack behavior.

Fig. 7 presents a specific example of the keyword extraction module 630, showing a schematic diagram of how the keyword extraction module 630 processes honeypot instruction sequence sets in accordance with some embodiments. As shown in fig. 7, the keyword extraction module 630 may include a word segmentation module 720, a statistics update module 735, an importance calculation module 760, and a keyword determination module 775. The word segmentation module 720 is configured to perform word segmentation (e.g., English word segmentation) on the behavior description documents corresponding to the attack behavior data 210, in a manner similar to the word segmentation of the first stage. As shown in FIG. 7, the instruction "/meow meow.Selfrep.arm5" included in honeypot instruction sequence set 1 710 is segmented by the word segmentation module 720 to obtain candidate keyword set 725 containing "meow", "Selfrep" and "arm5", and the instruction "chattr -iae /root/.ssh" included in honeypot instruction sequence set 2 715 is segmented by the word segmentation module 720 to obtain candidate keyword set 730 containing "chattr", "iae" and "root".
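The word segmentation of an instruction can be illustrated with the sketch below. The tokenization rule (runs of ASCII letters and digits) is an assumption chosen so that the first example of FIG. 7 is reproduced; the disclosure does not fix a particular segmentation rule, and the names `segment` and `candidate_keywords` are hypothetical.

```python
import re

# Assumed rule: a token is a maximal run of ASCII letters and digits.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")

def segment(instruction):
    """Split one shell instruction into tokens, keeping duplicates."""
    return TOKEN_PATTERN.findall(instruction)

def candidate_keywords(instructions):
    """Return the distinct tokens of a session in first-seen order."""
    seen = {}
    for instruction in instructions:
        for token in segment(instruction):
            seen.setdefault(token, None)
    return list(seen)

keywords = candidate_keywords(["/meow meow.Selfrep.arm5"])
```

The duplicate-preserving output of `segment` is what a word-frequency count would consume, while `candidate_keywords` yields the de-duplicated candidate keyword set.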

In some embodiments, the statistics update module 735 is configured to update the statistics of the total number of behavior description documents, the number of behavior description documents containing each word, and the word frequency of each word in the behavior description documents. The total number of behavior description documents may be updated by adding the total number counted from the non-attack behavior data 205 during the first-stage initialization to the total number currently counted from the attack behavior data 210 in the second stage. Similarly, the number of behavior description documents containing each word may be updated by adding the number counted from the non-attack behavior data 205 during the first-stage initialization to the number currently counted from the attack behavior data 210 in the second stage. The word frequency of each word may be obtained by counting the number of times the word appears in the current behavior description document of the second stage and dividing that count by the total number of words in the current behavior description document.

In the example shown in fig. 7, the total number 740 of updated behavior description documents is 18888+25 (wherein 18888 represents the total number of behavior description documents counted based on the non-attack behavior data 205 during the first phase initialization, 25 represents the total number of description documents counted based on the attack behavior data 210 at the second phase), and the number 745 of updated behavior description documents containing "ls" is 3200 +.
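The statistics update can be sketched as a pair of running counters that are first filled from the non-attack behavior data and then extended with the attack behavior data. The class name `DocumentStats`, the helper `term_frequencies` and the tiny sample documents are hypothetical; with the totals of FIG. 7 the same addition would give 18888 + 25 = 18913 behavior description documents.

```python
from collections import Counter

class DocumentStats:
    """Running document counts shared by the first and second stages."""

    def __init__(self):
        self.total_docs = 0
        self.doc_freq = Counter()  # word -> number of documents containing it

    def add_documents(self, documents):
        """Fold a batch of tokenized documents into the running counts."""
        for tokens in documents:
            self.total_docs += 1
            self.doc_freq.update(set(tokens))  # count each word once per document

def term_frequencies(tokens):
    """Occurrences of each word divided by the document's total word count."""
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}

stats = DocumentStats()
stats.add_documents([["ls", "pwd"], ["ls", "cd", "ls"]])    # first stage (normal behavior)
stats.add_documents([["meow", "meow", "Selfrep", "arm5"]])  # second stage (honeypot)
tf = term_frequencies(["meow", "meow", "Selfrep", "arm5"])
```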

In some embodiments, the importance calculation module 760 is configured to calculate an importance score for each keyword. The importance calculation module 760 may determine the importance score of each keyword in the plurality of candidate keyword sets based on the frequency of occurrence of the keyword in the attack behavior data 210 and among the keywords of the non-attack behavior data 205, wherein the importance score of a candidate keyword indicates the degree to which the candidate keyword is capable of characterizing an attack behavior. In general, the more frequently a keyword occurs in a given candidate keyword set, and the less frequently it occurs across all keyword sets of the attack behavior data 210 and the non-attack behavior data 205, the more representative the keyword is of a particular attack behavior.

In some embodiments, the importance scores of keywords may be calculated by word frequency-inverse document frequency (TF-IDF). The TF value may be determined from the word frequency of each keyword of the plurality of candidate keyword sets in its corresponding candidate keyword set, e.g., by counting the number of occurrences of the current keyword in the current behavior description document. The IDF value may be determined from the inverse document frequency of each keyword across the plurality of reference keyword sets and the plurality of candidate keyword sets, e.g., by dividing the total number of behavior description documents by the number of behavior description documents containing the word. The importance score of each keyword is then determined based on its IDF and TF, the importance score of a keyword being the product of its TF value and its IDF value. A higher importance score indicates that the keyword appears more frequently in the current behavior and less frequently across all behaviors, and is therefore more representative. In the example shown in fig. 7, in the candidate keyword set 765 corresponding to host ID 1 and session ID 365, the importance score of the keyword meow is 12.5, the importance score of Selfrep is 11.1, and the importance score of arm5 is 2.0. The scheme of the present disclosure can thus screen out representative keywords in the attack behavior data 210 by exploiting the word frequency differences between the non-attack behavior data 205 and the attack behavior data 210.
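The TF-IDF scoring can be sketched as below. Two points are assumptions of this sketch rather than statements of the disclosure: TF is taken as the relative frequency within the document, and a natural logarithm is applied to the document ratio, as is common practice for IDF. The corpus statistics in the example are invented for illustration and do not reproduce the scores of FIG. 7.

```python
import math
from collections import Counter

def tf_idf_scores(doc_tokens, total_docs, doc_freq):
    """Score each word of one behavior description document as TF * IDF.

    TF  = occurrences of the word / total words in the document.
    IDF = log(total_docs / documents containing the word); the log is an assumption.
    """
    counts = Counter(doc_tokens)
    n_words = len(doc_tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / n_words
        idf = math.log(total_docs / doc_freq[word])
        scores[word] = tf * idf
    return scores

# Hypothetical statistics: 1000 documents overall, "ls" present in 800 of them.
doc_freq = {"meow": 1, "Selfrep": 1, "ls": 800}
scores = tf_idf_scores(["meow", "meow", "Selfrep", "ls"], 1000, doc_freq)
```

A word such as `ls` that is common in the non-attack behavior data receives a score close to zero, while words seen only in the honeypot session are ranked highest.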

Table 2 below shows the importance scores corresponding to the keywords of the examples.

TABLE 2 exemplary keywords and importance scores thereof

Keyword	Importance score
meow	12.8
Selfrep	10.8
curl	1.2
fsSL	2.5
......	......

In some embodiments, the keyword determination module 775 is configured to determine at least one target keyword set from the plurality of candidate keyword sets based on the importance scores of the individual keywords in the plurality of candidate keyword sets, wherein each target keyword set may be used to characterize an attack.

In some embodiments, a predetermined number (e.g., 10) of keywords having the highest importance scores are extracted from each candidate keyword set. Each keyword thus extracted is representative, so the extracted keywords form a summary of the entire honeypot instruction sequence set, and different honeypot instruction sequence sets can be distinguished by their summaries. Since the keyword extraction module 630 is initialized with the non-attack behavior data 205, words belonging to normal behavior receive low importance scores even when they appear in the collected attack behavior data 210, which helps ensure that the extracted honeypot behavior keywords belong to malicious behavior. In the example shown in fig. 7, the determined keywords meow and Selfrep are included in the candidate keyword set 780 corresponding to host ID 1 and session ID 365, and the determined keywords ls and pwd are included in the candidate keyword set 785 corresponding to host ID 2 and session ID 365.
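Selecting the highest-scoring keywords can be sketched as a simple ranking. The function name `top_keywords` and the alphabetical tie-break are assumptions of the sketch; the scores are those given for candidate keyword set 765 in FIG. 7, and `n=2` is chosen only to keep the example small.

```python
def top_keywords(scores, n=10):
    """Return the n highest-scoring keywords, best first (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

scores = {"meow": 12.5, "Selfrep": 11.1, "arm5": 2.0}
summary = top_keywords(scores, n=2)
```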

To compress and merge the candidate keyword sets 635, the candidate keyword sets may be input to the keyword cluster compression module 640. The plurality of candidate keyword sets (e.g., all candidate keyword sets) may be clustered by a clustering algorithm to obtain at least one keyword set cluster, where each keyword set cluster may include at least two candidate keyword sets. In some embodiments, the Single-Pass clustering algorithm may be used, with features based on a bag-of-words model suitable for text clustering, so that when new data is added the clustering result can be updated directly on the basis of the current clustering result. Other clustering algorithms may also be selected, such as, but not limited to, K-Means, AGNES, and the like.

Each keyword set cluster comprises a plurality of candidate keyword sets, and the candidate keyword sets included in one keyword set cluster may have a certain similarity, for example, they may originate from the same intrusion behavior (attack behavior). For each keyword set cluster, an active keyword set is selected from the at least two candidate keyword sets included in the keyword set cluster based on the importance scores of the keywords included in the keyword set cluster. In some embodiments, for each of the at least two candidate keyword sets, the number of keywords in the candidate keyword set whose importance score exceeds a first predetermined threshold is determined, and the active keyword set is selected from the at least two candidate keyword sets based on that number. For example, the candidate keyword set having the largest number of keywords with importance scores greater than the threshold may be used as the active keyword set of the keyword set cluster in which it is located. The first predetermined threshold may be related to the proportion between the non-attack behavior data 205 and the attack behavior data 210; keywords whose importance scores exceed the first predetermined threshold are, with high probability, unique to the attack behavior data 210. The specific value of the first predetermined threshold is not limited.
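The selection of an active keyword set within one cluster can be sketched as below. For simplicity the sketch assumes one importance score per keyword, whereas scores may in practice differ per behavior description document; the threshold value 5.0 and the score for "ls" are hypothetical.

```python
def select_active_set(cluster, scores, first_threshold):
    """Pick the candidate set with the most keywords scoring above the threshold."""
    def strong_count(keywords):
        return sum(1 for word in keywords if scores.get(word, 0.0) > first_threshold)
    return max(cluster, key=strong_count)

scores = {"meow": 12.5, "Selfrep": 11.1, "arm5": 2.0, "ls": 0.3}
cluster = [["meow", "ls"], ["meow", "Selfrep", "arm5"]]
active = select_active_set(cluster, scores, first_threshold=5.0)
```

When several candidate sets tie, `max` returns the first of them; the disclosure does not prescribe a tie-break.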

After this selection, each keyword set cluster has one active keyword set, which represents the representative behavior of that keyword set cluster. In some embodiments, the target keyword set may be determined based on the selected active keyword set. In some embodiments, the target keyword set may be obtained by deleting at least one keyword from the active keyword set based on a comparison of the importance scores of the individual keywords in the active keyword set with a second predetermined threshold. For example, the active keyword set may be cleaned so that only keywords with importance scores greater than the second predetermined threshold are kept and all other keywords are removed. In some embodiments, the trigger word of the target keyword set is determined based on at least one of the importance scores of the individual keywords in the active keyword set and the character lengths of the individual keywords. For example, for each active keyword set, one keyword whose importance score is greater than the average importance score of the keywords in the active keyword set and whose character length is greater than a predetermined length (e.g., 4) is selected (e.g., the first keyword satisfying the condition, or any keyword satisfying the condition) and labeled as the trigger word.
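The cleaning step and the trigger word selection can be sketched together. The sketch assumes that the average score is taken over the active keyword set before cleaning and that the first qualifying keyword is chosen; the scores for "xmrig" and "miner" and the threshold 2.0 are hypothetical values.

```python
def clean_and_pick_trigger(active_set, scores, second_threshold, min_length=4):
    """Drop low-scoring keywords, then pick the first qualifying trigger word.

    A trigger word must score above the mean of the active set and be
    longer than `min_length` characters; None is returned if none qualifies.
    """
    mean_score = sum(scores[w] for w in active_set) / len(active_set)
    target_set = [w for w in active_set if scores[w] > second_threshold]
    trigger = next(
        (w for w in target_set if scores[w] > mean_score and len(w) > min_length),
        None,
    )
    return target_set, trigger

scores = {"xmrig": 9.0, "miner": 6.0, "curl": 1.2}
target_set, trigger = clean_and_pick_trigger(["xmrig", "miner", "curl"], scores, 2.0)
```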

In some embodiments, where there are multiple keyword set clusters, those of the active keyword sets selected for the multiple keyword set clusters that have the same trigger word may be merged. For example, if the trigger words of a first active keyword set and a second active keyword set are the same, the first active keyword set and the second active keyword set are merged to obtain one target keyword set. In this way, the number of target keyword sets can be further reduced, and different target keyword sets comprise different trigger words. The merging may be performed by traversing all active keyword sets, merging those with the same trigger word, and removing repeated words from the merged result. For example, the union of the multiple active keyword sets having the same trigger word may be taken. In the example embodiments of the present disclosure, the trigger words serve at least two purposes. First, a trigger word can be used as a pre-matching object in the generated rule to trigger detection, so that no subsequent detection is needed when the trigger word is not matched. Second, multiple active keyword sets can be merged through their trigger words, which further reduces the number of keyword sets, reduces the amount of computation, saves resources, and improves efficiency.

Fig. 8 shows a specific structure of the keyword cluster compression module 640 in the attack activity determination subsystem 230. The target keyword set 645 may be obtained by the keyword cluster compression module 640. Fig. 8 illustrates a schematic diagram of compressing and merging candidate keyword sets by a keyword cluster compression module 640, according to some embodiments of the present disclosure. The keyword cluster compression module 640 may include a clustering module 820 and a screening module 840, as shown in fig. 8, and first performs cluster compression on candidate keyword sets through the clustering module 820 to obtain keyword set clusters (as shown in fig. 8, keyword set cluster 1 825, keyword set cluster 2 830 and keyword set cluster 3 835), where each keyword set cluster includes multiple sets of candidate keyword sets with similarity. The screening module 840 is configured to screen active keyword sets and trigger words in the keyword set clusters.

As shown in fig. 8, the active keyword set 845 with host ID 1 and session ID 365 includes { meow; selfrep } and has the trigger word meow; the active keyword set 850 with host ID 1 and session ID 277 includes { meow; watchd0g } and has the trigger word meow; and the active keyword set 855 with host ID 2 and session ID 365 includes { xmrig; miner } and has the trigger word xmrig. Because the active keyword set 845 and the active keyword set 850 have the same trigger word, meow, the two active keyword sets can be merged and de-duplicated. The merged target keyword set 860 includes { meow; selfrep; watchd0g }, that is, all words of the two active keyword sets without repetition. Here, the fact that the session ID corresponding to the target keyword set 860 is the same as the session ID corresponding to the active keyword set 845 is merely an example. Since the merging is based on trigger words and is unrelated to the host ID and session ID, the host ID and session ID corresponding to the target keyword set 860 may be those corresponding to the active keyword set 845 and/or the active keyword set 850.
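The merging of active keyword sets by trigger word, using the values of FIG. 8, can be sketched as follows. Representing each active keyword set as a `(trigger, keywords)` pair and the function name `merge_by_trigger` are assumptions of the sketch.

```python
def merge_by_trigger(active_sets):
    """Union active keyword sets that share a trigger word, preserving order."""
    merged = {}
    for trigger, keywords in active_sets:
        bucket = merged.setdefault(trigger, [])
        for word in keywords:
            if word not in bucket:  # repeated-word removal
                bucket.append(word)
    return merged

active_sets = [
    ("meow", ["meow", "selfrep"]),
    ("meow", ["meow", "watchd0g"]),
    ("xmrig", ["xmrig", "miner"]),
]
target_sets = merge_by_trigger(active_sets)
```

The result holds one target keyword set per distinct trigger word, so three active keyword sets are reduced to two target keyword sets.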

The target keyword sets obtained by the attack activity determination subsystem 230 may be provided to the behavior detection subsystem 240. Each target keyword set may be regarded as an attack behavior detection rule. The behavior detection subsystem 240 is configured to perform attack behavior detection on a target user host (e.g., target user host 140 in fig. 1) based on at least one target keyword set.

In some embodiments, the behavior detection subsystem 240 may determine an attack behavior detection result on the target user host based at least on a degree of matching between at least one user instruction collected from the target user host and the at least one target keyword set, wherein the attack behavior detection result may indicate whether an attack behavior has occurred. For example, a user instruction is matched against the target keyword sets, and if the user instruction matches one of the target keyword sets to a high degree (e.g., the user instruction matches a plurality of keywords in that target keyword set), the user instruction may be considered to indicate the occurrence of an attack behavior. The degree of matching may be computed as a similarity, for example by converting the character strings into vectors and computing a vector distance.

In some embodiments, each of the at least one target keyword set includes a trigger word and at least one keyword. In this case, determining the attack behavior detection result based at least on the degree of matching between the at least one user instruction collected from the target user host and the at least one target keyword set includes determining the degree of matching between a first user instruction collected from the target user host and the trigger word in the at least one target keyword set. Specifically, for each target keyword set, a rule that matches the trigger word is generated; for example, a hit is indicated if a user instruction (e.g., a shell instruction) is detected to contain the trigger word. An attack behavior detection rule may also be generated for the keywords in the target keyword set; for example, a hit is indicated if an executed user instruction is detected to contain keywords in the target keyword set.

In some embodiments, the attack activity determination subsystem 230 may generate at least one attack behavior detection rule based on the at least one target keyword set, respectively, and issue the attack behavior detection rule to the behavior detection subsystem 240. Specifically, in the attack activity determination subsystem 230, the target keyword sets output by the keyword cluster compression module 640 are input to the detection rule generation module 650, which generates attack behavior detection rules adapted to the HIDS. The detection rule generation module 650 generates the at least one attack behavior detection rule from the at least one target keyword set, respectively.

By automatically converting the attack behavior data 210 into HIDS detection rules, automatic sensing and detection of unknown threats can be realized, and the attack behavior detection capability of HIDS is improved.

Fig. 9 illustrates a flow diagram for generating attack detection rules 655 according to some embodiments of the present disclosure. As shown in fig. 9, assume that a target keyword set 860 and a target keyword set 865 are obtained by the keyword cluster compression module 640. From the target keyword set 860, an attack detection rule 910 with the trigger word "meow" may be determined, and from the target keyword set 865, an attack detection rule 915 with the trigger word "xmrig" may be determined. Rules that match the keywords in the target keyword sets may also be derived, such as an attack detection rule 920 that matches the keywords [meow; selfrep; watchd0g] and an attack detection rule 925 that matches the keywords [xmrig; miner].
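The conversion of target keyword sets into a trigger word rule and a keyword rule can be sketched as follows. The `DetectionRule` structure and the use of case-insensitive substring matching are assumptions of the sketch; an actual HIDS will have its own rule format, which is not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DetectionRule:
    """Hypothetical rule: a pre-match trigger word plus the keywords to count."""
    trigger: str
    keywords: tuple

def generate_rules(target_sets):
    """Turn {trigger: keywords} target keyword sets into detection rules."""
    return [DetectionRule(trigger, tuple(words)) for trigger, words in target_sets.items()]

def trigger_hit(rule, instruction):
    """Trigger word rule: hit if the instruction contains the trigger word."""
    return rule.trigger.lower() in instruction.lower()

def keywords_hit(rule, instruction):
    """Keyword rule: return the distinct rule keywords contained in the instruction."""
    text = instruction.lower()
    return {word for word in rule.keywords if word.lower() in text}

rules = generate_rules({
    "meow": ["meow", "selfrep", "watchd0g"],
    "xmrig": ["xmrig", "miner"],
})
hits = keywords_hit(rules[0], "/meow meow.Selfrep.arm5")
```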

If it is determined that the first user instruction matches a first trigger word in a first target keyword set, a plurality of user instructions from the target user host are obtained based on the first user instruction. In some embodiments, the plurality of user instructions are obtained based on the host ID and session ID of the first user instruction. The attack behavior detection result is then determined based on the degree of matching between the plurality of user instructions and at least one keyword in the first target keyword set. For example, when a user instruction is detected to contain a trigger word, data within a predetermined time (for example, 30 minutes) that has the same host ID and session ID as the current user instruction is filtered out, and the number of distinct keywords of the attack behavior detection rule that are hit in that data is counted; if that number is greater than a preset threshold, an alarm is raised. In other words, a user instruction is first matched against the trigger word rule, and if the trigger word rule is not hit, the user instruction is ignored. If the trigger word rule is hit, the keyword rule is matched, and if the keyword rule is not hit, the user instruction is ignored. Here, the data within the predetermined time may be historical data traced back over the predetermined time, or data traced forward over the predetermined time from the current time point, for example, from the current time point to 30 minutes later.
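The two-stage matching with a time window can be sketched as below. The event tuple layout, the function name `detect`, the sample instructions and the threshold of 1 are hypothetical, and the sketch implements only the look-back variant of the window (historical data within the preceding 30 minutes).

```python
def detect(events, trigger, keywords, window_seconds=1800, threshold=1):
    """Two-stage check over (timestamp, host_id, session_id, instruction) events.

    Returns the set of (host_id, session_id) pairs that raise an alarm.
    """
    alarms = set()
    for ts, host, session, instruction in events:
        if trigger not in instruction:  # stage 1: trigger word pre-match
            continue
        # Stage 2: look back over the same host and session within the window.
        hits = set()
        for ts2, host2, session2, instruction2 in events:
            same_session = (host2, session2) == (host, session)
            if same_session and ts - window_seconds <= ts2 <= ts:
                hits.update(w for w in keywords if w in instruction2)
        if len(hits) > threshold:  # number of distinct keywords hit
            alarms.add((host, session))
    return alarms

events = [
    (0, 7, 1, "chmod +x watchd0g"),
    (100, 7, 1, "./meow.selfrep"),
    (100, 8, 2, "ls -la"),
    (5000, 9, 3, "cat meow.txt"),
]
alarms = detect(events, "meow", ["meow", "selfrep", "watchd0g"])
```

Session (9, 3) contains the trigger word but hits only one distinct keyword, so it stays below the threshold, while most instructions never pass the inexpensive trigger word check at all.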

According to the technical scheme of the present disclosure, the generation of the attack behavior detection rules comprises screening one or more target keyword sets capable of characterizing attack behaviors or malicious behaviors from the attack behavior data collected by the honeypot host, wherein each target keyword set comprises keywords extracted from an instruction sequence that initiates an attack behavior. The target keyword sets are determined with reference to the non-attack behavior data collected from the user host. By comparison with the keywords of the instruction sequences in the non-attack behavior data, the keywords that accurately characterize attack behaviors can be screened out. In this way, by combining the attack behavior data from the honeypot with the non-attack behavior data from the user host, the attack behavior data of the honeypot can be automatically converted into rules for behavior detection, i.e., for target keyword matching.

Fig. 10 gives a specific example of a rule distribution flow according to some embodiments of the present disclosure. As shown in fig. 10, after the detection rule issuing and adjusting module 660 determines the attack detection rule 1 and the attack detection rule 2, rule issuing (for example, issuing a specific detection rule, or issuing a target keyword set) may be performed. The published attack detection rules may be applied in HIDS (e.g., HIDS 106 or HIDS 104 in fig. 1). And providing an alarm result in the case that the HIDS detects the attack according to the rule.

In some embodiments, the generated attack detection rules may be pushed to the HIDS for incremental updating by the detection rule issuing and adjustment module 660. In some embodiments, additional attack behavior data is collected from the at least one honeypot host, and the attack activity determination subsystem 230 updates the currently existing attack behavior detection rules, e.g., the determined at least one target keyword set, based on the additional attack behavior data. The additional attack behavior data may be collected periodically or in real time, and the target keyword sets may likewise be updated periodically or in real time, so that attack detection rules are produced continuously. For example, the attack detection rules may be iteratively updated every hour or every day, so that the HIDS can apply the latest attack detection rules in time. As shown in FIG. 2, for different attack behavior data 210-1, ..., 210-J, the determined target keyword sets may be further updated based on the embodiments described above. Updating the target keyword sets includes adding one or more new target keyword sets, adding, deleting or modifying keywords or trigger words in existing target keyword sets, deleting existing target keyword sets, and the like. The updated attack behavior detection rules may continue to be published and used for attack behavior detection on the target user host. Because the honeypot host continuously collects attack behavior data to update the attack behavior detection rules, new attack patterns can be tracked and the security of the user host can be continuously ensured.

In the scheme of the present disclosure, since normal behavior data is used for initialization, the false alarm rate of the generated rules is low. Meanwhile, the generated attack behavior detection rules have generalization capability and can detect variant behaviors of some malicious samples. The overall detection logic first matches the trigger words and only then matches the keywords, which can effectively improve the efficiency of detection on user instruction sequences.

It should be understood that the example instruction sequences and rule codes are given in the figures and description for illustrative purposes only, and this does not impose any limitation on the scope of the present disclosure. Embodiments of the present disclosure may be similarly applied to any other type of instruction sequences and code.

Example procedure

Fig. 11 illustrates a flow chart of a method 1100 for attack behavior detection according to some embodiments of the present disclosure. The method 1100 may be implemented by an electronic device, which may include, for example, the attack detection system 200 of fig. 2.

As shown in fig. 11, at block 1110, the electronic device obtains attack data collected from at least one honeypot host and non-attack data collected from at least one user host.

At block 1120, the electronic device screens keywords from the attack data according to the keywords in the non-attack data and the keywords in the attack data to generate at least one attack detection rule, wherein each attack detection rule includes at least one keyword for characterizing an attack.

At block 1130, the electronic device performs attack detection on the target user host based on at least one attack detection rule.

In some embodiments, screening keywords from the attack data to generate at least one attack detection rule includes: determining a plurality of honeypot instruction sequence sets from the attack data, each honeypot instruction sequence set including an instruction sequence from a session on a honeypot host; extracting a plurality of candidate keyword sets from the plurality of honeypot instruction sequence sets, respectively; determining at least one target keyword set from the plurality of candidate keyword sets based on the frequency of occurrence of individual keywords in the attack data and the non-attack data, each target keyword set including at least one keyword for characterizing an attack; and generating at least one attack detection rule from the at least one target keyword set, respectively.

In some embodiments, determining at least one target keyword set from the plurality of candidate keyword sets includes determining an importance score for each keyword in the plurality of candidate keyword sets based on the frequency of occurrence of the keyword in the attack data and among the keywords in the non-attack data, the importance score of a candidate keyword indicating the degree to which the candidate keyword is capable of characterizing an attack, and determining the at least one target keyword set from the plurality of candidate keyword sets based on the importance scores of the keywords in the plurality of candidate keyword sets.

In some embodiments, the non-attack behavior data includes a plurality of user instruction sets collected from at least one user host, and a plurality of reference keyword sets are extracted from the plurality of user instruction sets, respectively. Determining the importance score for each keyword in the plurality of candidate keyword sets then includes determining an inverse document frequency (IDF) of each keyword of the plurality of candidate keyword sets across the plurality of reference keyword sets and the plurality of candidate keyword sets, determining a word frequency (TF) of each keyword of the plurality of candidate keyword sets in its corresponding candidate keyword set, and determining the importance score of each keyword based on the IDF and the TF of that keyword.

In some embodiments, determining at least one target keyword set from the plurality of candidate keyword sets based on the importance scores of the respective keywords in the plurality of candidate keyword sets includes clustering the plurality of candidate keyword sets to obtain at least one keyword set cluster, each keyword set cluster including at least two candidate keyword sets, and for each keyword set cluster, selecting an active keyword set from the at least two candidate keyword sets included in the keyword set cluster based on the importance scores of the respective keywords included in the keyword set cluster, and determining the target keyword set based on the selected active keyword set.

In some embodiments, for each keyword set cluster, selecting an active keyword set from the at least two candidate keyword sets included in the keyword set cluster includes, for each candidate keyword set in the at least two candidate keyword sets, determining the number of keywords in the candidate keyword set whose importance score exceeds a first predetermined threshold, and selecting the active keyword set from the at least two candidate keyword sets based on the determined number of keywords exceeding the first predetermined threshold.

In some embodiments, determining the target set of keywords based on the selected active set of keywords includes deleting at least one keyword from the active set of keywords based on a comparison of the importance scores of the individual keywords in the active set of keywords to a second predetermined threshold, resulting in the target set of keywords.

In some embodiments, determining at least one target keyword set from the plurality of candidate keyword sets based on the importance scores of the respective keywords in the plurality of candidate keyword sets includes determining a trigger word for the target keyword set based on at least one of the importance scores of the respective keywords in the active keyword set and the character lengths of the respective keywords.

In some embodiments, the at least one keyword set cluster comprises a plurality of keyword set clusters, and determining the at least one target keyword set further comprises: if the trigger words of a first active keyword set and a second active keyword set, among the plurality of active keyword sets selected for the plurality of keyword set clusters, are the same, merging the first active keyword set and the second active keyword set to obtain the target keyword set, the target keyword set comprising at least the same trigger word.

In some embodiments, performing attack detection on the target user host includes determining whether at least one user instruction collected from the target user host satisfies at least one attack detection rule, and determining an attack detection result on the target user host based on that determination, the attack detection result indicating whether an attack has occurred. In some embodiments, each of the at least one attack detection rule includes a trigger word and at least one keyword, and determining whether the at least one user instruction collected from the target user host satisfies the at least one attack detection rule includes: determining a degree of matching between a first user instruction collected from the target user host and the trigger word in the at least one attack detection rule; if it is determined that the first user instruction matches a first trigger word in a first attack detection rule, obtaining a plurality of user instructions from the target user host based on the first user instruction; and determining whether the first attack detection rule is satisfied based on the degree of matching between the plurality of user instructions and the at least one keyword in the first attack detection rule.

In some embodiments, obtaining the plurality of user instructions from the target user host based on the first user instruction includes obtaining the plurality of user instructions from the target user host based on a host identifier and a session identifier of the first user instruction.

In some embodiments, the electronic device may further collect additional attack data from the at least one honeypot host, update the at least one attack detection rule based on the additional attack data, and perform attack detection on the target user host based on the updated at least one attack detection rule.

Fig. 12 shows a schematic block diagram of an apparatus 1200 for attack behavior detection according to some embodiments of the present disclosure. The apparatus 1200 may be implemented in an electronic device, for example, for implementing the attack detection system 200 of fig. 2 or one or more subsystems therein. The various modules/components in apparatus 1200 may be implemented in hardware, software, firmware, or any combination thereof.

As shown in fig. 12, the apparatus 1200 includes a data acquisition module 1210 configured to acquire attack behavior data acquired from at least one honeypot host and non-attack behavior data acquired from at least one user host. The apparatus 1200 further comprises a rule generation module 1220 configured to generate at least one attack behavior detection rule from the keywords in the non-attack behavior data and the keywords in the attack behavior data by screening the keywords from the attack behavior data, wherein each attack behavior detection rule comprises at least one keyword for characterizing an attack behavior. The apparatus 1200 further comprises a behavior detection module 1230 configured to perform attack behavior detection on the target user host based on at least one attack behavior detection rule.

In some embodiments, the rule generation module 1220 is configured to: determine a plurality of honeypot instruction sequence sets from the attack behavior data, each honeypot instruction sequence set including an instruction sequence from a session on a honeypot host; extract a plurality of candidate keyword sets from the plurality of honeypot instruction sequence sets, respectively; determine at least one target keyword set from the plurality of candidate keyword sets based on the frequency of occurrence of each keyword of the plurality of candidate keyword sets among the keywords in the attack behavior data and the keywords in the non-attack behavior data, each target keyword set including at least one keyword for characterizing an attack behavior; and generate the at least one attack behavior detection rule from the at least one target keyword set, respectively.
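
The first two steps, grouping honeypot instructions by session and extracting one candidate keyword set per session, might look like the sketch below. The record layout `(host_id, session_id, command)`, the function name `candidate_keyword_sets`, and the whitespace tokenization are assumptions made for illustration; the keyword extraction actually used may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (host_id, session_id, command), a hypothetical layout


def candidate_keyword_sets(records: Iterable[Record]) -> Dict[Tuple[str, str], List[str]]:
    """Group commands by honeypot session and split each command into keywords."""
    sessions: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for host_id, session_id, command in records:
        # Assumption: a keyword is any whitespace-separated token of the command.
        # Repeats are kept so that a term frequency can be computed later.
        sessions[(host_id, session_id)].extend(command.split())
    return dict(sessions)
```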

In some embodiments, the rule generation module 1220 is configured to determine an importance score for each keyword in the plurality of candidate keyword sets based on the frequency of occurrence of that keyword among the keywords in the attack behavior data and the keywords in the non-attack behavior data, the importance score of a candidate keyword indicating the degree to which the candidate keyword can characterize an attack behavior, and to determine the at least one target keyword set from the plurality of candidate keyword sets based on the importance scores of the keywords in the plurality of candidate keyword sets.

In some embodiments, the non-attack behavior data includes a plurality of user instruction sequence sets collected from at least one user host, and a plurality of reference keyword sets are extracted from the plurality of user instruction sequence sets, respectively. The rule generation module 1220 is further configured to: determine an inverse document frequency (IDF) of each keyword in the plurality of candidate keyword sets across the plurality of reference keyword sets and the plurality of candidate keyword sets; determine a term frequency (TF) of each keyword in the plurality of candidate keyword sets within the corresponding candidate keyword set; and determine the importance score of each keyword based on the IDF and the TF of that keyword.
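
As a rough illustration of the TF and IDF computation, the sketch below treats every candidate keyword set and every reference keyword set as one document. The function name `importance_scores`, the unsmoothed form log(N / df) for the IDF, and the use of the product TF × IDF as the importance score are assumptions; the paragraph above only states that the score is based on both quantities.

```python
import math
from collections import Counter
from typing import Dict, List


def importance_scores(candidate_sets: List[List[str]],
                      reference_sets: List[List[str]]) -> List[Dict[str, float]]:
    """Return, for each candidate keyword set, a mapping keyword -> TF * IDF."""
    documents = candidate_sets + reference_sets
    n_docs = len(documents)

    # Document frequency: number of keyword sets (honeypot or user) containing the keyword.
    doc_freq: Counter = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for candidate in candidate_sets:
        counts = Counter(candidate)
        total = len(candidate)
        scores.append({
            # TF is measured inside this candidate set, IDF across all sets.
            kw: (counts[kw] / total) * math.log(n_docs / doc_freq[kw])
            for kw in counts
        })
    return scores
```

Under these assumptions, a keyword that also appears in many user sessions receives a small IDF and therefore a low score, while a keyword seen only in honeypot sessions scores high.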

In some embodiments, the rule generation module 1220 is configured to cluster a plurality of candidate keyword sets, resulting in at least one keyword set cluster, each keyword set cluster comprising at least two candidate keyword sets, and for each keyword set cluster, select an active keyword set from the at least two candidate keyword sets comprised by the keyword set cluster based on the importance scores of the respective keywords comprised in the keyword set cluster, and determine a target keyword set based on the selected active keyword set.

In some embodiments, for each keyword set cluster, the rule generation module 1220 is further configured to determine, for each candidate keyword set of the at least two candidate keyword sets, the number of keywords in the candidate keyword set whose importance scores exceed a first predetermined threshold, and to select the active keyword set from the at least two candidate keyword sets based on the determined number of keywords exceeding the first predetermined threshold.
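
One way to read this selection step is sketched below. Each candidate keyword set is represented as a mapping from keyword to importance score, and the set with the most keywords above the first threshold is taken as the active set. The function name `select_active_set` and the "largest count wins" policy are assumptions; the paragraph above only says the selection is based on the count.

```python
from typing import Dict, List


def select_active_set(cluster: List[Dict[str, float]],
                      first_threshold: float) -> Dict[str, float]:
    """Pick the candidate keyword set with the most keywords scoring above the threshold."""
    def strong_count(keyword_set: Dict[str, float]) -> int:
        return sum(1 for score in keyword_set.values() if score > first_threshold)

    # max() returns the first set encountered when several sets tie on the count.
    return max(cluster, key=strong_count)
```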

In some embodiments, the rule generation module 1220 is configured to delete at least one keyword from the set of active keywords to obtain the set of target keywords based on a comparison of the importance scores of the individual keywords in the set of active keywords to a second predetermined threshold.
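
A minimal sketch of this pruning step follows, assuming that keywords scoring below the second threshold are the ones deleted. The direction of the comparison and the function name `prune_keywords` are assumptions made here for illustration.

```python
from typing import Dict


def prune_keywords(active_set: Dict[str, float], second_threshold: float) -> Dict[str, float]:
    """Drop keywords whose importance score falls below the second threshold."""
    return {kw: score for kw, score in active_set.items() if score >= second_threshold}
```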

In some embodiments, the rule generation module 1220 is configured to determine the trigger words of the target set of keywords based on at least one of the importance scores of the individual keywords in the active set of keywords and the character lengths of the individual keywords.
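
The trigger word choice can be illustrated with the following sketch, which prefers the keyword with the highest importance score and breaks ties by character length. This particular ordering and the function name `pick_trigger` are assumptions; the paragraph above allows either criterion or both.

```python
from typing import Dict


def pick_trigger(active_set: Dict[str, float]) -> str:
    """Choose a trigger word: highest importance score first, then longest keyword."""
    # Raises ValueError for an empty set, which has no valid trigger word.
    return max(active_set, key=lambda kw: (active_set[kw], len(kw)))
```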

In some embodiments, the at least one keyword set cluster comprises a plurality of keyword set clusters, and the rule generation module 1220 is further configured to: if, among a plurality of active keyword sets selected for the plurality of keyword set clusters, a first active keyword set and a second active keyword set have the same trigger word, combine the first active keyword set and the second active keyword set to obtain a target keyword set, the target keyword set including at least that trigger word.
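
The merge of active keyword sets that share a trigger word could be sketched as below, assuming that "combining" means taking the union of the keywords. The function name `merge_by_trigger` and the `(trigger, keywords)` pair representation are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def merge_by_trigger(active_sets: Iterable[Tuple[str, Set[str]]]) -> Dict[str, Set[str]]:
    """Union the active keyword sets that share a trigger word."""
    merged: Dict[str, Set[str]] = {}
    for trigger, keywords in active_sets:
        # Seeding with {trigger} guarantees the merged set contains the trigger word.
        merged.setdefault(trigger, {trigger}).update(keywords)
    return merged
```

Keying the result by trigger word means a later detection step needs only one lookup per trigger instead of scanning several rules that start from the same word.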

In some embodiments, the behavior detection module 1230 is configured to determine whether at least one user instruction collected from the target user host satisfies at least one attack behavior detection rule, and determine an attack behavior detection result on the target user host based on a determination of whether the at least one user instruction satisfies the at least one attack behavior detection rule, the attack behavior detection result indicating whether an attack behavior occurred.

In some embodiments, each of the at least one attack behavior detection rule includes a trigger word and at least one keyword, and the behavior detection module 1230 is further configured to: determine a degree of matching between a first user instruction collected from the target user host and the trigger word in each of the at least one attack behavior detection rule; if it is determined that the first user instruction matches a first trigger word in a first attack behavior detection rule, obtain a plurality of user instructions from the target user host based on the first user instruction; and determine whether the first attack behavior detection rule is satisfied based on the degree of matching between the plurality of user instructions and the at least one keyword in the first attack behavior detection rule.

In some embodiments, the behavior detection module 1230 is further configured to obtain a plurality of user instructions from the target user host based on the host identifier and the session identifier of the first user instruction.

In some embodiments, the rule generation module 1220 is further configured to collect additional attack behavior data from the at least one honeypot host, and update the at least one attack behavior detection rule based on the additional attack behavior data. The behavior detection module 1230 is further configured to perform attack behavior detection on the target user host based on the updated at least one attack behavior detection rule.

Fig. 13 illustrates a block diagram of an electronic device 1300 in which one or more embodiments of the disclosure can be implemented. It should be understood that the electronic device 1300 illustrated in fig. 13 is merely exemplary and should not be construed as limiting the functionality and scope of the embodiments described herein. The electronic device 1300 shown in fig. 13 may be used to implement the electronic device 130 of fig. 1.

As shown in fig. 13, the electronic device 1300 is in the form of a general-purpose electronic device. The components of electronic device 1300 may include, but are not limited to, one or more processors or processing units 1205, memory 1320, storage 1330, one or more communication units 1340, one or more input devices 1350, and one or more output devices 1360. The processing unit 1205 may be an actual or virtual processor and can execute various processes according to programs stored in the memory 1320. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capabilities of electronic device 1300.

The electronic device 1300 typically includes a number of computer storage media. Such media may be any available media accessible by the electronic device 1300, including, but not limited to, volatile and non-volatile media and removable and non-removable media. The memory 1320 may be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 1330 may be removable or non-removable media and may include machine-readable media such as flash drives, magnetic disks, or any other medium that is capable of storing information and/or data and that can be accessed within the electronic device 1300.

The electronic device 1300 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in fig. 13, a magnetic disk drive for reading from or writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data medium interfaces. Memory 1320 may include a computer program product 1325 having one or more program modules configured to perform the various methods or acts of the various embodiments of the present disclosure.

Communication unit 1340 enables communication with other electronic devices via a communication medium. Additionally, the functionality of the components of the electronic device 1300 may be implemented as a single computing cluster or as multiple computing machines capable of communicating over a communications connection. Thus, the electronic device 1300 may operate in a networked environment using logical connections to one or more other servers, a network Personal Computer (PC), or another network node.

The input device 1350 may be one or more input devices such as a mouse, keyboard, trackball, etc. The output device 1360 may be one or more output devices such as a display, speakers, printer, etc. The electronic device 1300 may also communicate with one or more external devices (not shown), such as storage devices, display devices, etc., with one or more devices that enable a user to interact with the electronic device 1300, or with any device (e.g., network card, modem, etc.) that enables the electronic device 1300 to communicate with one or more other electronic devices, as desired, via the communication unit 1340. Such communication may be performed via an input/output (I/O) interface (not shown).

According to an exemplary implementation of the present disclosure, a computer-readable storage medium having stored thereon computer-executable instructions, wherein the computer-executable instructions are executed by a processor to implement the method described above is provided. According to an exemplary implementation of the present disclosure, there is also provided a computer program product tangibly stored on a non-transitory computer-readable medium and comprising computer-executable instructions that are executed by a processor to implement the method described above.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, devices, and computer program products implemented according to the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The foregoing description of implementations of the present disclosure has been provided for illustrative purposes, is not exhaustive, and is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various implementations described. The terminology used herein was chosen in order to best explain the principles of each implementation, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand each implementation disclosed herein.

Claims

1. A method for attack behavior detection, comprising:

Determining the behavior data collected from at least one honeypot host as attack behavior data, and determining the behavior data collected from at least one user host as non-attack behavior data;

Screening keywords from the attack behavior data, according to keywords in the non-attack behavior data and keywords in the attack behavior data, to generate at least one attack behavior detection rule, wherein each attack behavior detection rule includes at least one keyword for characterizing an attack behavior, and wherein screening the keywords from the attack behavior data to generate the at least one attack behavior detection rule includes:

Determining a plurality of honeypot instruction sequence sets from the attack behavior data, each honeypot instruction sequence set comprising an instruction sequence from a session on a honeypot host;

Extracting a plurality of candidate keyword sets from the plurality of honeypot instruction sequence sets respectively;

Determining an importance score of each keyword in the plurality of candidate keyword sets based on the frequency of occurrence of that keyword among the keywords in the attack behavior data and the keywords in the non-attack behavior data;

Based on the importance scores of the respective keywords in the plurality of candidate keyword sets, determining at least one target keyword set from the plurality of candidate keyword sets, each target keyword set including at least one keyword for characterizing an attack behavior, the importance scores of the candidate keywords indicating the extent to which the candidate keywords can characterize the attack behavior; and

Generating the at least one attack behavior detection rule from the at least one target keyword set respectively; and

Based on the at least one attack behavior detection rule, attack behavior detection is performed on the target user host.

2. The method according to claim 1, wherein the non-attack behavior data comprises a plurality of user instruction sequence sets collected from the at least one user host, and a plurality of reference keyword sets are respectively extracted from the plurality of user instruction sequence sets; and

Determining the importance score of each keyword in the plurality of candidate keyword sets includes:

Determining an inverse document frequency (IDF) of each keyword in the plurality of candidate keyword sets across the plurality of reference keyword sets and the plurality of candidate keyword sets;

Determining a term frequency (TF) of each keyword in the plurality of candidate keyword sets within the corresponding candidate keyword set; and

Determining the importance score of each keyword based on the IDF and the TF of that keyword.

3. The method according to claim 1, wherein determining the at least one target keyword set from the plurality of candidate keyword sets based on the importance score of each keyword in the plurality of candidate keyword sets comprises:

Clustering the multiple candidate keyword sets to obtain at least one keyword set cluster, each keyword set cluster including at least two candidate keyword sets; and

For each keyword set cluster,

Based on the importance score of each keyword included in the keyword set cluster, selecting an active keyword set from at least two candidate keyword sets included in the keyword set cluster, and

A target keyword set is determined based on the selected active keyword set.

4. The method according to claim 3, wherein for each keyword set cluster, selecting an active keyword set from at least two candidate keyword sets included in the keyword set cluster comprises:

For each candidate keyword set of the at least two candidate keyword sets, determining the number of keywords in the candidate keyword set whose importance scores exceed a first predetermined threshold; and

An active keyword set is selected from the at least two candidate keyword sets based on the determined number of keywords exceeding the first predetermined threshold.

5. The method of claim 3, wherein determining the target keyword set based on the selected active keyword set comprises:

Based on the comparison between the importance score of each keyword in the active keyword set and a second predetermined threshold, at least one keyword is deleted from the active keyword set to obtain the target keyword set.

6. The method according to claim 3, wherein determining the at least one target keyword set from the plurality of candidate keyword sets based on the importance score of each keyword in the plurality of candidate keyword sets comprises:

The trigger words of the target keyword set are determined based on at least one of the importance score of each keyword in the active keyword set and the character length of each keyword.

7. The method of claim 6, wherein the at least one keyword set cluster comprises a plurality of keyword set clusters, and wherein determining the at least one target keyword set further comprises:

If, among a plurality of active keyword sets selected for the plurality of keyword set clusters, a first active keyword set and a second active keyword set have the same trigger word, merging the first active keyword set and the second active keyword set to obtain a target keyword set, the target keyword set including at least the same trigger word.

8. The method according to claim 1, wherein performing attack behavior detection on the target user host comprises:

Determining whether at least one user instruction collected from the target user host satisfies the at least one attack behavior detection rule; and

Determining an attack behavior detection result on the target user host based on the determination of whether the at least one user instruction satisfies the at least one attack behavior detection rule, the attack behavior detection result indicating whether an attack behavior occurs.

9. The method according to claim 8, wherein each of the at least one attack behavior detection rule comprises a trigger word and at least one keyword, and wherein determining whether at least one user instruction collected from the target user host satisfies the at least one attack behavior detection rule comprises:

Determining a degree of match between a first user instruction collected from the target user host and the trigger word in each of the at least one attack behavior detection rule;

If it is determined that the first user instruction matches the first trigger word in the first attack behavior detection rule, acquiring multiple user instructions from the target user host based on the first user instruction; and

Whether the first attack behavior detection rule is satisfied is determined based on the matching degree between the multiple user instructions and at least one keyword in the first attack behavior detection rule.

10. The method according to claim 9, wherein acquiring a plurality of user instructions from the target user host based on the first user instruction comprises:

Based on the host identifier and the session identifier of the first user instruction, the plurality of user instructions from the target user host are obtained.

11. The method according to claim 1, further comprising:

collecting additional attack behavior data from the at least one honeypot host;

updating the at least one attack behavior detection rule based on the additional attack behavior data; and

Attack behavior detection is performed on the target user host based on the updated at least one attack behavior detection rule.

12. A device for attack behavior detection, comprising:

A data acquisition module configured to determine the behavior data collected from at least one honeypot host as attack behavior data, and to determine the behavior data collected from at least one user host as non-attack behavior data;

A rule generation module is configured to filter keywords from the attack behavior data to generate at least one attack behavior detection rule based on keywords in the non-attack behavior data and keywords in the attack behavior data, wherein each attack behavior detection rule includes at least one keyword for characterizing the attack behavior, wherein the rule generation module is further configured to:

The behavior detection module is configured to perform attack behavior detection on the target user host based on the at least one attack behavior detection rule.

13. An electronic device, comprising:

at least one processing unit; and

At least one memory, the at least one memory being coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions causing the electronic device to perform the method according to any one of claims 1 to 11 when executed by the at least one processing unit.

14. A computer-readable storage medium having a computer program stored thereon, wherein the computer program can be executed by a processor to implement the method according to any one of claims 1 to 11.