
CN112632966B - Alarm information marking method, device, medium and equipment - Google Patents

Alarm information marking method, device, medium and equipment

Info

Publication number
CN112632966B
CN112632966B (application CN202011614604.7A)
Authority
CN
China
Prior art keywords
alarm information
context text
context
determining
lda model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011614604.7A
Other languages
Chinese (zh)
Other versions
CN112632966A (en)
Inventor
张润滋
刘文懋
陈磊
薛见新
吴复迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nsfocus Technologies Inc
Nsfocus Technologies Group Co Ltd
Original Assignee
Nsfocus Technologies Inc
Nsfocus Technologies Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nsfocus Technologies Inc, Nsfocus Technologies Group Co Ltd
Priority to CN202011614604.7A
Publication of CN112632966A
Application granted
Publication of CN112632966B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Alarm Systems (AREA)

Abstract

The invention relates to an alarm information marking method, device, medium and equipment. A pre-trained LDA model can be used to determine the topic distribution vector corresponding to the type of the currently received alarm information, as well as the topic distribution vector of each context text corresponding to that alarm information (each context text is formed from alarm sentences built from the alarm information and the types of its associated alarm information). The semantic deviation between the type of the currently received alarm information and each of its corresponding context texts can then be measured by the Euclidean distance between the topic distribution vectors. When a Euclidean distance value is large, the semantic deviation between the type of the currently received alarm information and the context text corresponding to that Euclidean distance value is considered large, a corresponding context anomaly tag is generated for the currently received alarm information, and the alarm information is flagged as possibly being high-risk alarm information with respect to that context text.

Description

Alarm information marking method, device, medium and equipment
Technical Field
The present invention relates to the field of network security technologies, and in particular, to a method, an apparatus, a medium, and a device for marking alarm information.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
A major challenge faced by security operation centers today is implementing network security management under limited human and cost constraints. Security operation centers must process a volume of alarm information that far exceeds the manual processing capability of security operators. This causes a serious alarm fatigue phenomenon: network security cannot be effectively maintained, security operators stop trusting the alarm information, and network security is further reduced.
To reduce alarm fatigue, existing schemes classify and grade alarm information through rule-driven static classification, experience-driven black-and-white lists, primary data frequency statistics, and the like, so as to distinguish high-risk alarm information (alarm information whose corresponding attack has a higher impact on system security) from low-risk alarm information (alarm information whose corresponding attack has a lower impact on system security) within a large amount of alarm information. This enables the discovery of high-risk alarm information, so that security operators can process it effectively and in a targeted manner.
However, current schemes for finding high-risk alarm information often cannot identify it in a timely and accurate manner, so the best opportunity to capture a threat is missed, leaving serious hidden hazards for the stable operation of the data assets and IT systems of enterprises and organizations.
Disclosure of Invention
The embodiments of the invention provide an alarm information marking method, device, medium and equipment, which are used to solve the problem of poor timeliness and accuracy in finding high-risk alarm information.
In a first aspect, the present invention provides an alarm information marking method, the method comprising:
if it is determined that the type of the currently received first alarm information belongs to one of the types of alarm information used to train a pre-trained latent Dirichlet allocation (LDA) model, determining the second alarm information received within a set time period before the first alarm information was received;
determining at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and using the LDA model to determine the topic distribution vector of each context text treated as a document;
determining the topic distribution vector of the type of the first alarm information, treated as a word of the LDA model, and determining the Euclidean distance value between that topic distribution vector and the topic distribution vector of each context text treated as a document;
and if at least one Euclidean distance value is greater than a set value, generating, for the first alarm information, a context anomaly tag corresponding to each Euclidean distance value greater than the set value.
Optionally, the method further comprises: for each Euclidean distance value greater than the set value, obtaining, according to a prior manual labeling result for the LDA model, the semantic description corresponding to a specified topic, where the specified topic is the topic corresponding to the context text associated with that Euclidean distance value; and outputting the Euclidean distance value, the context anomaly tag corresponding to it, and the semantic description of the specified topic corresponding to it;
the topic corresponding to a context text is determined according to the topic distribution vector of that context text treated as a document.
Optionally, the at least one context text includes a source context text, a destination context text, and a source-destination context text;
The source context text is formed from the alarm sentences whose source Internet protocol address is the same as that of the first alarm information;
the destination context text is formed from the alarm sentences whose destination Internet protocol address is the same as that of the first alarm information;
the source-destination context text is formed from the alarm sentences whose source and destination Internet protocol addresses are both the same as those of the first alarm information.
Optionally, using the LDA model to determine the topic distribution vector of each context text treated as a document comprises:
determining a vector corresponding to each context text, and determining, according to that vector and using the LDA model, the topic distribution vector of the context text treated as a document;
where the vector length of a context text equals the number of types of alarm information used to train the LDA model, and each vector value is the weight, within the context text, of one such alarm type, obtained according to a term frequency-inverse document frequency (TF-IDF) model.
Optionally, after determining at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and before using the LDA model to determine the topic distribution vector of each context text treated as a document, the method further comprises:
if the length of at least one of the determined context texts is less than a threshold, increasing the set time period and returning to the step of determining the second alarm information received within the set time period before the first alarm information was received.
Optionally, the method further comprises:
if the number of received alarm information items whose type does not belong to the types of alarm information used to train the pre-trained LDA model reaches a threshold, prompting that the LDA model needs to be retrained.
Optionally, the method further comprises:
determining at least one context text corresponding to the second alarm information according to the first alarm information and the second alarm information, and using the LDA model to determine the topic distribution vector of each context text treated as a document;
determining the topic distribution vector of the type of the second alarm information, treated as a word of the LDA model, and determining the Euclidean distance value between that topic distribution vector and the topic distribution vector of each context text treated as a document;
if at least one Euclidean distance value is greater than the set value, generating, for the second alarm information, a context anomaly tag corresponding to each Euclidean distance value greater than the set value;
where each context text corresponding to each piece of second alarm information is formed according to the alarm sentences corresponding to the first alarm information and each piece of second alarm information.
In a second aspect, the present invention also provides an alarm information marking device, where the device includes:
an analysis module, configured to: if it is determined that the type of the currently received first alarm information belongs to one of the types of alarm information used to train a pre-trained latent Dirichlet allocation (LDA) model, determine the second alarm information received within a set time period before the first alarm information was received; determine at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and use the LDA model to determine the topic distribution vector of each context text treated as a document; and determine the topic distribution vector of the type of the first alarm information, treated as a word of the LDA model, and determine the Euclidean distance value between that topic distribution vector and the topic distribution vector of each context text treated as a document;
and a marking module, configured to generate, for the first alarm information, a context anomaly tag corresponding to each Euclidean distance value greater than a set value, if at least one Euclidean distance value is greater than the set value.
In a third aspect, the present invention also provides a non-volatile computer storage medium storing an executable program for execution by a processor to implement the method as described above.
In a fourth aspect, the present invention further provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus;
the memory is used for storing a computer program;
the processor, when executing the program stored on the memory, implements the method steps described above.
According to the scheme provided by the embodiments of the invention, a pre-trained LDA model can be used to determine the topic distribution vector corresponding to the type of the currently received alarm information, as well as the topic distribution vector of each context text corresponding to that alarm information (each context text is formed from alarm sentences built from the alarm information and the types of its associated alarm information). The semantic deviation between the type of the currently received alarm information and each of its corresponding context texts can then be measured by the Euclidean distance between the topic distribution vectors. When a Euclidean distance value is large, the semantic deviation between the type of the currently received alarm information and the context text corresponding to that Euclidean distance value is considered large, a corresponding context anomaly tag is generated for the currently received alarm information, and the alarm information is flagged as possibly being high-risk alarm information with respect to that context text. Generating tags for the currently received alarm information in real time ensures the timeliness of high-risk alarm discovery. Measuring the semantic deviation between the type of the currently received alarm information and each corresponding context text through the Euclidean distance between topic distribution vectors, in order to decide whether a context anomaly tag needs to be generated, also effectively ensures the accuracy of high-risk alarm discovery.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
To illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the invention, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of an alarm information marking method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alert sentence according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a context provided by an embodiment of the present invention;
FIG. 4 is a schematic flow chart of obtaining a trained LDA model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an alarm information marking device according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of alarm information marking equipment according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, as used herein, "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates an "or" relationship between the surrounding objects.
The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The inventor has found through research that low-risk alarm information, such as scanning and informational alarms, often has a stable context, i.e., it is associated with the same types of alarm information. If one piece of alarm information is anomalous within its associated alarm information, it is often high-risk alarm information hidden in large-scale alarm information, possibly a real attack action by an attacker. The present application realizes screening of high-risk alarm information based on this finding.
To improve the timeliness and accuracy of finding high-risk alarm information, and to rapidly and accurately distinguish high-risk from low-risk alarm information, an effective means is required to describe the context in which a piece of alarm information is generated, for example, which alarm information is generated before and after it. The behavior environment at the time a piece of alarm information is generated needs to be described as a whole through a modeling and quantization method.
The latent Dirichlet allocation (LDA) model is a document topic generation model: a three-layer Bayesian probability model comprising word, topic, and document layers. In this model, each word of a document is obtained by first selecting a topic with a certain probability and then selecting a word from that topic with a certain probability; both the document-to-topic and topic-to-word distributions are multinomial. The purpose of the LDA model is to identify topics, i.e., to factor the document-word matrix into a document-topic matrix (distribution) and a topic-word matrix (distribution).
The present application builds a semantic model over sequences of alarm types by taking the type of alarm information (such as a scanning type, an information type, or an exploit type) as a word, a sequence of alarm types as a sentence, and a set of alarm-type sequences as a document, and learns latent context semantic relations from a corpus through an LDA model. The trained LDA model can then be used to dynamically identify, classify, and screen high-risk alarm information. Of course, a network security expert may further classify or rank the alarm information based on the recognition result.
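As a minimal, hypothetical sketch of this modeling (the patent does not prescribe a library; scikit-learn and the alarm type names below are assumptions for illustration), each document is treated as a bag of alarm-type tokens and an LDA model is fitted over the resulting document-word matrix:

```python
# Sketch only: each "document" is a set of alarm-type tokens; the type
# names ("scan", "info", "exploit", "webshell") are invented examples.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "scan scan info scan exploit",
    "scan info info scan scan",
    "exploit exploit webshell exploit",
    "scan scan scan info",
]

vectorizer = CountVectorizer()
doc_word = vectorizer.fit_transform(corpus)        # document-word count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(doc_word)            # document-topic distributions

# Each row of doc_topic is a topic distribution vector summing to 1.
print(doc_topic.shape)
```

Once fitted, the same model yields both the document-topic distributions used for context texts and the topic-word matrix queried later for individual alarm types.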
The embodiment of the invention provides an alarm information marking method, which can be shown in fig. 1 and comprises the following steps:
step 101, determining whether the type of the currently received alarm information belongs to one of the types of the alarm information for training the pre-trained LDA model.
To mark alarm information using the pre-trained LDA model, it may first be determined whether the type of the currently received alarm information (the first alarm information) belongs to one of the types of alarm information used to train the model (i.e., whether the type of the first alarm information corresponds to one of the model's words). If so, step 102 is performed; otherwise, the currently received alarm information cannot be marked using the pre-trained LDA model, and the process may end.
The currently received alarm information can be understood as the currently generated alarm information; that is, in this embodiment, the currently generated alarm information can be marked in real time, ensuring the timeliness of high-risk alarm discovery.
Step 102, determining second alarm information received in a set time period before receiving the first alarm information.
If the type of the first alarm information is determined to belong to one of the types used to train the pre-trained LDA model, the alarm information received within a set time period before the first alarm information (the second alarm information) may be determined in this step, so that it serves as the associated alarm information of the currently received alarm information and is used to determine the corresponding context texts.
Step 103, determining at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information.
In this embodiment, each context text corresponding to the first alarm information can be understood as being formed from the alarm sentences corresponding to the first and second alarm information, where one alarm sentence is formed from the alarm information items sharing the same source Internet Protocol (IP) address and the same destination IP address, with their types arranged in chronological order.
In other words, for the alarm information items with the same source IP address and destination IP address, an alarm sentence is formed by arranging their types in order of receiving time, from earliest to latest.
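The grouping just described can be sketched as follows; the alarm records and their field layout are invented for illustration:

```python
# Sketch only: group alarms by (source IP, destination IP) and order each
# group's types by arrival time, forming one "alarm sentence" per pair.
from collections import defaultdict

# (timestamp, source IP, destination IP, alarm type) -- invented records
alarms = [
    (1, "10.0.0.1", "10.0.0.9", "scan"),
    (3, "10.0.0.1", "10.0.0.9", "exploit"),
    (2, "10.0.0.1", "10.0.0.9", "info"),
    (1, "10.0.0.2", "10.0.0.9", "scan"),
]

sentences = defaultdict(list)
for ts, src, dst, alarm_type in sorted(alarms):    # earliest first
    sentences[(src, dst)].append(alarm_type)

print(sentences[("10.0.0.1", "10.0.0.9")])  # ['scan', 'info', 'exploit']
```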
In this embodiment, a context text corresponding to the first alarm information may be obtained by aggregating, at a certain granularity and by any aggregation method, the alarm sentences corresponding to the first and second alarm information, as long as the resulting context text represents context information in some sense.
In one possible implementation, three context texts corresponding to the first alarm information may be determined: a source context text, a destination context text, and a source-destination context text. Whether the alarm information is high-risk can then be judged from multiple dimensions according to the resulting multi-dimensional context texts.
The source context text is formed from the alarm sentences whose source Internet protocol address is the same as that of the first alarm information. It can be understood as the context environment of alarm information triggered by one attacker initiating attacks, reflecting the attacker's technique.
The destination context text is formed from the alarm sentences whose destination Internet protocol address is the same as that of the first alarm information. It can be understood as the context environment of alarm information triggered by attacks on one server, reflecting the attacked server's service characteristics and associated vulnerabilities.
The source-destination context text is formed from the alarm sentences whose source and destination Internet protocol addresses are both the same as those of the first alarm information. It can be understood as the context environment of alarm information triggered by one attacker attacking one server.
Taking as an example the determination of the source context text, destination context text, and source-destination context text corresponding to the first alarm information from the first and second alarm information, each context text may be determined as follows:
For the first and second alarm information, an alarm sentence can be formed from the alarm information items with the same source IP address and destination IP address, with their types arranged in chronological order, thereby forming at least one alarm sentence. A schematic of the resulting alarm sentences is shown in fig. 2 (fig. 2 includes 5 alarm sentences).
In fig. 2, one block represents one piece of alarm information, blocks with different fill patterns represent different types of alarm information (the block outlined by a dotted line represents the first alarm information), one dot represents one IP address, and the arrow points from the source IP address to the destination IP address. One alarm sentence can be understood as the chronological sequence of alarm types, represented by the blocks' fill patterns, generated between two dots.
Furthermore, the formed alarm sentences can be aggregated using the source IP, the destination IP, and the source-destination IP pair as keys, respectively, to form context texts of different granularities. A schematic of the resulting context texts is shown in fig. 3.
The alarm sentences whose source Internet protocol address is the same as that of the first alarm information are aggregated to form the source context text corresponding to the first alarm information.
The alarm sentences whose destination Internet protocol address is the same as that of the first alarm information are aggregated to form the destination context text corresponding to the first alarm information.
The alarm sentences whose source and destination Internet protocol addresses are both the same as those of the first alarm information are aggregated to form the source-destination context text corresponding to the first alarm information.
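The three aggregations can be sketched as follows, with the first alarm's addresses and the alarm sentences invented for illustration:

```python
# Sketch only: aggregate alarm sentences, keyed by (source IP, destination
# IP), into the source, destination, and source-destination context texts.
first_src, first_dst = "10.0.0.1", "10.0.0.9"      # from the first alarm

sentences = {                                       # invented alarm sentences
    ("10.0.0.1", "10.0.0.9"): ["scan", "info", "exploit"],
    ("10.0.0.1", "10.0.0.8"): ["scan", "scan"],
    ("10.0.0.2", "10.0.0.9"): ["scan"],
}

source_ctx = [t for (s, d), ts in sentences.items() if s == first_src for t in ts]
dest_ctx = [t for (s, d), ts in sentences.items() if d == first_dst for t in ts]
src_dst_ctx = list(sentences.get((first_src, first_dst), []))

print(source_ctx)  # ['scan', 'info', 'exploit', 'scan', 'scan']
print(dest_ctx)    # ['scan', 'info', 'exploit', 'scan']
```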
And 104, determining each context text as a topic distribution vector corresponding to the document by using the LDA model.
In this step, each context text corresponding to the first alarm information may be taken as input to the pre-trained LDA model, and the topic distribution vector of each context text treated as a document may be determined.
Specifically, each context text may first be vectorized; then, according to the vector corresponding to each context text, the pre-trained LDA model is used to determine the topic distribution vector of that context text treated as a document.
In one possible implementation, the vector length of a context text is the number of types of alarm information used to train the pre-trained LDA model, and each vector value is the weight, within the context text, of one such alarm type, obtained according to a term frequency-inverse document frequency (TF-IDF) model.
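This vectorization can be sketched with scikit-learn's TfidfVectorizer (an assumption; the patent only requires TF-IDF weights over the alarm-type vocabulary, and the context texts below are invented):

```python
# Sketch only: the vector of a context text has one TF-IDF weight per
# alarm type seen in training.
from sklearn.feature_extraction.text import TfidfVectorizer

training_contexts = [                              # invented context texts
    "scan scan info exploit",
    "scan info info",
    "exploit webshell exploit",
]

tfidf = TfidfVectorizer()
tfidf.fit(training_contexts)                       # vocabulary = alarm types

vec = tfidf.transform(["scan exploit exploit"]).toarray()[0]
# vector length equals the number of alarm types used in training
print(len(vec) == len(tfidf.vocabulary_))
```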
Step 105, determining the topic distribution vector corresponding to the type of the first alarm information, and determining the Euclidean distance value between it and the topic distribution vector of each context text.
After the LDA model is trained, the matrix distribution of each alarm type used for training over each topic is obtained. In this step, this matrix distribution of the pre-trained LDA model can be queried to determine the topic distribution vector of the type of the first alarm information, treated as a word of the model.
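A hypothetical sketch of this query, assuming scikit-learn's implementation, where `components_` holds the topic-word matrix and a word's normalized column gives its distribution over topics (corpus and type names are invented):

```python
# Sketch only: query the topic distribution of one alarm type ("word")
# from a trained LDA model.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["scan scan info exploit", "scan info info scan", "exploit exploit webshell"]
vectorizer = CountVectorizer()
doc_word = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_word)

# components_ has shape (n_topics, n_words); normalizing one word's
# column yields that word's distribution over topics.
col = lda.components_[:, vectorizer.vocabulary_["scan"]]
type_topic_vec = col / col.sum()
print(type_topic_vec.shape)  # (2,)
```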
Considering that low-risk alarm information, such as scanning and informational alarms, often has a relatively stable context, if the type of one alarm deviates greatly from the semantics of a context text (which can also be understood as the topic corresponding to that context text), the alarm is often high-risk alarm information hidden in a large amount of alarm information, possibly a real attack action by an attacker. The degree of semantic deviation of an alarm's type with respect to a context can therefore be measured to determine whether the alarm is high-risk.
In this step, the euclidean distance value between the topic distribution vector corresponding to the LDA model trained in advance and each context text as the document can be determined according to the determined type of the first alarm information as the topic distribution vector corresponding to the word.
Each Euclidean distance value (which may be denoted LK) can be understood as representing the degree of semantic deviation between the type of the first alarm information and the context text corresponding to that Euclidean distance value.
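As a minimal sketch of this deviation measure (assuming the topic distribution vectors are already available as numpy arrays, and assuming an illustrative set value of 0.5; all probability values below are made up):

```python
import numpy as np

def semantic_deviation(type_topic_vec, context_topic_vecs):
    # Euclidean distance between the alarm type's topic distribution and
    # each context text's topic distribution; larger means more deviant.
    return [float(np.linalg.norm(type_topic_vec - ctx)) for ctx in context_topic_vecs]

# Illustrative 3-topic distributions:
type_vec = np.array([0.8, 0.1, 0.1])
contexts = [np.array([0.7, 0.2, 0.1]),   # close to the alarm type
            np.array([0.1, 0.1, 0.8])]   # deviates strongly
distances = semantic_deviation(type_vec, contexts)
flagged = [d for d in distances if d > 0.5]  # 0.5 is an assumed set value
```

A distance above the set value would then trigger the context anomaly tag of step 106.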
And 106, if at least one Euclidean distance value is larger than the set value, generating a context anomaly tag corresponding to each Euclidean distance value larger than the set value for the first alarm information.
In this step, for each Euclidean distance value greater than the set value, a corresponding context anomaly tag may be generated for the first alarm information, indicating that the type of the currently received alarm information deviates strongly in semantics from the context text corresponding to that Euclidean distance value, and that the alarm information may therefore be high-risk alarm information.
Further, the present embodiment may further include the following steps:
step 107, outputting relevant prompt information for each Euclidean distance value larger than the set value.
In this step, for each Euclidean distance value greater than the set value, the semantic description corresponding to a specified topic is obtained according to the prior manual labeling result of the pre-trained LDA model, where the specified topic is the topic corresponding to the context text corresponding to that Euclidean distance value. The Euclidean distance value, the context anomaly tag corresponding to it, and the semantic description of the specified topic are then output. This prompts the operator with the context text from which the type of the currently received alarm information deviates strongly, the actual alarm meaning of the topic of that context text, and the degree of semantic deviation between the alarm information type and the context text.
The topic corresponding to one context text is determined according to the topic distribution vector corresponding to the context text serving as a document. It can be understood that, in the topic distribution vector corresponding to one context text, the topic with the largest corresponding generation probability is the topic corresponding to the context text.
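For instance (a hedged sketch with illustrative probabilities, echoing row D1 of Table 2), the topic of a context text is simply the index of the largest entry in its topic distribution vector:

```python
import numpy as np

# Topic distribution vector of one context text as a document (illustrative values):
doc_topic_vec = np.array([0.76, 0.10, 0.02, 0.12])

# The topic with the largest generation probability is the context text's topic:
topic_index = int(np.argmax(doc_topic_vec))
# Here topic_index is 0, i.e. the first topic generates this context text
```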
The output related prompt information can realize or assist safety operators to realize classification or classified filtration of the alarm information, and the alarm information with the highest risk and most attention value can be quickly found.
It should be noted that, in one possible implementation, in order to ensure the accuracy of determining whether the currently received alarm information is high-risk alarm information, step 103' may be performed after step 103 and before step 104:
step 103', determining whether at least one of the determined context texts has a length smaller than a threshold value.
If at least one of the determined context texts has a length smaller than the threshold value, the set duration may be increased, and the execution returns to step 102, otherwise, step 104 may be continued. Therefore, the situation that the context text is too short can be avoided, and the accuracy of judging whether the currently received alarm information is high risk alarm information is ensured through the sufficiently long context text.
In addition, in this embodiment, if the number of received alarm information whose type does not belong to the alarm information types used for training the pre-trained LDA model reaches a threshold value, a prompt may be issued indicating that the LDA model needs to be retrained.
That is, if a large number of alarm information of types not belonging to the alarm information types for training the LDA model are found, a round of training can be performed on the LDA model again, so that the trained LDA model can better adapt to the alarm information, and has wider applicability.
In addition, it should be noted that, in this embodiment, whenever new alarm information arrives, it changes the context semantics of the alarm information associated with it. Evaluating the context anomaly of alarm information is therefore a continuous process: besides real-time evaluation of whether alarm information is high risk, dynamic re-evaluation of whether alarm information is high risk can also be implemented.
In this embodiment, after the first alert information is currently received, it may also be determined, for each piece of second alert information, again whether a context anomaly tag needs to be generated for the alert information.
It can be understood that, according to the first alarm information and the second alarm information, for each piece of second alarm information, at least one context text corresponding to the second alarm information is determined, and each context text is determined as a topic distribution vector corresponding to the document by using an LDA model trained in advance.
And determining the type of the second alarm information as a topic distribution vector corresponding to the word aiming at the LDA model trained in advance, and respectively determining the Euclidean distance value between the topic distribution vector and the topic distribution vector corresponding to each context text as a document.
And if at least one Euclidean distance value is larger than the set value, generating a context anomaly tag corresponding to each Euclidean distance value larger than the set value for the second alarm information.
The context text corresponding to each piece of second alarm information can be understood as being formed according to the first alarm information and the alarm statement corresponding to each piece of second alarm information.
Next, a training process of the LDA model will be described.
And step one, obtaining a training sample.
The training process for the LDA model first requires obtaining training samples.
In the process of obtaining the training samples, the batch of alarm information used for training can be grouped according to a set period T based on the time at which each piece of alarm information was received.
For each group of alarm information, at least one alarm sentence can be formed: alarm information with the same source IP address and destination IP address is arranged in time order, and the sequence of the corresponding alarm information types forms one alarm sentence. Here the alarm information types corresponding to the alarm information used for training can be understood as the words of the LDA model, and an alarm sentence is formed from these words.
After forming at least one alert sentence for each set of alert information, at least one context text corresponding to each alert information in the set of alert information may be further determined.
Each context text corresponding to one piece of alarm information can be understood as being formed according to the alarm statement corresponding to the alarm information of the group in which the alarm information is located. Each context text may be understood as a document to which the LDA model corresponds.
In one possible implementation manner, three context texts corresponding to each piece of alarm information can be determined, and the three context texts are respectively a source context text, a destination context text and a source-destination context text, so that a corpus formed by multi-dimensional context texts formed by integrating alarm sentences is obtained.
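A hedged sketch of how the three context texts might be assembled from time-ordered alarm records (the record fields and IP values below are hypothetical, chosen only to illustrate the grouping):

```python
from collections import defaultdict

def build_context_texts(alerts):
    # alerts: list of (timestamp, src_ip, dst_ip, alert_type) tuples.
    # Returns source, destination, and source-destination context texts,
    # each a space-joined sequence of alert types (the LDA "words").
    src_ctx, dst_ctx, pair_ctx = defaultdict(list), defaultdict(list), defaultdict(list)
    for _, src, dst, a_type in sorted(alerts):  # sort by timestamp
        src_ctx[src].append(a_type)
        dst_ctx[dst].append(a_type)
        pair_ctx[(src, dst)].append(a_type)
    join = lambda d: {k: " ".join(v) for k, v in d.items()}
    return join(src_ctx), join(dst_ctx), join(pair_ctx)

alerts = [(1, "10.0.0.1", "10.0.0.9", "scan"),
          (2, "10.0.0.1", "10.0.0.8", "scan"),
          (3, "10.0.0.2", "10.0.0.9", "exploit")]
src, dst, pair = build_context_texts(alerts)
# src["10.0.0.1"] == "scan scan"; dst["10.0.0.9"] == "scan exploit"
```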
After obtaining the corpus of context texts, vectorization can be performed on each context text in the corpus to obtain a corpus of vector representations.
The vector length of each context text is the number of types of the warning information for training, and the vector value is the weight value of each type of warning information for training, which is obtained according to the TF-IDF model, in the context text.
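A self-contained sketch of this vectorization (the IDF smoothing variant and the alarm-type names are assumptions; real TF-IDF libraries offer several weighting variants):

```python
import math
from collections import Counter

def tfidf_vectors(corpus_tokens, vocab):
    # corpus_tokens: list of context texts, each a list of alarm-type words.
    # Returns one vector per context text; vector length == len(vocab),
    # i.e. the number of alarm information types used for training.
    n_docs = len(corpus_tokens)
    df = {w: sum(1 for doc in corpus_tokens if w in doc) for w in vocab}
    vectors = []
    for doc in corpus_tokens:
        counts = Counter(doc)
        vec = []
        for w in vocab:
            tf = counts[w] / len(doc)
            idf = math.log(n_docs / (1 + df[w])) + 1  # smoothed IDF (one common variant)
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

vocab = ["scan", "exploit", "ddos", "info"]   # the alarm-type "words" (illustrative)
corpus = [["scan", "scan", "exploit"],
          ["exploit", "ddos"],
          ["scan", "info", "info"]]
vectors = tfidf_vectors(corpus, vocab)
# Each vector has length 4, the number of alarm information types
```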
In one possible implementation, after obtaining the context text, the obtained context text may also be preprocessed to ensure the accuracy of the trained LDA model.
For example, for the types of alarm information with occurrence frequency lower than the set frequency in the batch of alarm information for training, each context text corresponding to the type of alarm information in the corpus can be copied to obtain a set number of context texts. Of course, in another possible implementation manner, the type of alert information with the occurrence frequency lower than the set frequency may be directly discarded (in this case, it may be understood that the context text corresponding to the type of alert information is not obtained in the corpus).
For another example, for the continuously repeated alarm information types in an alarm sentence corresponding to a context text in the corpus, only the first alarm information type in the continuously repeated alarm information types can be reserved, and the subsequently occurring alarm information types can be deleted.
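A minimal sketch of this consecutive-repeat filtering (assuming an alarm sentence is represented as a list of alarm-type strings):

```python
from itertools import groupby

def collapse_repeats(alert_types):
    # Keep only the first of each run of consecutively repeated alarm types.
    return [key for key, _ in groupby(alert_types)]

sentence = ["scan", "scan", "scan", "exploit", "scan", "exploit", "exploit"]
collapsed = collapse_repeats(sentence)
# collapsed == ["scan", "exploit", "scan", "exploit"]
```

Note that only consecutive duplicates are removed; a type that reappears later in the sentence is kept.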
For another example, each context text in the corpus whose length is smaller than the set length may be spliced with the corresponding context text of adjacently received alarm information. For example, if the length of the source context text corresponding to alarm information 1 is smaller than the set length, it may be spliced with the source context text corresponding to alarm information 2, where alarm information 2 was received adjacently before alarm information 1.
And secondly, training the LDA model.
After obtaining the corpus represented by the vectors, the number of topics (for example, set to K) of the LDA model can be set, the vector corresponding to each context text is used as one training sample in the training sample set, and unsupervised training is performed on the pre-established LDA model to obtain the trained LDA model.
The trained LDA model can learn to obtain the generation probability between the theme and the word (the type of the alarm information) and the generation probability between the theme and the document (the context text) according to the context text.
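An illustrative training sketch using scikit-learn (an assumption; the embodiment does not name a library). The embodiment feeds TF-IDF vectors to the LDA model, while for simplicity this sketch uses raw word counts, which is what scikit-learn's LDA implementation expects:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative corpus: each context text is a space-joined alarm-type sequence
corpus = ["scan scan exploit",
          "exploit ddos ddos",
          "scan info info",
          "ddos exploit scan"]
K = 2  # assumed number of topics

X = CountVectorizer(token_pattern=r"\S+").fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topic = lda.fit_transform(X)   # rows correspond to Table 2: document-topic probabilities
topic_word = lda.components_       # rows correspond to Table 1: (unnormalized) topic-word weights
```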
The generation probabilities learned by the trained LDA model between topics (which may be denoted T1, T2, ……, TK) and words (which may be denoted W1, W2, ……, WM) may be as shown in Table 1, where the number of alarm information types in the batch of alarm information used for training is assumed to be M.
TABLE 1
      T1     T2     T3     ……   TK
W1    0.1    0.1    0.6    ……   0.1
W2    0.4    0.1    0.2    ……   0.13
……    ……     ……     ……     ……   ……
WM    0.23   0.45   0.01   ……   0.1
The generation probabilities learned by the trained LDA model between topics and documents (which may be denoted D1, D2, ……, DN) may be as shown in Table 2, where the number of context texts in the corpus is assumed to be N.
TABLE 2
      T1     T2     T3     ……   TK
D1    0.76   0.1    0.02   ……   0.2
D2    0.3    0.67   0.01   ……   0.03
……    ……     ……     ……     ……   ……
DN    0.01   0.1    0.2    ……   0.34
Thirdly, manually marking the trained LDA model.
In this embodiment, the number of topics of the LDA model is set to K, which can be understood as the LDA model assuming that each document in the corpus is generated by K topics with certain probabilities. For example, a certain context may consist of four topics: "host scan", "exploit", "DDoS attack", and "information theft". Each topic in turn corresponds, with a certain probability, to certain words, i.e., certain alarm information types.
The trained LDA model is only responsible for producing the probability distributions among topics, documents, and words; the semantic descriptions, such as the connotation and name corresponding to each topic, need to be marked manually. That is, after the LDA model is trained, the semantic descriptions corresponding to the K topics T1, T2, ……, TK may be determined by means of manual labeling, so that a manually labeled trained LDA model is obtained whose topics have actual alarm meanings.
A schematic flow diagram of the trained LDA model obtained comprising the manual labeling process may be as shown in fig. 4.
The generation probabilities between topics and words shown in Table 1, corresponding to the trained LDA model, can be queried for the currently received alarm information to obtain the topic distribution vector of its type, treated as a word of the LDA model.
In addition, each context text (corresponding vector) corresponding to the currently received alarm information is used as input, and the LDA model can be used for determining that each context text is used as the topic distribution vector corresponding to the document.
In addition, for each Euclidean distance value greater than the set value, after the topic corresponding to the context text corresponding to that Euclidean distance value is determined according to the topic distribution vector of the context text as a document, the prior manual labeling result of the trained LDA model can be queried, so that the user obtains the actual alarm meaning of the topic corresponding to the context text of the high-risk alarm information.
According to the scheme provided by the embodiment of the present invention, the alarm information context can be modeled and analyzed based on a statistical language model, and the contextual semantic information at the time the alarm information occurs can be accurately described, so that the internal rules of the alarm information context are automatically mined in a data-driven manner. Expert labeling and an anomaly handling mechanism can further be integrated to automatically identify alarm information that deviates from its context semantics, evaluate the degree of anomaly of the alarm information, and provide context anomaly tags for realizing, or assisting in realizing, the classification and grading of alarm information.
This further helps to manually identify high-risk or noteworthy alarm information, effectively addresses the inefficiency of static alarm grading schemes, improves alarm processing efficiency in security operations, shortens the threat-event analysis and response cycle, and improves protection capability.
According to the scheme provided by the embodiment of the present invention, high-risk alarm data can be effectively screened out of a large amount of alarm data, the misleading effect of falsely reported high-risk alarm information on security operations is reduced, and the signal-to-noise ratio of high-risk alarm data in the security operations center is improved.
Corresponding to the provided method, the following apparatus is further provided.
An embodiment of the present invention provides an alarm information marking apparatus, where the structure of the apparatus may be as shown in fig. 5, including:
the analysis module 12 is configured to determine, if it is determined that the type of the first alarm information currently received belongs to one of the types of alarm information for training the latent dirichlet allocation LDA model trained in advance, second alarm information received within a set period of time before the first alarm information is received; determining at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and determining each context text as a topic distribution vector corresponding to a document by using the LDA model; determining the type of the first alarm information as a topic distribution vector corresponding to the word aiming at the LDA model, and respectively determining the Euclidean distance value between the topic distribution vector and the topic distribution vector corresponding to each context text as a document;
the marking module 13 is configured to generate, for the first alert information, context exception tags corresponding to euclidean distance values each greater than a set value if at least one euclidean distance value is greater than the set value.
Wherein, the device may further include a judging module 11:
The judging module 11 is configured to determine whether the type of the currently received first alarm information belongs to one of the types of alarm information for training the pre-trained latent dirichlet allocation LDA model;
Here, "if it is determined that the type of the currently received first alarm information belongs to one of the types of alarm information used for training the pre-trained latent dirichlet allocation LDA model" may be understood as "if the judging module 11 determines that the type of the currently received first alarm information belongs to one of the types of alarm information used for training the pre-trained latent dirichlet allocation LDA model".
And each context text corresponding to the first alarm information is formed according to alarm sentences corresponding to the first alarm information and the second alarm information, and one alarm sentence is formed according to the alarm information with the same source internet protocol address and the same destination internet protocol address and the types of the alarm information are arranged according to time sequence.
Optionally, the apparatus further comprises an output module 14:
the output module 14 is configured to obtain, for each euclidean distance value greater than a set value, a semantic description corresponding to a specified topic according to a pre-manual labeling result of the LDA model, where the specified topic is a topic corresponding to a context text corresponding to the euclidean distance value; and outputting the Euclidean distance value, a context exception label corresponding to the Euclidean distance value and a semantic description corresponding to a specified subject corresponding to the Euclidean distance value;
The topic corresponding to one context text is determined according to the topic distribution vector corresponding to the context text serving as a document.
Optionally, the at least one context text includes a source context text, a destination context text, and a source-destination context text;
the source context text is formed according to an alarm statement that the source internet protocol address is the same as the first alarm information;
the destination context text is formed according to an alarm statement that the destination Internet protocol address is the same as the first alarm information;
the source-destination context text is formed according to an alert sentence in which both a source internet protocol address and a destination internet protocol address are identical to the first alert information.
Optionally, the analysis module 12 determines, using the LDA model, each context text as a topic distribution vector corresponding to a document, including:
determining a vector corresponding to each context text, and determining the context text as a topic distribution vector corresponding to the document by using the LDA model according to the vector;
the vector length corresponding to one context text is the number of types of the alarm information for training the LDA model, and the vector value is a weight value of each type of the alarm information for training the LDA model, which is obtained according to a word frequency inverse text frequency index TF-IDF model, in the context text.
Optionally, the analysis module 12 is further configured to: after determining at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and before determining each context text as a topic distribution vector corresponding to a document using the LDA model, if the length of at least one of the determined context texts is less than a threshold, increase the set duration and return to determining the second alarm information received within the set duration before the first alarm information is received.
Optionally, the analysis module 12 is further configured to prompt that the LDA model needs to be retrained if the number of alarm information that does not belong to one of the types of alarm information for training the LDA model trained in advance reaches a threshold value.
Optionally, the analysis module 12 is further configured to determine, for each piece of second alert information, at least one context text corresponding to the second alert information according to the first alert information and the second alert information, and determine, using the LDA model, each context text as a topic distribution vector corresponding to a document;
determining the type of the second alarm information as a topic distribution vector corresponding to the word aiming at the LDA model, and respectively determining the Euclidean distance value between the topic distribution vector and each context text as a topic distribution vector corresponding to the document;
The marking module 13 is further configured to generate, for the second alarm information, a context anomaly tag corresponding to each euclidean distance value greater than the set value if at least one euclidean distance value is greater than the set value;
and forming each context text corresponding to each piece of second alarm information according to the first alarm information and the alarm statement corresponding to each piece of second alarm information.
The functions of the functional units of each device provided in the foregoing embodiments of the present invention may be implemented by the steps of the corresponding methods, so that the specific working process and the beneficial effects of each functional unit in each device provided in the embodiments of the present invention are not repeated herein.
Based on the same inventive concept, embodiments of the present invention provide the following apparatuses and media.
The embodiment of the invention provides an alarm information marking device, which can be structured as shown in fig. 6, and comprises a processor 21, a communication interface 22, a memory 23 and a communication bus 24, wherein the processor 21, the communication interface 22 and the memory 23 complete communication with each other through the communication bus 24;
the memory 23 is used for storing a computer program;
the processor 21 is configured to implement the steps described in the above method embodiments of the present invention when executing the program stored in the memory.
Alternatively, the processor 21 may specifically include a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits for controlling program execution, a hardware circuit developed using a field-programmable gate array (FPGA), or a baseband processor.
Alternatively, the processor 21 may comprise at least one processing core.
Alternatively, the memory 23 may include a read-only memory (ROM), a random access memory (RAM), and a disk memory. The memory 23 is used for storing data required for the operation of the at least one processor 21. The number of memories 23 may be one or more.
The embodiment of the invention also provides a non-volatile computer storage medium, which stores an executable program, and when the executable program is executed by a processor, the method provided by the embodiment of the method of the invention is realized.
In a specific implementation, the computer storage medium may include: a universal serial bus (USB) flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In the embodiments of the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, e.g., the division of the units or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, indirect coupling or communication connection of devices or units, electrical or otherwise.
The functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be an independent physical module.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. With such understanding, all or part of the technical solution of the embodiments of the present invention may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device, which may be, for example, a personal computer, a server, or a network device, or a processor (processor), to perform all or part of the steps of the method described in the embodiments of the present invention. And the aforementioned storage medium includes: universal serial bus flash disk (Universal Serial Bus Flash Drive), removable hard disk, ROM, RAM, magnetic or optical disk, or other various media capable of storing program code.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A method for marking alert information, the method comprising:
if it is determined that the type of the currently received first alarm information belongs to one of the types of alarm information used for training a pre-trained latent dirichlet allocation LDA model, determining second alarm information received within a set duration before the first alarm information is received;
determining at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and determining each context text as a topic distribution vector corresponding to a document by using the LDA model;
determining the type of the first alarm information as a topic distribution vector corresponding to the word aiming at the LDA model, and respectively determining the Euclidean distance value between the topic distribution vector and the topic distribution vector corresponding to each context text as a document;
And if at least one Euclidean distance value is larger than the set value, generating a context anomaly tag corresponding to each Euclidean distance value larger than the set value for the first alarm information.
2. The method of claim 1, wherein the method further comprises: for each Euclidean distance value greater than a set value, acquiring, according to a prior manual labeling result of the LDA model, a semantic description corresponding to a specified topic, wherein the specified topic is the topic corresponding to the context text corresponding to the Euclidean distance value; and,
outputting the Euclidean distance value, a context exception label corresponding to the Euclidean distance value and a semantic description corresponding to a specified subject corresponding to the Euclidean distance value;
the topic corresponding to one context text is determined according to the topic distribution vector corresponding to the context text serving as a document.
3. The method of claim 1, wherein the at least one context text comprises a source context text, a destination context text, and a source-destination context text;
the source context text is formed according to an alarm statement that the source internet protocol address is the same as the first alarm information;
the destination context text is formed according to an alarm statement that the destination Internet protocol address is the same as the first alarm information;
The source-destination context text is formed according to an alert sentence in which both a source internet protocol address and a destination internet protocol address are identical to the first alert information.
4. The method of claim 1, wherein determining, using the LDA model, the topic distribution vector of each context text as a document comprises:
determining a vector corresponding to each context text, and determining, using the LDA model and according to that vector, the topic distribution vector of the context text as a document;
wherein the length of the vector corresponding to a context text equals the number of alarm information types used to train the LDA model, and each vector value is the weight, in the context text, of the corresponding alarm information type used to train the LDA model, obtained with a term frequency-inverse document frequency (TF-IDF) model.
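The vector construction of claim 4 can be sketched as below, with alarm types playing the role of words and context texts the role of documents. The smoothed IDF formula used here is one common variant (an assumption; the claim does not fix a particular TF-IDF formula).

```python
import math
from collections import Counter

def tfidf_vectors(context_texts, alarm_types):
    """One TF-IDF weight vector per context text; vector length equals the
    number of alarm types used to train the LDA model."""
    n_docs = len(context_texts)
    df = Counter()                      # document frequency per alarm type
    for doc in context_texts:
        df.update(set(doc))
    vectors = []
    for doc in context_texts:
        tf = Counter(doc)               # term frequency within this context text
        total = len(doc) or 1
        vectors.append([
            (tf[t] / total) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t in alarm_types
        ])
    return vectors

types = ["scan", "brute_force", "exfil"]
vecs = tfidf_vectors([["scan", "scan", "brute_force"], ["exfil"]], types)
```

The resulting fixed-length vectors can then be fed to the trained LDA model to infer each context text's topic distribution.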
5. The method of claim 1, wherein, after determining the at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and before determining, using the LDA model, the topic distribution vector of each context text as a document, the method further comprises:
if the length of at least one of the determined context texts is smaller than a threshold, increasing the set duration and returning to the step of determining the second alarm information received within the set duration before the first alarm information was received.
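The retry loop of claim 5 can be sketched as follows. The doubling growth, the cap, and the callback signatures are assumptions for illustration; the claim only requires that the set duration be increased and the collection step repeated.

```python
def collect_contexts(fetch_alarms, build_contexts, first_alarm,
                     duration=60, min_len=3, max_duration=960):
    """Widen the look-back window until every context text reaches the
    minimum length, or a cap on the window size is hit."""
    while True:
        second_alarms = fetch_alarms(duration)          # alarms in the window
        contexts = build_contexts(first_alarm, second_alarms)
        done = all(len(c) >= min_len for c in contexts.values())
        if done or duration >= max_duration:
            return contexts, duration
        duration *= 2                                   # increase the set duration

# Stub callbacks: longer windows yield more alarms.
fetch = lambda d: ["alarm"] * (d // 30)
build = lambda first, second: {"source": second}
ctx, dur = collect_contexts(fetch, build, first_alarm=None)
```

Capping the window prevents an unbounded loop when the environment simply has too few related alarms.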
6. The method of claim 1, wherein the method further comprises:
if the amount of alarm information whose type does not belong to the types of alarm information used to train the pre-trained LDA model reaches a threshold, prompting that the LDA model needs to be retrained.
7. The method of any one of claims 1-6, further comprising:
determining at least one context text corresponding to the second alarm information according to the first alarm information and the second alarm information, and determining, using the LDA model, the topic distribution vector of each context text as a document;
determining, for the LDA model, the topic distribution vector in which the type of the second alarm information serves as a word, and respectively determining the Euclidean distance value between that topic distribution vector and the topic distribution vector of each context text as a document;
if at least one Euclidean distance value is greater than the set value, generating, for the second alarm information, a context anomaly tag corresponding to each Euclidean distance value greater than the set value;
wherein each context text corresponding to each piece of second alarm information is formed according to the alarm statements corresponding to the first alarm information and to each piece of second alarm information.
8. An alert information marking apparatus, the apparatus comprising:
the analysis module is configured to: if the type of the currently received first alarm information is determined, determine the second alarm information received within a set duration before the first alarm information was received, the second alarm information belonging to one of the types of alarm information used to train a pre-trained latent Dirichlet allocation (LDA) model; determine at least one context text corresponding to the first alarm information according to the first alarm information and the second alarm information, and determine, using the LDA model, the topic distribution vector of each context text as a document; and determine, for the LDA model, the topic distribution vector in which the type of the first alarm information serves as a word, and respectively determine the Euclidean distance value between that topic distribution vector and the topic distribution vector of each context text as a document;
and the marking module is configured to generate, for the first alarm information, a context anomaly tag corresponding to each Euclidean distance value greater than the set value, if at least one Euclidean distance value is greater than the set value.
9. A non-transitory computer storage medium storing an executable program which, when executed by a processor, implements the method of any one of claims 1 to 7.
10. An alarm information marking device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program;
and the processor is configured to implement the method steps of any one of claims 1 to 7 when executing the program stored in the memory.
CN202011614604.7A 2020-12-30 2020-12-30 Alarm information marking method, device, medium and equipment Active CN112632966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011614604.7A CN112632966B (en) 2020-12-30 2020-12-30 Alarm information marking method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011614604.7A CN112632966B (en) 2020-12-30 2020-12-30 Alarm information marking method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN112632966A CN112632966A (en) 2021-04-09
CN112632966B true CN112632966B (en) 2023-07-21

Family

ID=75286991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011614604.7A Active CN112632966B (en) 2020-12-30 2020-12-30 Alarm information marking method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN112632966B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114548997A (en) * 2021-11-19 2022-05-27 中国建设银行股份有限公司 Alarm information verification method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776503A * 2016-12-22 2017-05-31 东软集团股份有限公司 Method and device for determining text semantic similarity
CN107423282A * 2017-05-24 2017-12-01 南京大学 Joint extraction method of semantically coherent topics and word vectors in text based on composite features
CN108062307A * 2018-01-04 2018-05-22 中国科学技术大学 Text semantic steganalysis method based on word embedding model
CN108984526A * 2018-07-10 2018-12-11 北京理工大学 Document topic vector extraction method based on deep learning
CN111368532A (en) * 2020-03-18 2020-07-03 昆明理工大学 A Disambiguation Method and System for Subject Word Embedding Based on LDA

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270096A1 (en) * 2015-08-04 2017-09-21 Yissum Research Development Company Of The Hebrew University Of Jerusalem Ltd. Method and system for generating large coded data set of text from textual documents using high resolution labeling
CN105677769B * 2015-12-29 2018-01-05 广州神马移动信息科技有限公司 Keyword recommendation method and system based on a latent Dirichlet allocation (LDA) model
US10216724B2 (en) * 2017-04-07 2019-02-26 Conduent Business Services, Llc Performing semantic analyses of user-generated textual and voice content
US11023682B2 (en) * 2018-09-30 2021-06-01 International Business Machines Corporation Vector representation based on context
US10892056B2 (en) * 2018-11-16 2021-01-12 International Business Machines Corporation Artificial intelligence based alert system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776503A * 2016-12-22 2017-05-31 东软集团股份有限公司 Method and device for determining text semantic similarity
CN107423282A * 2017-05-24 2017-12-01 南京大学 Joint extraction method of semantically coherent topics and word vectors in text based on composite features
CN108062307A * 2018-01-04 2018-05-22 中国科学技术大学 Text semantic steganalysis method based on word embedding model
CN108984526A * 2018-07-10 2018-12-11 北京理工大学 Document topic vector extraction method based on deep learning
CN111368532A (en) * 2020-03-18 2020-07-03 昆明理工大学 A Disambiguation Method and System for Subject Word Embedding Based on LDA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Text semantic classification algorithm based on risk decision; Cheng Yusheng et al.; Journal of Computer Applications; Vol. 36, No. 11; pp. 2963-2968 *

Also Published As

Publication number Publication date
CN112632966A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN111813960B (en) Data security audit model device, method and terminal equipment based on knowledge graph
CN108874777B (en) Text anti-spam method and device
Gharge et al. An integrated approach for malicious tweets detection using NLP
US12131121B2 (en) Detecting information operations campaigns in social media with machine learning
CN111866004B (en) Security assessment method, apparatus, computer system, and medium
US11886597B2 (en) Detection of common patterns in user generated content with applications in fraud detection
CN110020430B (en) Malicious information identification method, device, equipment and storage medium
KR101594452B1 (en) An apparatus for identifying a rumor of online posting
Kodati et al. Detection of fake profiles on twitter using hybrid svm algorithm
Al Maruf et al. Ensemble approach to classify spam sms from bengali text
US20250103722A1 (en) Hierarchical representation models
Verma Detection of phishing in mobile instant messaging using natural language processing and machine learning
CN112632966B (en) Alarm information marking method, device, medium and equipment
CN113157993A (en) An early warning model of network water army behavior based on polarization analysis of timing diagrams
Burra et al. Restaurant reviews sentimental analysis using machine learning approach
CN115408524A (en) Information classification model training method and device, electronic equipment and storage medium
Divakarla et al. Predicting phishing emails and websites to fight cybersecurity threats using machine learning algorithms
Sujith et al. Twitter bot detection and ranking using supervised machine learning models
CN116366303B (en) Network anomaly detection method, device, equipment and medium based on deep learning
CN118413376A (en) Traffic research and judgment method, device, equipment and medium
CN110414251B (en) Data monitoring method and device
CN115964478A (en) Network attack detection method, model training method and device, equipment and medium
Batra et al. CovFakeBot: a machine learning based chatbot using ensemble learning technique for COVID-19 fake news detection
Khan et al. Detecting Phishing Attacks using NLP
Syam et al. Graph Neural Network and Natural Language Processing Integrated Methodology for Unmasking Scam Propagation in Social Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant