
CN117978499B - System and method for identifying converged communication malicious data based on AI intelligence - Google Patents

Info

Publication number
CN117978499B
Authority
CN
China
Prior art keywords
mail
text content
training
semantic
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410143834.1A
Other languages
Chinese (zh)
Other versions
CN117978499A (en)
Inventor
王友峰
刘晓周
李晓凡
王崇斌
王红
潘星
王劲航
王一杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Ruixin Technology Development Co ltd
Original Assignee
Shaanxi Ruixin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Ruixin Technology Development Co ltd
Priority to CN202410143834.1A
Publication of CN117978499A
Application granted
Publication of CN117978499B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/42 Mailbox-related aspects, e.g. synchronisation of mailboxes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/40 Network security protocols

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Transfer Between Computers (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Disclosed are an AI-based converged communication malicious data identification system and method. First, a detected mail is acquired and mail text content is extracted from it. The mail text content is then segmented to obtain a sequence of mail text content segments. Segment-granularity semantic analysis is performed on each segment in the sequence to obtain a sequence of mail text content segment granularity semantic understanding feature vectors. Full text semantic association coding is then performed on that sequence of feature vectors to obtain mail text content full text semantic understanding features. Finally, whether the detected mail is spam is determined based on the mail text content full text semantic understanding features. In this way, communication security and efficiency can be improved.

Description

System and method for identifying converged communication malicious data based on AI intelligence
Technical Field
The application relates to the field of malicious data identification, and in particular to an AI-based converged communication malicious data identification system and method.
Background
With the development of the internet, email has become an important tool for daily communication and work. However, email is also plagued by spam, which not only occupies network resources and degrades the user experience but may also carry malicious information and thereby pose a security threat to users. Effectively identifying and filtering spam is therefore of great importance.
At present, spam identification is mainly rule-based or statistical. Such methods usually require features or rules to be defined manually and large amounts of data to be labeled, so they are inefficient, have difficulty capturing the semantic and topic information of mail content, and cannot adapt to the rapid change and diversity of malicious data.
Therefore, an AI-based converged communication malicious data identification system is desired.
Disclosure of Invention
In view of this, the present application proposes an AI-based converged communication malicious data identification system and method. A detected mail is acquired and its text content is extracted; an artificial-intelligence-based data processing and semantic understanding technique is then applied at the back end to semantically analyze and understand the mail text content, so that the topic of the detected mail is identified and it is determined whether the mail is spam.
According to an aspect of the present application, there is provided an AI-based converged communication malicious data identification system, including:
a mail acquisition module, configured to acquire a detected mail;
a mail text content extraction module, configured to extract mail text content from the detected mail;
a content segmentation module, configured to segment the mail text content to obtain a sequence of mail text content segments;
a mail text content segment granularity semantic analysis module, configured to perform segment-granularity semantic analysis on each mail text content segment in the sequence of mail text content segments to obtain a sequence of mail text content segment granularity semantic understanding feature vectors;
a mail text content full text semantic understanding module, configured to perform full text semantic association coding on the sequence of mail text content segment granularity semantic understanding feature vectors to obtain mail text content full text semantic understanding features; and
a mail subject classification module, configured to determine whether the detected mail is spam based on the mail text content full text semantic understanding features.
In the above AI-based converged communication malicious data identification system, the mail text content segment granularity semantic analysis module is configured to:
pass each mail text content segment in the sequence of mail text content segments through a segment semantic encoder comprising a word embedding layer and a text convolutional neural network model to obtain the sequence of mail text content segment granularity semantic understanding feature vectors.
In the above AI-based converged communication malicious data identification system, the mail text content full text semantic understanding module is configured to:
pass the sequence of mail text content segment granularity semantic understanding feature vectors through a context encoder based on a Transformer module to obtain a mail text content full text semantic understanding feature vector as the mail text content full text semantic understanding features.
In the above AI-based converged communication malicious data identification system, the mail subject classification module is configured to:
pass the mail text content full text semantic understanding feature vector through a classifier to obtain a classification result, wherein the classification result indicates whether the detected mail is spam.
The AI-based converged communication malicious data identification system further includes a training module for training the segment semantic encoder comprising the word embedding layer and the text convolutional neural network model, the context encoder based on the Transformer module, and the classifier.
In the above AI-based converged communication malicious data identification system, the training module includes:
a training data acquisition unit, configured to acquire training data, wherein the training data includes a training detected mail and a true value of whether the training detected mail is spam;
a training mail text content extraction unit, configured to extract training mail text content from the training detected mail;
a training content segmentation unit, configured to segment the training mail text content to obtain a sequence of training mail text content segments;
a training mail text content segment granularity semantic analysis unit, configured to pass each training mail text content segment in the sequence of training mail text content segments through the segment semantic encoder comprising the word embedding layer and the text convolutional neural network model to obtain a sequence of training mail text content segment granularity semantic understanding feature vectors;
a training mail text content full text semantic understanding unit, configured to pass the sequence of training mail text content segment granularity semantic understanding feature vectors through the context encoder based on the Transformer module to obtain a training mail text content full text semantic understanding feature vector;
a training classification loss unit, configured to pass the training mail text content full text semantic understanding feature vector through the classifier to obtain a classification loss function value; and
a loss training unit, configured to train the segment semantic encoder comprising the word embedding layer and the text convolutional neural network model, the context encoder based on the Transformer module, and the classifier using the classification loss function value, wherein in each iteration of training, the training mail text content full text semantic understanding feature vector is corrected.
In the above AI-based converged communication malicious data identification system, the training classification loss unit is configured to:
process the training mail text content full text semantic understanding feature vector with the classifier according to the following classification training formula to obtain a training classification result;
wherein the classification training formula is: $O = \mathrm{softmax}\{(W_n, B_n) : \cdots : (W_1, B_1) \mid X\}$, where $W_1$ to $W_n$ are weight matrices, $B_1$ to $B_n$ are bias vectors, and $X$ is the training mail text content full text semantic understanding feature vector; and
calculate a cross-entropy value between the training classification result and the true value as the classification loss function value.
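The cross-entropy computation can be illustrated with a minimal sketch. It assumes the training classification result is a probability distribution over the two labels and the true value is the index of the correct label; the function name `cross_entropy` and the `eps` clamp are illustrative choices, not taken from the application.

```python
import math


def cross_entropy(predicted, true_index, eps=1e-12):
    """Cross-entropy between a predicted distribution and a one-hot true label.

    predicted  -- class probabilities, e.g. (p_spam, p_not_spam)
    true_index -- index of the ground-truth class
    eps        -- assumed lower clamp that avoids log(0)
    """
    # With a one-hot target, only the probability of the true class contributes.
    return -math.log(max(predicted[true_index], eps))
```

The loss is zero when the classifier assigns probability one to the true label and grows as that probability shrinks.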
According to another aspect of the present application, there is provided an AI-based converged communication malicious data identification method, including:
acquiring a detected mail;
extracting mail text content from the detected mail;
segmenting the mail text content to obtain a sequence of mail text content segments;
performing segment-granularity semantic analysis on each mail text content segment in the sequence of mail text content segments to obtain a sequence of mail text content segment granularity semantic understanding feature vectors;
performing full text semantic association coding on the sequence of mail text content segment granularity semantic understanding feature vectors to obtain mail text content full text semantic understanding features; and
determining whether the detected mail is spam based on the mail text content full text semantic understanding features.
In the above AI-based converged communication malicious data identification method, performing segment-granularity semantic analysis on each mail text content segment in the sequence of mail text content segments to obtain a sequence of mail text content segment granularity semantic understanding feature vectors includes:
passing each mail text content segment in the sequence of mail text content segments through a segment semantic encoder comprising a word embedding layer and a text convolutional neural network model to obtain the sequence of mail text content segment granularity semantic understanding feature vectors.
In the above AI-based converged communication malicious data identification method, performing full text semantic association coding on the sequence of mail text content segment granularity semantic understanding feature vectors to obtain mail text content full text semantic understanding features includes:
passing the sequence of mail text content segment granularity semantic understanding feature vectors through a context encoder based on a Transformer module to obtain a mail text content full text semantic understanding feature vector as the mail text content full text semantic understanding features.
In the present application, a detected mail is first acquired and mail text content is extracted from it. The mail text content is then segmented to obtain a sequence of mail text content segments, and segment-granularity semantic analysis is performed on each segment to obtain a sequence of mail text content segment granularity semantic understanding feature vectors. Full text semantic association coding is then performed on that sequence to obtain mail text content full text semantic understanding features, and finally whether the detected mail is spam is determined based on those features. In this way, communication security and efficiency can be improved.
Other features and aspects of the present application will become apparent from the following detailed description of the application with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the application and together with the description, serve to explain the principles of the application.
FIG. 1 illustrates a block diagram of an AI-intelligence based converged communication malicious data identification system in accordance with an embodiment of the application.
Fig. 2 shows a flowchart of an AI-intelligent-based converged communication malicious data identification method in accordance with an embodiment of the present application.
Fig. 3 shows an architecture diagram of an AI-intelligent-based converged communication malicious data identification method in accordance with an embodiment of the present application.
Fig. 4 shows an application scenario diagram of an AI-intelligent-based converged communication malicious data identification system in accordance with an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are also within the scope of the application.
As used in the specification and in the claims, the terms "a," "an," and/or "the" are not limited to the singular and may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may include other steps or elements.
Various exemplary embodiments, features and aspects of the application will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
In addition, numerous specific details are set forth in the following description in order to provide a better illustration of the application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present application.
In view of the above technical problems, the technical concept of the present application is to acquire a detected mail, extract its text content, and then apply an artificial-intelligence-based data processing and semantic understanding technique at the back end to semantically analyze and understand the mail text content, so as to identify the topic of the detected mail and determine whether it is spam. In this way, malicious mail and spam can be identified and filtered automatically through semantic encoding and topic classification of the mail content, which improves communication security and efficiency.
FIG. 1 shows a block diagram of an AI-based converged communication malicious data identification system in accordance with an embodiment of the application. As shown in FIG. 1, the AI-based converged communication malicious data identification system 100 according to an embodiment of the present application includes: a mail acquisition module 110, configured to acquire a detected mail; a mail text content extraction module 120, configured to extract mail text content from the detected mail; a content segmentation module 130, configured to segment the mail text content to obtain a sequence of mail text content segments; a mail text content segment granularity semantic analysis module 140, configured to perform segment-granularity semantic analysis on each mail text content segment in the sequence of mail text content segments to obtain a sequence of mail text content segment granularity semantic understanding feature vectors; a mail text content full text semantic understanding module 150, configured to perform full text semantic association coding on the sequence of mail text content segment granularity semantic understanding feature vectors to obtain mail text content full text semantic understanding features; and a mail subject classification module 160, configured to determine whether the detected mail is spam based on the mail text content full text semantic understanding features.
It should be appreciated that the mail acquisition module 110 may acquire mail from a mail server via a network protocol (e.g., POP3 or IMAP) or by other means. The mail text content extraction module 120 may extract the text by parsing the mail format (e.g., MIME). The content segmentation module 130 divides the text content into a sequence of paragraphs so that the semantic information of the mail can be processed and analyzed more effectively. The segment-granularity semantic analysis performed by the mail text content segment granularity semantic analysis module 140 may include lexical analysis, syntactic analysis, semantic role labeling, and other natural language processing techniques, in order to better understand the meaning of each segment. The full text semantic association coding performed by the mail text content full text semantic understanding module 150 may involve integrating, associating, and encoding the semantic features of the individual segments, in order to better understand the semantics of the entire mail. The mail subject classification module 160 may use a machine learning algorithm (e.g., a classifier) to train on and predict mail categories, and makes classification decisions based on predefined spam and non-spam characteristics. Together, these modules form the AI-based converged communication malicious data identification system, which identifies and classifies malicious data in mail (such as spam) through mail acquisition, text content extraction, segmentation, semantic analysis, and classification.
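As a minimal sketch of the text extraction step, assuming the detected mail is already available as a raw RFC 5322 string, Python's standard `email` package can walk the MIME structure and collect the plain-text parts. The function name `extract_text_content` and the sample message `RAW_MAIL` are hypothetical; the application does not prescribe a particular parsing library or which MIME parts to keep.

```python
from email import message_from_string


def extract_text_content(raw_mail: str) -> str:
    """Return the concatenated text/plain parts of a raw mail message."""
    msg = message_from_string(raw_mail)
    parts = []
    for part in msg.walk():
        # Skip multipart containers, HTML alternatives, and other non-text parts.
        if part.get_content_type() != "text/plain":
            continue
        # Assumption: attached text files are not part of the mail body.
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True)  # bytes after transfer decoding
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts).strip()


# Hypothetical sample: a two-part message with plain-text and HTML alternatives.
RAW_MAIL = "\n".join([
    "From: sender@example.com",
    "To: user@example.com",
    "Subject: Quarterly report",
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary="BOUND"',
    "",
    "--BOUND",
    "Content-Type: text/plain; charset=utf-8",
    "",
    "Hello team,",
    "",
    "The report is attached.",
    "--BOUND",
    "Content-Type: text/html; charset=utf-8",
    "",
    "<p>Hello team,</p>",
    "--BOUND--",
    "",
])
```

Calling `extract_text_content(RAW_MAIL)` keeps the plain-text body and drops the HTML alternative; a production system would likely also need to handle HTML-only mail.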
Specifically, in the technical solution of the application, a detected mail is first acquired and mail text content is extracted from it. The mail text content usually consists of several paragraphs, and each paragraph may carry different keywords, topics, or semantic information. Therefore, to better understand and represent the semantic information of the mail text content, the mail text content is segmented to obtain a sequence of mail text content segments. Separating the mail text content into paragraphs allows each paragraph to be semantically encoded independently in the subsequent steps, so that the semantic features of each paragraph are captured more faithfully. The system can thereby better understand the structure and content of the mail, understand its intent and topic more comprehensively, and identify malicious mail more accurately.
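The segmentation step can be sketched as follows, under the assumption that paragraphs are separated by blank lines. The application does not specify the segmentation rule, so the blank-line heuristic, the function name `segment_text`, and the `min_chars` parameter are all illustrative.

```python
import re


def segment_text(text: str, min_chars: int = 1) -> list[str]:
    """Split mail text into paragraph segments on blank lines."""
    # Normalise line endings so "\r\n" mails segment the same way as "\n" mails.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank (whitespace-only) lines mark a paragraph boundary.
    raw_segments = re.split(r"\n\s*\n", normalised)
    segments = []
    for seg in raw_segments:
        # Collapse internal whitespace so hard-wrapped lines form one segment.
        cleaned = " ".join(seg.split())
        if len(cleaned) >= min_chars:
            segments.append(cleaned)
    return segments
```

Each returned string is one mail text content segment; the order of the list preserves the order of paragraphs in the mail.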
Then, to capture the semantic information of each mail text content segment so that the full text and topic of the mail can later be understood and identified more accurately, each mail text content segment in the sequence of mail text content segments is passed through a segment semantic encoder comprising a word embedding layer and a text convolutional neural network model to obtain the sequence of mail text content segment granularity semantic understanding feature vectors. Encoding each segment in this way captures the semantic feature information of every segment of the mail text content. This supports a more comprehensive and fine-grained understanding of the semantics of the whole mail, so that the intent and topic of the mail content can be judged more accurately and spam and malicious mail can be detected.
Accordingly, the mail text content segment granularity semantic analysis module 140 is configured to: pass each mail text content segment in the sequence of mail text content segments through a segment semantic encoder comprising a word embedding layer and a text convolutional neural network model to obtain the sequence of mail text content segment granularity semantic understanding feature vectors.
It is worth mentioning that the word embedding layer (Word Embedding Layer) is a technique commonly used in natural language processing for mapping words into a continuous vector space. It represents discrete words as dense real-valued vectors, so that the semantic information of words can be represented and processed in vector form. The word embedding layer learns semantic relations among words so that words with similar meanings lie closer together in the vector space. The text convolutional neural network model (Text Convolutional Neural Network, TextCNN) is a text classification model based on convolutional neural networks. It extracts features from the input text with convolution operations, which capture local features at different positions, and then reduces and summarizes the extracted features with pooling operations. The text convolutional neural network model performs well in natural language processing tasks and is particularly suitable for text in which local features strongly influence the overall semantics. In the mail text content segment granularity semantic analysis module 140, the word embedding layer and the text convolutional neural network model together serve as the segment semantic encoder, which semantically analyzes and encodes the mail text content segments to obtain the sequence of mail text content segment granularity semantic understanding feature vectors. Specifically, the word embedding layer maps each word to a word vector, and the text convolutional neural network model extracts the semantic features of each text segment through convolution and pooling operations and encodes them as a feature vector. In this way, each text segment is represented by a vector carrying its semantic information, which facilitates subsequent semantic analysis and processing.
By using the word embedding layer and the text convolutional neural network model, the mail text content segment granularity semantic analysis module converts the mail text content segments into a sequence of semantic understanding feature vectors, which provides the basis for subsequent full text semantic understanding and subject classification.
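The forward pass of such a segment semantic encoder can be sketched in plain Python as embedding lookup, a one-dimensional convolution over word windows, ReLU, and max-over-time pooling. Everything concrete here is an assumption made for illustration: the class name `SegmentEncoder`, the whitespace tokenizer, the tiny dimensions, and the random untrained weights. A real system would learn these parameters during training.

```python
import random


class SegmentEncoder:
    """Toy word-embedding + text-CNN encoder with random, untrained weights."""

    def __init__(self, vocab, embed_dim=8, num_filters=4, kernel_size=2, seed=0):
        rng = random.Random(seed)
        # Index 0 is reserved for out-of-vocabulary words.
        self.vocab = {word: idx for idx, word in enumerate(vocab, start=1)}
        self.kernel_size = kernel_size
        # Embedding table: one row per vocabulary entry plus the unknown-word row.
        self.embeddings = [[rng.uniform(-1, 1) for _ in range(embed_dim)]
                           for _ in range(len(vocab) + 1)]
        # Each filter spans `kernel_size` consecutive word vectors.
        self.filters = [[[rng.uniform(-1, 1) for _ in range(embed_dim)]
                         for _ in range(kernel_size)]
                        for _ in range(num_filters)]
        self.biases = [0.0] * num_filters

    def encode(self, segment):
        """Map one text segment to a feature vector of length num_filters."""
        tokens = segment.lower().split()
        vectors = [self.embeddings[self.vocab.get(tok, 0)] for tok in tokens]
        # Zero-pad short segments so at least one convolution window exists.
        pad = [0.0] * len(self.embeddings[0])
        while len(vectors) < self.kernel_size:
            vectors.append(pad)
        features = []
        for filt, bias in zip(self.filters, self.biases):
            best = 0.0  # ReLU floor: negative activations never win the max
            for start in range(len(vectors) - self.kernel_size + 1):
                window = vectors[start:start + self.kernel_size]
                act = bias + sum(w * x
                                 for w_row, x_row in zip(filt, window)
                                 for w, x in zip(w_row, x_row))
                best = max(best, act)  # ReLU then max-over-time pooling
            features.append(best)
        return features
```

Applying `encode` to every segment in order yields the sequence of segment granularity feature vectors, one fixed-length vector per segment regardless of segment length.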
When processing mail text, the semantic understanding feature vector of each segment contains only the semantic feature information of that segment. Relying on the feature vectors of individual segments alone cannot fully capture the semantic relationships and context of the whole mail, which makes it difficult to accurately identify the subject of the mail and whether it is spam. Therefore, to integrate the semantic feature information of all segments and represent the semantics of the whole mail more completely, the sequence of mail text content segment granularity semantic understanding feature vectors is further encoded by a context encoder based on a Transformer module. This extracts the full-text contextual semantic association features among the segment-level semantic understanding features, yielding the mail text content full text semantic understanding feature vector.
Accordingly, the mail text content full text semantic understanding module 150 is configured to: pass the sequence of mail text content segment granularity semantic understanding feature vectors through a context encoder based on a Transformer module to obtain a mail text content full text semantic understanding feature vector as the mail text content full text semantic understanding features.
It is worth mentioning that the Transformer module is a neural network model for natural language processing tasks, originally proposed for machine translation. It uses a self-attention mechanism to model the context of the input sequence, which allows it to capture dependencies between different positions in the sequence efficiently. In the mail text content full text semantic understanding module 150, the Transformer module is used as a context encoder that converts the sequence of mail text content segment granularity semantic understanding feature vectors into the mail text content full text semantic understanding feature vector. By applying multi-head self-attention combined with a feedforward neural network to the input sequence, the Transformer module performs global context coding on each element of the sequence. In particular, the self-attention mechanism encodes each element so that it takes the context of the entire sequence into account. Such global context coding helps capture the relationships and dependencies between different elements of the sequence and thus gives a better understanding of the semantics of the entire mail text content. Finally, the Transformer module converts the sequence of mail text content segment granularity semantic understanding feature vectors into the mail text content full text semantic understanding feature vector, which provides more comprehensive and accurate semantic information for the subsequent mail subject classification. The Transformer module can process longer sequences and compute in parallel, and is therefore well suited to text data such as mail.
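The core of this context encoding, scaled dot-product self-attention, can be sketched in plain Python. This is a deliberately reduced illustration and not the encoder of the application: it uses a single head, sets queries, keys, and values equal to the input (no learned projections), and omits the feedforward sublayer, residual connections, and positional encoding. The mean pooling used to obtain one full-text vector is also an assumption, since the source does not state how the sequence is reduced to a single vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention with Q = K = V = input."""
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in vectors:
        # Similarity of this segment to every segment in the mail.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in vectors]
        weights = softmax(scores)
        # Each output is a context-weighted mixture of all segment vectors.
        mixed = [sum(w * value[d] for w, value in zip(weights, vectors))
                 for d in range(dim)]
        outputs.append(mixed)
    return outputs


def encode_full_text(segment_vectors):
    """Attend across segments, then mean-pool into one full-text vector."""
    attended = self_attention(segment_vectors)
    dim = len(attended[0])
    return [sum(vec[d] for vec in attended) / len(attended) for d in range(dim)]
```

Because every output position mixes information from all segments, the pooled vector reflects relationships between paragraphs rather than each paragraph in isolation.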
And further, the text content full-text semantic understanding feature vector of the mail is passed through a classifier to obtain a classification result, wherein the classification result is used for indicating whether the detected mail is junk mail or not. That is, the text content of the mail is classified by using the text context semantic association characteristic information based on the segment granularity semantics, so that the subject of the detected mail is identified, and whether the mail is junk mail or not is judged.
Accordingly, the mail subject classification module 160 is configured to pass the mail text content full text semantic understanding feature vector through a classifier to obtain a classification result, where the classification result indicates whether the detected mail is junk mail. Specifically, the mail subject classification module 160 is further configured to: perform full-connection coding on the mail text content full text semantic understanding feature vector using a fully connected layer of the classifier to obtain a coding classification feature vector; and input the coding classification feature vector into a Softmax classification function of the classifier to obtain the classification result.
That is, in the technical solution of the present application, the labels of the classifier are that the detected mail is junk mail (first label) and that the detected mail is not junk mail (second label), and the classifier determines through a Softmax function which classification label the mail text content full text semantic understanding feature vector belongs to. It should be noted that the first label p1 and the second label p2 do not carry any manually set concept. During training, the computer model has no concept of "whether the detected mail is junk mail"; there are simply two classification labels, the output is the probability of the feature under each label, and p1 and p2 sum to one. Therefore, the classification result of whether the detected mail is junk mail is in effect converted, through the classification labels, into a two-class probability distribution, and what is used is the physical meaning of that probability distribution rather than the linguistic meaning of "whether the detected mail is junk mail".
It should be appreciated that the role of the classifier is to learn classification rules from training data with known classes and then to classify (or predict) unknown data. Logistic regression, SVM, and the like are commonly used to solve binary classification problems. They can also be applied to multi-class classification, but this requires composing multiple binary classifiers, which is error-prone and inefficient; the commonly used multi-class method is the Softmax classification function.
It is worth mentioning that full-connection coding (Fully Connected Encoding) is the process of encoding input data through a fully connected layer. The fully connected layer is a common layer type in neural networks, in which each neuron is connected to all neurons of the previous layer and computes its output from weights and a bias. In the mail subject classification module 160, full-connection coding is used to encode the mail text content full text semantic understanding feature vector to obtain the coding classification feature vector. This process may be carried out as follows: 1. Inputting the feature vector: the mail text content full text semantic understanding feature vector is passed to the fully connected layer as input. 2. Weight and bias calculation: each neuron in the fully connected layer has a set of weights and a bias term; for each neuron, the input feature values are multiplied by the corresponding weights, the products are summed, and the bias term is added to obtain a weighted sum. 3. Activation function: after the weighted sum, a nonlinear activation function such as ReLU (Rectified Linear Unit) or Sigmoid is typically applied to introduce a nonlinear transformation and increase expressive capability. 4. Outputting the coding feature vector: the output of the activation function is the coding classification feature vector, which is the encoded representation of the input feature vector in the fully connected layer. The purpose of full-connection coding is to map the input feature vector to an encoded representation in a high-dimensional feature space by learning appropriate weights and biases. Such encoding helps the system learn a more abstract and useful representation of the input data for more accurate classification. A fully connected layer has strong fitting capability and can learn complex relationships and patterns among the input features, which improves the performance of the classifier.
In the mail subject classification module, full-connection coding converts the mail text content full text semantic understanding feature vector into the coding classification feature vector, which is then classified through the Softmax classification function to obtain the final classification result. Full-connection coding helps the system better represent the semantic information of the mail, thereby improving classification accuracy.
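The four steps of full-connection coding followed by the Softmax classification function can be sketched as follows. This is a minimal illustration under stated assumptions, not the trained classifier of the application: the function names and the weight values `W1`, `b1`, `W2`, `b2` are hypothetical, ReLU is chosen as the activation, and a second linear layer is assumed to map the coding classification feature vector to the two label scores.

```python
import math

def dense(x, weights, bias):
    """Weighted sum plus bias for every neuron (one weight row per neuron)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, a) for a in v]

def softmax(xs):
    """Numerically stable softmax; the outputs sum to one."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classify(x, W1, b1, W2, b2):
    """Return (p1, p2): probabilities of 'junk mail' and 'not junk mail'."""
    encoded = relu(dense(x, W1, b1))        # coding classification feature vector
    p1, p2 = softmax(dense(encoded, W2, b2))
    return p1, p2

# Hypothetical weights, chosen only so the arithmetic is easy to follow.
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b1 = [0.0, 0.0]
W2 = [[2.0, 0.0], [0.0, 2.0]]
b2 = [0.0, 0.0]

p1, p2 = classify([0.5, -1.0, 2.0], W1, b1, W2, b2)
print(p1, p2)
```

Because Softmax normalizes the two scores, p1 and p2 always sum to one, which is the property relied on when the labels are interpreted as a probability distribution.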
Further, in the technical solution of the present application, the AI-intelligence-based converged communication malicious data identification system further comprises a training module for training the segment semantic encoder comprising the word embedding layer and the text convolutional neural network model, the context encoder based on the converter module, and the classifier.
It should be appreciated that the training module plays a vital role in the AI-intelligent based converged communication malicious data identification system. It is used to train the various components in the system so that they can learn and identify malicious data effectively. Specifically, the training module is used for training the segment semantic encoder, the context encoder and the classifier. Through the training module, the components can learn the characteristics and modes of the malicious data, so that the identification and classification of the malicious data can be accurately performed. The workflow of the training module generally includes the following steps: 1. data preparation: the training module needs to prepare a labeled training data set containing samples of normal data and malicious data. These samples should be marked to indicate their category (normal or malicious). 2. Feature extraction: the training module will use the segment semantic encoder and the context encoder to perform feature extraction on the text in the training dataset. These feature extractors will semantically analyze and encode the text to obtain feature vectors representing its semantic information. 3. Model training: the training module will train the classifier using the feature vectors and the labeled class information. The classifier may be a variety of machine learning algorithms or neural network models, such as Support Vector Machines (SVMs), decision trees, random forests, deep neural networks, and the like. By training, the classifier will learn how to judge the text category, i.e. normal or malicious, based on the feature vectors. 4. Model evaluation and optimization: the training module evaluates the performance of the model obtained by training and optimizes the model according to the evaluation result. Evaluation typically uses performance metrics such as accuracy, recall, precision, and F1 value. 
If the performance of the model is not ideal, the training module can perform operations such as parameter adjustment, feature engineering or modification of the model structure to optimize the model. Through the training process of the training module, each component in the system can learn the characteristics of malicious data step by step, and the identification capability of the malicious data is improved. Therefore, the system can more accurately identify and filter malicious data in practical application, and communication safety and user experience are improved.
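The evaluation metrics named in the workflow above (accuracy, recall, precision and F1) can be computed from predicted and true labels as in the following sketch. The function name `binary_metrics` and the convention that label 1 means malicious and label 0 means normal are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = malicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malicious, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal, passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malicious, missed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

For mail filtering, precision and recall are usually more informative than accuracy alone, since normal mail typically outnumbers malicious mail.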
In one example, the training module includes: a training data acquisition unit for acquiring training data, wherein the training data comprises training detected mail and a true value of whether the training detected mail is junk mail; a training mail text content extraction unit for extracting training mail text content from the training detected mail; a training content segmentation unit for segmenting the training mail text content to obtain a sequence of training mail text content segments; a training mail text content segment granularity semantic analysis unit for passing each training mail text content segment in the sequence of training mail text content segments through the segment semantic encoder comprising the word embedding layer and the text convolutional neural network model to obtain a sequence of training mail text content segment granularity semantic understanding feature vectors; a training mail text content full text semantic understanding unit for passing the sequence of training mail text content segment granularity semantic understanding feature vectors through the context encoder based on the converter module to obtain a training mail text content full text semantic understanding feature vector; a training classification loss unit for passing the training mail text content full text semantic understanding feature vector through the classifier to obtain a classification loss function value; and a loss training unit for training the segment semantic encoder comprising the word embedding layer and the text convolutional neural network model, the context encoder based on the converter module, and the classifier with the classification loss function value, wherein in each iteration of the training, the training mail text content full text semantic understanding feature vector is corrected.
The training classification loss unit is configured to: process the training mail text content full text semantic understanding feature vector with the classifier according to the following classification training formula to obtain a training classification result, wherein the classification training formula is: $O = \mathrm{softmax}\{(W_n, B_n) : \cdots : (W_1, B_1) \mid X\}$, where $W_1$ to $W_n$ are weight matrices, $B_1$ to $B_n$ are bias vectors, and $X$ is the training mail text content full text semantic understanding feature vector; and calculate a cross entropy value between the training classification result and the true value as the classification loss function value.
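The cross entropy between the training classification result and the true value can be illustrated with a short sketch. The function names are hypothetical, the true value is assumed to be given as the index of the correct label (0 for junk mail, 1 for not junk mail), and the small `eps` clamp is an added numerical safeguard rather than part of the formula.

```python
import math

def cross_entropy(probs, true_index, eps=1e-12):
    """Cross entropy between a probability vector and a one-hot true label.

    With a one-hot target, the sum over labels reduces to the negative
    log-probability assigned to the true label.
    """
    return -math.log(max(probs[true_index], eps))

def mean_cross_entropy(batch_probs, true_indices):
    """Average classification loss over a batch of training mails."""
    losses = [cross_entropy(p, t) for p, t in zip(batch_probs, true_indices)]
    return sum(losses) / len(losses)

# An undecided prediction is penalized; a confident correct one is not.
print(cross_entropy([0.5, 0.5], 0))
print(cross_entropy([0.99, 0.01], 0))
```

The loss shrinks toward zero as the probability assigned to the true label approaches one, which is what gradient-based training of the encoders and the classifier drives toward.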
In the technical solution of the present application, the sequence of training mail text content segment granularity semantic understanding feature vectors expresses the text semantic features of the training mail text content in the local semantic space domain. After this sequence passes through the context encoder based on the converter module, associated text semantic features in the global semantic space domain are further extracted on the basis of the context of the local text semantic features. However, the difference in text semantic distribution between the local semantic space domain and the global semantic space domain also causes text semantic information game discretization in the training mail text content full text semantic understanding feature vector, which affects the classification training of that feature vector through the classifier.
Based on this, the applicant of the present application preferably corrects the training mail text content full text semantic understanding feature vector in each iteration in which that feature vector is trained through the classifier.
Accordingly, in one example, in each iteration of the training, the training mail text content full text semantic understanding feature vector is corrected with a correction formula to obtain a corrected training mail text content full text semantic understanding feature vector. [Correction formula not recoverable from the source text.] The formula is defined in terms of the $i$-th feature value of the training mail text content full text semantic understanding feature vector, a hyperparameter, and a base-2 logarithm, and it yields the $i$-th feature value of the corrected training mail text content full text semantic understanding feature vector.
Specifically, when the training mail text content full text semantic understanding feature vector is iteratively trained through the classifier, the weight matrix of the classifier acts on the feature vector. Because the weight matrix is dense, the text semantic information game discretization among the feature values at the positions of the feature vector gives rise to a large-scale information game, so that the classification solution cannot converge to a Nash equilibrium on the basis of that game. This is especially so when the feature vector carries large-scale imperfect-game discretized information derived from the local semantic space text semantic feature distributions of the training mail text content segment granularity semantic understanding feature vectors. Therefore, by performing equivalent convergence of information game equalization on the training mail text content full text semantic understanding feature vector through self-control equalization of its vector information, the training effect of the feature vector through the classifier can be improved. In this way, malicious mail and junk mail can be automatically identified and filtered through semantic coding and subject classification of the mail content, which improves communication security and efficiency.
In summary, the AI-intelligence-based converged communication malicious data identification system 100 according to an embodiment of the present application has been described; it can improve communication security and efficiency.
As described above, the AI-intelligent-based converged communication malicious data identification system 100 according to the embodiment of the present application may be implemented in various terminal devices, for example, a server or the like having an AI-intelligent-based converged communication malicious data identification algorithm. In one example, the AI-intelligent based converged communication malicious data identification system 100 can be integrated into a terminal device as one software module and/or hardware module. For example, the AI-intelligent-based converged communication malicious data identification system 100 can be a software module in the operating system of the terminal device, or can be an application developed for the terminal device; of course, the AI-intelligent-based converged communication malicious data identification system 100 can also be one of numerous hardware modules of the terminal device.
Alternatively, in another example, the AI-intelligent-based converged communication malicious data identification system 100 and the terminal device can be separate devices, and the AI-intelligent-based converged communication malicious data identification system 100 can be connected to the terminal device via a wired and/or wireless network, and communicate the interaction information in accordance with the agreed data format.
Fig. 2 shows a flowchart of an AI-intelligent-based converged communication malicious data identification method according to an embodiment of the present application. Fig. 3 shows a schematic diagram of the system architecture of the AI-intelligent-based converged communication malicious data identification method according to an embodiment of the present application. As shown in Figs. 2 and 3, the AI-intelligence-based converged communication malicious data identification method according to an embodiment of the present application includes: S110, acquiring detected mail; S120, extracting mail text content from the detected mail; S130, segmenting the mail text content to obtain a sequence of mail text content segments; S140, performing semantic analysis based on segment granularity on each mail text content segment in the sequence of mail text content segments to obtain a sequence of mail text content segment granularity semantic understanding feature vectors; S150, performing full text semantic association coding on the sequence of mail text content segment granularity semantic understanding feature vectors to obtain mail text content full text semantic understanding features; and S160, determining whether the detected mail is junk mail based on the mail text content full text semantic understanding features.
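The flow of steps S130 to S160 can be sketched as a small pipeline in which the encoders and the classifier are passed in as functions. Everything named here is hypothetical: the application does not state the segmentation rule, so splitting on blank lines is an assumption, and the `demo_*` functions are trivial stand-ins for the trained segment semantic encoder, context encoder and classifier. Steps S110 and S120 are assumed to have already produced the plain mail text.

```python
import re
from typing import Callable, List, Tuple

Vector = List[float]

def segment_text(text: str) -> List[str]:
    """S130: split mail text into segments (assumed rule: blank-line breaks)."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

def is_junk_mail(
    mail_text: str,
    encode_segment: Callable[[str], Vector],
    encode_context: Callable[[List[Vector]], Vector],
    classify: Callable[[Vector], Tuple[float, float]],
) -> bool:
    segments = segment_text(mail_text)                   # S130
    seg_vectors = [encode_segment(s) for s in segments]  # S140
    full_vector = encode_context(seg_vectors)            # S150
    p_junk, p_not_junk = classify(full_vector)           # S160
    return p_junk > p_not_junk

# Trivial stand-ins so the sketch runs end to end; not trained models.
def demo_encode_segment(segment: str) -> Vector:
    return [float(segment.lower().count("free"))]

def demo_encode_context(vectors: List[Vector]) -> Vector:
    return [sum(v[0] for v in vectors)]

def demo_classify(vector: Vector) -> Tuple[float, float]:
    p_junk = 0.9 if vector[0] >= 1.0 else 0.1
    return p_junk, 1.0 - p_junk

print(is_junk_mail("Hello\n\nWin a FREE prize",
                   demo_encode_segment, demo_encode_context, demo_classify))
```

Passing the components in as callables keeps the step order of the method separate from the particular models, so the stand-ins could be replaced by trained encoders without changing the flow.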
In one possible implementation manner, performing semantic analysis based on segment granularity on each mail text content segment in the sequence of mail text content segments to obtain a sequence of mail text content segment granularity semantic understanding feature vectors, including: and passing each mail text content segment in the sequence of mail text content segments through a segment semantic encoder comprising a word embedding layer and a text convolutional neural network model to obtain the sequence of mail text content segment granularity semantic understanding feature vectors.
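A segment semantic encoder built from a word embedding layer and a text convolutional neural network can be sketched as an embedding lookup followed by a one-dimensional convolution, ReLU, and max-over-time pooling. The sketch below is illustrative only: the function names, the whitespace tokenizer, the tiny embedding table and the single hand-written kernel are all assumptions, whereas a trained encoder would learn the embeddings and many kernels of several widths.

```python
def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def embed(tokens, table, dim):
    """Word embedding layer: look up each token; unknown words map to zeros."""
    return [table.get(t, [0.0] * dim) for t in tokens]

def conv_relu_maxpool(embedded, kernel, bias=0.0):
    """Slide one kernel over the token sequence and keep the strongest response.

    kernel: list of `width` weight vectors, one per position in the window.
    Starting the running maximum at 0.0 applies ReLU and max pooling together.
    """
    width = len(kernel)
    best = 0.0
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        act = bias + sum(k * e
                         for kv, ev in zip(kernel, window)
                         for k, e in zip(kv, ev))
        best = max(best, act)
    return best

def encode_segment(text, table, dim, kernels):
    """Segment-granularity feature vector: one pooled value per kernel."""
    embedded = embed(tokenize(text), table, dim)
    return [conv_relu_maxpool(embedded, k) for k in kernels]

# Hypothetical two-dimensional embeddings and one width-2 kernel that
# responds to the bigram "free prize".
table = {"free": [1.0, 0.0], "prize": [0.0, 1.0]}
kernels = [[[1.0, 0.0], [0.0, 1.0]]]

print(encode_segment("Win a FREE prize now", table, 2, kernels))
```

Max-over-time pooling makes the feature independent of where in the segment the pattern occurs, which is why convolutional encoders suit short text spans of varying length.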
In one possible implementation manner, performing full text semantic association coding on the sequence of mail text content segment granularity semantic understanding feature vectors to obtain the mail text content full text semantic understanding features includes: passing the sequence of mail text content segment granularity semantic understanding feature vectors through a context encoder based on a converter module to obtain a mail text content full text semantic understanding feature vector as the mail text content full text semantic understanding features.
Here, it will be understood by those skilled in the art that the specific operations of the respective steps in the above-described AI-intelligent-based converged communication malicious data recognition method have been described in detail above with reference to the description of the AI-intelligent-based converged communication malicious data recognition system of fig. 1, and thus, duplicate descriptions thereof will be omitted.
Fig. 4 shows an application scenario diagram of an AI-intelligent-based converged communication malicious data identification system in accordance with an embodiment of the present application. As shown in fig. 4, in this application scenario, first, a detected mail (e.g., D illustrated in fig. 4) is acquired, and then, the detected mail is input to a server (e.g., S illustrated in fig. 4) in which an AI-intelligent-based converged communication malicious data recognition algorithm is deployed, wherein the server can process the detected mail using the AI-intelligent-based converged communication malicious data recognition algorithm to obtain a classification result for indicating whether the detected mail is spam or not.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as a memory including computer program instructions executable by a processing component of an apparatus to perform the above-described method.
The present application may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present application.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of embodiments of the application has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (6)

1. An AI-intelligence-based converged communication malicious data identification system, which is characterized by comprising:
the mail acquisition module is used for acquiring detected mails;
The mail text content extraction module is used for extracting mail text content from the detected mail;
the content segmentation module is used for carrying out segmentation processing on the mail text content to obtain a sequence of mail text content segments;
The mail text content segment granularity semantic analysis module is used for carrying out semantic analysis based on segment granularity on each mail text content segment in the sequence of mail text content segments to obtain a sequence of mail text content segment granularity semantic understanding feature vectors;
The mail text content full text semantic understanding module is used for carrying out full text semantic association coding on the sequence of the mail text content segment granularity semantic understanding feature vectors so as to obtain mail text content full text semantic understanding features; and
The mail subject classification module is used for determining whether the detected mail is junk mail or not based on the text content full-text semantic understanding characteristics of the mail;
The mail text content segment granularity semantic analysis module is used for:
Passing each mail text content segment in the sequence of mail text content segments through a segment semantic encoder comprising a word embedding layer and a text convolutional neural network model to obtain a sequence of granularity semantic understanding feature vectors of the mail text content segments;
the mail text content full text semantic understanding module is used for:
passing the sequence of mail text content segment granularity semantic understanding feature vectors through a context encoder based on a converter module to obtain a mail text content full text semantic understanding feature vector as the mail text content full text semantic understanding features.
2. The AI-intelligence-based converged communication malicious data identification system of claim 1, wherein the mail subject classification module is configured to:
passing the mail text content full text semantic understanding feature vector through a classifier to obtain a classification result, wherein the classification result is used for indicating whether the detected mail is junk mail or not.
3. The AI-intelligence-based converged communication malicious data recognition system of claim 2, further comprising a training module for training the segment semantic encoder including a word embedding layer and a text convolutional neural network model, the context encoder based on a converter module, and the classifier.
4. The AI-intelligence-based converged communication malicious data identification system of claim 3, wherein the training module comprises:
The training data acquisition unit is used for acquiring training data, wherein the training data comprises training detected mail and a true value of whether the training detected mail is junk mail or not;
the training mail text content extraction unit is used for extracting training mail text content from the training detected mail;
the training content segmentation unit is used for carrying out segmentation processing on the training mail text content to obtain a sequence of training mail text content segments;
The training mail text content segment granularity semantic analysis unit is used for enabling each training mail text content segment in the training mail text content segment sequence to pass through the segment semantic encoder comprising the word embedding layer and the text convolutional neural network model to obtain a training mail text content segment granularity semantic understanding feature vector sequence;
The text content full text semantic understanding unit of the training mail is used for enabling the sequence of the text content segment granularity semantic understanding feature vectors of the training mail to pass through the context encoder based on the converter module to obtain text content full text semantic understanding feature vectors of the training mail;
The training classification loss unit is used for enabling the text content full-text semantic understanding feature vector of the training mail to pass through the classifier to obtain a classification loss function value; and
And the loss training unit is used for training the segment semantic encoder comprising the word embedding layer and the text convolutional neural network model, the context encoder based on the converter module and the classifier by using the classification loss function value, wherein in each iteration of training, the full-text semantic understanding feature vector of the text content of the training mail is corrected.
5. The AI-intelligence-based converged communication malicious data identification system of claim 4, wherein the training classification loss unit is configured to:
process the training mail text content full text semantic understanding feature vector with the classifier according to the following classification training formula to obtain a training classification result;
wherein the classification training formula is: $O = \mathrm{softmax}\{(W_n, B_n) : \cdots : (W_1, B_1) \mid X\}$, where $W_1$ to $W_n$ are weight matrices, $B_1$ to $B_n$ are bias vectors, and $X$ is the training mail text content full text semantic understanding feature vector; and
calculate a cross entropy value between the training classification result and the true value as the classification loss function value.
6. An AI-intelligence-based converged communication malicious data identification method, characterized by comprising the following steps:
Acquiring detected mails;
extracting mail text content from the detected mail;
segmenting the mail text content to obtain a sequence of mail text content segments;
Performing semantic analysis based on segment granularity on each mail text content segment in the sequence of mail text content segments to obtain a sequence of mail text content segment granularity semantic understanding feature vectors;
Performing full text semantic association coding on the sequence of the mail text content segment granularity semantic understanding feature vector to obtain full text semantic understanding features of mail text content; and
Determining whether the detected mail is junk mail or not based on the text content full text semantic understanding characteristics of the mail;
wherein performing semantic analysis based on segment granularity on each mail text content segment in the sequence of mail text content segments to obtain the sequence of mail text content segment granularity semantic understanding feature vectors comprises:
passing each mail text content segment in the sequence of mail text content segments through a segment semantic encoder comprising a word embedding layer and a text convolutional neural network model to obtain the sequence of mail text content segment granularity semantic understanding feature vectors;
wherein performing full text semantic association coding on the sequence of mail text content segment granularity semantic understanding feature vectors to obtain the mail text content full text semantic understanding features comprises:
passing the sequence of mail text content segment granularity semantic understanding feature vectors through a context encoder based on a converter module to obtain a mail text content full text semantic understanding feature vector as the mail text content full text semantic understanding features.
CN202410143834.1A 2024-02-01 2024-02-01 System and method for identifying converged communication malicious data based on AI intelligence Active CN117978499B (en)

Publications (2)

Publication Number Publication Date
CN117978499A CN117978499A (en) 2024-05-03
CN117978499B true CN117978499B (en) 2024-08-20

Also Published As

Publication number Publication date
CN117978499A (en) 2024-05-03

Similar Documents

Publication Publication Date Title
CN112487149A (en) Text auditing method, model, equipment and storage medium
CN113806547A (en) A deep learning multi-label text classification method based on graph model
CN107766371A (en) Text message sorting technique and its device
CN107908715A (en) Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion
CN109889436B (en) Method for discovering spammer in social network
CN107294834A (en) Method and apparatus for recognizing spam
CN114428855A (en) Service record classification method for hierarchy and mixed data type
CN111859955A (en) A public opinion data analysis model based on deep learning
CN114065749B (en) A text-oriented Cantonese recognition model and system training and recognition method
CN116992304A (en) Policy matching analysis system and method based on artificial intelligence
CN116150404A (en) A multi-modal knowledge map construction method for educational resources based on federated learning
CN116416480A (en) A visual classification method and device based on multi-template hint learning
CN113535946A (en) Text identification method, device and equipment based on deep learning and storage medium
CN113723426B (en) Image classification method and device based on deep multi-stream neural network
CN116911929A (en) Advertising service terminal and method based on big data
US8699796B1 (en) Identifying sensitive expressions in images for languages with large alphabets
CN116226711B (en) Case dispute focus identification method and device based on multi-feature fusion
CN112541082A (en) Text emotion classification method and system
CN117978499B (en) System and method for identifying converged communication malicious data based on AI intelligence
CN117221839B (en) 5G signaling identification method and system thereof
CN115357718B (en) Subject integration service duplicate material discovery method, system, device and storage medium
CN113722495A (en) Financial text relation extraction and classification method fused with regular expression
CN112884009A (en) Classification model training method and system
CN116956289B (en) Method for dynamically adjusting potential blacklist and blacklist
CN115587318B (en) A source code classification method based on neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant