
CN117914599A - Malicious traffic identification method in mobile networks based on graph neural network - Google Patents

Info

Publication number
CN117914599A
Authority
CN
China
Prior art keywords
graph
node
data
training
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410088195.3A
Other languages
Chinese (zh)
Inventor
黑新宏
王欣
姬文江
朱磊
邱原
高苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202410088195.3A
Publication of CN117914599A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425: Traffic logging, e.g. anomaly detection
    • H04L63/1416: Event detection, e.g. attack signature detection
    • H04L63/1433: Vulnerability analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/50: Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method for identifying malicious traffic in mobile networks based on a graph neural network, implemented in the following steps: step 1, collect traffic data in a 5G network, extract features, and label the data; step 2, train an XGBoost model to obtain an importance score for every feature column, and select the 24 highest-scoring features; step 3, construct a data set and divide it into a training set and a test set; step 4, construct the malicious traffic prediction model EGRAPHSAGE; step 5, train the EGRAPHSAGE model constructed in step 4 on the training set to obtain a classification model; step 6, input the test set into the classification model and evaluate its performance. The invention addresses the poor accuracy of malicious traffic identification that results when machine learning methods consider only the features of individual flows and ignore the relationships among flows.

Description

Mobile network malicious traffic identification method based on graph neural network
Technical Field
The invention belongs to the technical field of network and information security intrusion detection methods, and relates to a mobile network malicious traffic identification method based on a graph neural network.
Background
With the development of networks and the explosion of big data, 5G mobile communication technology has emerged. 5G networks offer higher data rates, lower latency, and larger network capacity, and support emerging applications such as large-scale Internet of Things connectivity, smart cities, the industrial Internet, and autonomous vehicles. However, security problems have existed for as long as networks have, and 5G networks are no exception. Attacks generate malicious traffic, that is, data flows with malicious intent transmitted in the mobile network. These flows may contain malware, viruses, worms, botnet command-and-control traffic, or packets used for network attacks. The purpose of malicious traffic may be to intrude into a system, steal sensitive information, disrupt or interfere with network services, or use infected computers to carry out other illegal activities.
Both 5G networks and general networks transmit information through data traffic, and both rely on the TCP/IP protocol suite at the lower layers. They differ in three respects. In protocols, 5G introduces new protocols and technologies, such as the mobile communication protocols NGAP, HTTP2, GTP, and 5G-NAS, as well as network slicing and multiple access technologies. In core network architecture, a general network such as a 4G core network is typically based on a traditional hierarchical model comprising a core network, a radio access network, and user equipment, whereas a 5G network adopts a cloud-based virtualized architecture that decomposes the core network functions into multiple network function virtualization (NFV) and software-defined networking (SDN) entities, and introduces new concepts such as network slicing and edge computing. In network attacks, the complexity of 5G networks and the introduction of new technologies give rise to specific attacks on virtualized and software-defined networks, such as exploitation of virtual network functions, intrusion into edge computing nodes, or cross-slice attacks on network slices. Because of the particularity of the network, the difficulty of obtaining background and sample traffic, and low data quality, existing active defense methods offer few defense schemes and insufficient accuracy. Identifying malicious traffic in mobile networks can help to address these security problems to some extent.
At present, most methods for identifying malicious traffic in mobile networks are machine learning methods, which model and classify traffic data with algorithms such as decision trees, support vector machines, and random forests. However, these methods consider only the features of individual flows and ignore the relationships among flows, so their accuracy in identifying malicious traffic is poor and the interpretability of the models is limited.
Disclosure of Invention
The invention aims to provide a method for identifying malicious traffic in mobile networks based on a graph neural network, which addresses the poor accuracy of prior-art machine learning methods that consider only the features of individual flows and ignore the relationships among flows.
The technical scheme adopted by the invention is a method for identifying malicious traffic in mobile networks based on a graph neural network, implemented in the following steps: step 1, collect traffic data in a 5G network, then extract features from the collected traffic data with the CICFlowMeter tool, so that each flow has 84 feature columns; manually annotate the extracted Label feature, and take the annotated data as the original data; step 2, preprocess the original data, discard the four features Src IP, Dst IP, Src Port, and Dst Port from each preprocessed record, then train an XGBoost model on the remaining data to obtain an importance score for every feature column, and select the 24 highest-scoring features; step 3, combine the 24 highest-scoring features selected in step 2 with the four previously discarded features Src IP, Dst IP, Src Port, and Dst Port to obtain a new data set, and divide the data set into a training set and a test set; step 4, construct the malicious traffic prediction model EGRAPHSAGE; step 5, train the EGRAPHSAGE model constructed in step 4 on the training set to obtain a classification model; step 6, input the test set into the classification model and evaluate its performance.
The present invention is also characterized in that,
Collecting traffic data in a 5G network in step 1 specifically comprises: capturing, in a virtual machine with the tcpdump command, the traffic of normal 5G registration and Internet access as well as the traffic of four deployed abnormal scenarios, the captured traffic being stored as pcap files;
the manual annotation in step 1 means manually assigning a traffic type to the Label feature of each flow, the traffic types being normal background traffic and abnormal sample traffic;
the manually annotated traffic types are encoded with a label encoding method, and the encoded value indicates the specific traffic type.
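The label-encoding step can be illustrated with a minimal sketch. The class-name strings and the `encode_labels` helper are hypothetical, and assigning codes in sorted order is an assumption; the patent only states that a label encoding method is used.

```python
def encode_labels(labels):
    """Map each distinct traffic-type string to an integer code (sorted order)."""
    classes = sorted(set(labels))
    mapping = {name: code for code, name in enumerate(classes)}
    return [mapping[name] for name in labels], mapping

# Hypothetical label strings; the patent does not specify their spelling.
raw = ["Normal", "DDoS", "Normal", "UDPScan", "LocationLeak", "SMCReplay"]
codes, mapping = encode_labels(raw)
```

Each flow keeps exactly one integer code, so the encoded column can be used directly as the classification target.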
The preprocessing in step 2 is specifically as follows:
replace all inf and NaN values in the features of the original data with 0; map all IP addresses in the original data to randomly assigned IP addresses.
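As a rough illustration of the value-cleaning part of this preprocessing, the following library-free sketch replaces infinite and NaN cells with 0. The helper names and the toy rows are assumptions made for the example; a real pipeline would more likely operate on a DataFrame.

```python
import math

def clean_value(x):
    """Return 0.0 for inf, -inf, or NaN; otherwise return the value unchanged."""
    if isinstance(x, float) and (math.isinf(x) or math.isnan(x)):
        return 0.0
    return x

def clean_rows(rows):
    """Apply clean_value to every cell of a list of feature rows."""
    return [[clean_value(v) for v in row] for row in rows]

# Toy feature rows containing the problematic values.
rows = [[1.5, float("inf"), 3], [float("nan"), 2.0, float("-inf")]]
cleaned = clean_rows(rows)
```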
In step 2, after the four features Src IP, Dst IP, Src Port, and Dst Port are discarded from each preprocessed record, the Flow ID and Timestamp features are also discarded. The remaining 77 feature columns other than the Label feature are used as the feature input of an XGBoost model and the Label feature as its label; the importance scores of all feature columns are read from the feature_importances_ attribute of the trained XGBoost model, and sorting these scores yields the 24 highest-scoring columns among the 77.
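The ranking step can be sketched without any library. The scores below are synthetic placeholders and the column names `f0` to `f76` are hypothetical; in the described method the scores would come from the trained XGBoost model's `feature_importances_` attribute.

```python
def top_k_features(importances, k=24):
    """Return the k feature names with the highest importance scores."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Synthetic, distinct scores for 77 placeholder feature columns.
importances = {f"f{i}": (i * 37 % 77) / 77 for i in range(77)}
selected = top_k_features(importances)
```

The selected names can then be used to slice the 24 columns that become edge features later on.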
The step 3 is specifically as follows:
Step 3.1, for each record, form two 2-tuples from the four features Src IP, Dst IP, Src Port, and Dst Port, namely (Src IP, Src Port) and (Dst IP, Dst Port), and combine the 24 features selected in step 2 with the two 2-tuples, so that each flow has 26 feature columns;
Step 3.2, take all flows processed in step 3.1 as data set samples and divide them into a training set and a test set;
Step 3.3, for the training set and the test set separately, construct an undirected graph from all flows with the from_pandas_edgelist method of the networkx library, then construct a graph G from the undirected graph with the from_networkx method of the DGL library; the graph G obtained from the training set is the training graph, and the graph G obtained from the test set is the test graph;
Step 3.4, expand the elements of the training graph and of the test graph created in step 3.3 to obtain the expanded training graph and the expanded test graph.
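Steps 3.1 and 3.2 can be sketched as follows. The `add_endpoints` and `split` helpers, the `src`/`dst` key names, the 70/30 ratio, and the toy flows are all assumptions for illustration; the patent does not state the split ratio.

```python
import random

def add_endpoints(flow):
    """Replace the four address columns with two (IP, port) endpoint tuples."""
    out = {k: v for k, v in flow.items()
           if k not in ("Src IP", "Src Port", "Dst IP", "Dst Port")}
    out["src"] = (flow["Src IP"], flow["Src Port"])
    out["dst"] = (flow["Dst IP"], flow["Dst Port"])
    return out

def split(samples, test_ratio=0.3, seed=0):
    """Shuffle and split the samples; the 70/30 ratio is an assumption."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Toy flows with only the address columns and a label.
flows = [{"Src IP": f"10.0.0.{i}", "Src Port": 1000 + i,
          "Dst IP": "10.0.1.1", "Dst Port": 80, "Label": i % 2}
         for i in range(10)]
samples = [add_endpoints(f) for f in flows]
train, test = split(samples)
```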
In step 3.3, constructing an undirected graph from all flows of the training set or the test set with the from_pandas_edgelist method of the networkx library is specifically as follows:
the df parameter is the tabular data holding the 26 feature columns of the flows in the training set or the test set; it is the data to be converted into a graph, its format is DataFrame, it has 26 feature columns, and its number of rows equals the number of flows in the training set or the test set, one flow per row;
the source parameter is the (Src IP, Src Port) 2-tuple column of the table, that is, the valid column name in df for the source nodes of the undirected graph;
the target parameter is the (Dst IP, Dst Port) 2-tuple column of the table, that is, the valid column name in df for the target nodes of the undirected graph;
the edge_attr parameter is the 24 feature columns selected in step 2 together with the Label feature; these become the features of the edge between the corresponding source and target nodes, that is, the edge features;
the create_using parameter is MultiGraph();
setting these parameters yields the undirected graph;
The graph G created in step 3.3 includes the number of nodes num_nodes, the number of edges num_edges, the formats ndata_schemes of all node attributes, the formats edata_schemes of all edge attributes, the object ndata containing all nodes, and the object edata containing all edges.
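To make the graph structure concrete, here is a library-free stand-in for the networkx and DGL calls named above: endpoints become nodes and every flow becomes its own edge, so parallel edges are kept as in a multigraph. The `build_multigraph` helper, the key names, and the two toy features `dur` and `pkts` are hypothetical.

```python
def build_multigraph(samples, feature_names):
    """Build an undirected multigraph: endpoints become nodes, flows become edges."""
    node_ids = {}   # endpoint tuple -> integer node id
    edges = []      # one entry per flow; parallel edges are allowed
    for s in samples:
        for endpoint in (s["src"], s["dst"]):
            node_ids.setdefault(endpoint, len(node_ids))
        edges.append({
            "u": node_ids[s["src"]],
            "v": node_ids[s["dst"]],
            "h": [s[name] for name in feature_names],  # edge features
            "label": s["Label"],                       # edge label
        })
    return node_ids, edges

samples = [
    {"src": ("10.0.0.1", 5000), "dst": ("10.0.0.9", 80), "dur": 1.0, "pkts": 4, "Label": 0},
    {"src": ("10.0.0.1", 5000), "dst": ("10.0.0.9", 80), "dur": 0.2, "pkts": 9, "Label": 1},
    {"src": ("10.0.0.2", 5001), "dst": ("10.0.0.9", 80), "dur": 3.0, "pkts": 1, "Label": 0},
]
node_ids, edges = build_multigraph(samples, ["dur", "pkts"])
```

The first two flows share both endpoints, yet they remain two separate edges with their own features and labels, which is why a multigraph is needed.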
The step 3.4 specifically comprises the following steps:
an attribute "h2" is added to ndata; its value is a numeric matrix whose entries are all 3, with num_nodes rows and as many columns as the number of feature columns selected by the XGBoost model in step 2; the attribute "h2" holds the features of all nodes in graph G;
edata already holds the content of the edge_attr parameter once the graph has been created with networkx and the DGL library; specifically, that content produces the edata attributes "h1" and "L", where "h1" holds the features of all edges in the graph and corresponds to the 24 feature columns selected by the XGBoost model in step 2, and "L" holds the labels of all edges in the graph, that is, of all flows, and corresponds to the Label feature;
edata_schemes already contains the attributes "S1" and "label" once the graph has been created with networkx and the DGL library, where "S1" is the shape of the feature stored on any single edge, namely (1, 24), and "label" is the format of the label of each edge;
ndata_schemes is initially empty; after the attribute "h2" is added to ndata, a new attribute "S2" appears in ndata_schemes, giving the shape of the feature stored on any single node, namely (1, 24). At this point graph G has been expanded. The number of edges in the graph equals the total number of flows in the initial data set; the number of nodes equals the number of distinct (Src IP, Src Port) and (Dst IP, Dst Port) 2-tuples across all flows.
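The node-feature initialisation can be sketched in plain Python. The helper name and the node count of 5 are hypothetical, and the fill value simply follows the text above; a real implementation would store a tensor in the graph's ndata instead of nested lists.

```python
def init_node_features(num_nodes, num_features, fill=3.0):
    """Return a num_nodes x num_features matrix filled with one constant."""
    return [[fill] * num_features for _ in range(num_nodes)]

# 24 columns to match the number of selected edge features.
h2 = init_node_features(num_nodes=5, num_features=24)
# Per-node feature shape, mirroring the scheme entry described above.
schemes = {"h2": (1, len(h2[0]))}
```

Because every node starts with the same constant vector, all discriminative information initially lives on the edges.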
The malicious traffic prediction model EGRAPHSAGE in step 4 includes two SAGE layers, one Dropout layer and one MLPPredictor layer connected in sequence.
Step 5 is specifically as follows: input the training graph obtained in step 3, together with the features of all its nodes and edges, into the constructed malicious traffic prediction model EGRAPHSAGE for training, where the node features and edge features are the value of the attribute "h2" in ndata and the value of the attribute "h1" in edata from step 3.4, respectively;
The training process is as follows:
Step 5.1, the SAGE layer traverses all nodes of the training graph, samples the neighbor nodes of any node v, aggregates the features of the neighbor nodes of every layer with an aggregation function, and uses the aggregated neighbor features as the new feature of the corresponding node;
The neighbors of any node v are defined as the first-layer neighbor nodes of v, the neighbors of the first-layer neighbor nodes as the second-layer neighbor nodes of v, and so on; a node v thus has multiple layers of neighbor nodes, and v is assumed to have k layers. Sampling the neighbor nodes of any node v in the training graph and aggregating their features with an aggregation function then specifically comprises:
Calculating the feature $h_v^{k}$ of any node v after aggregating the information of its k-layer neighbor nodes:
$$h_v^{k} = \sigma\left(W^{k} \cdot \mathrm{CONCAT}\left(h_v^{k-1},\ h_{N(v)}^{k}\right)\right) \quad (1)$$
where CONCAT is the concatenation function, $W^{k}$ is the weight matrix of layer k, and $\sigma$ is the activation function; $h_v^{k-1}$ is the feature of node v after aggregating the information of its (k-1)-layer neighbor nodes;
$h_{N(v)}^{k}$ is calculated according to the following formula:
$$h_{N(v)}^{k} = \mathrm{AGG}_k\left(\left\{ e_{uv}^{k-1},\ \forall u \in N(v),\ uv \in \varepsilon \right\}\right) \quad (2)$$
where $\varepsilon$ is the set of all edges in the training graph; $e_{uv}^{k-1}$ is the feature of edge uv in the sampled neighborhood N(v) of the (k-1)-layer neighbor nodes of node v, the edge uv being the edge from any node u in that sampled neighborhood to node v; AGG is the aggregation function; N(v) is the sampled set of neighbor nodes of node v; and the set $\{\forall u \in N(v),\ uv \in \varepsilon\}$ denotes the edges in the neighborhood N(v);
all edges in the training graph are then traversed to obtain the embedded feature $z_{uv}^{k}$ of any edge uv, which is the concatenation of the features of node u and node v after aggregating the information of k layers of neighbor nodes:
$$z_{uv}^{k} = \mathrm{CONCAT}\left(h_u^{k},\ h_v^{k}\right) \quad (3)$$
where $h_u^{k}$ and $h_v^{k}$ are calculated according to formula (1);
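The node update and edge embedding described above can be traced on a toy graph. This is a minimal sketch under several assumptions: the aggregation function is taken to be the mean, the activation is ReLU, no neighbor sampling is performed, and the weight matrix `W` and all feature values are made up for the example.

```python
def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def sage_layer(node_feats, edges, W):
    """One layer: aggregate incident edge features, concat with the node's own, transform."""
    incident = {v: [] for v in node_feats}
    for u, v, e in edges:      # undirected: both endpoints see the edge feature
        incident[u].append(e)
        incident[v].append(e)
    new_feats = {}
    for v, h_v in node_feats.items():
        # Every node built from a flow has at least one incident edge.
        h_nv = mean(incident[v])
        new_feats[v] = relu(matvec(W, h_v + h_nv))  # sigma(W . CONCAT(h_v, h_N(v)))
    return new_feats

def edge_embeddings(node_feats, edges):
    """Embedding of edge uv is the concatenation of its endpoint features."""
    return [node_feats[u] + node_feats[v] for u, v, _ in edges]

node_feats = {0: [1.0, 1.0], 1: [1.0, 1.0], 2: [1.0, 1.0]}
edges = [(0, 1, [2.0, 4.0]), (0, 2, [4.0, 0.0])]
W = [[0.5, 0.0, 0.5, 0.0],    # hypothetical 2x4 weight matrix
     [0.0, -1.0, 0.0, 0.25]]
h1 = sage_layer(node_feats, edges, W)
z = edge_embeddings(h1, edges)
```

Node 0 touches both edges, so its aggregated neighborhood vector is the mean of the two edge features, while nodes 1 and 2 each see a single edge.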
Step 5.2, input the training graph carrying the new node features and edge embeddings from step 5.1 into the Dropout layer, which discards elements of the edge features in the training graph, specifically:
for the elements of the feature of each edge in the training graph, the Dropout layer randomly sets each element to zero with the specified drop probability, yielding the updated edge features, calculated according to formula (4):
Wherein p is the set discard proportion and s is Binominal is a Binomial distribution function;
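For intuition, the following sketch implements standard inverted dropout, in which surviving elements are rescaled by 1/(1-p). Whether formula (4) applies the same rescaling is an assumption, and the helper name, seed, and toy feature vector are illustrative only.

```python
import random

def dropout(features, p, rng):
    """Zero each element with probability p; scale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in features]

rng = random.Random(0)
edge_feature = [1.0] * 1000          # toy edge feature vector
dropped = dropout(edge_feature, p=0.2, rng=rng)
```

The rescaling keeps the expected value of each element unchanged, so no correction is needed when dropout is switched off at test time.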
Step 5.3, input the training graph with the edge features updated by the Dropout layer into the MLPPredictor layer, and perform classification prediction with formula (5) to obtain the prediction result A:
wherein, B is a bias vector, exp is an exponential function, sum is a sum function, and a prediction result A is finally obtained;
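The symbols listed for formula (5), a bias vector with exp and sum, suggest a linear layer followed by softmax. The sketch below shows that reading; it is an assumption about the formula, and the weights `W`, bias `b`, and input embedding are made up for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax: exp(x_i) / sum_j exp(x_j)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, b, z):
    """Linear layer followed by softmax over the traffic classes."""
    logits = [sum(w * x for w, x in zip(row, z)) + bi for row, bi in zip(W, b)]
    return softmax(logits)

# Hypothetical weights for a 3-feature edge embedding and 2 classes.
W = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b = [0.0, 0.5]
A = predict(W, b, [2.0, 1.0, 0.5])
```

Both logits equal 1.5 here, so the two classes receive equal probability.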
Step 5.4, compute the cross-entropy loss between the prediction result A and the true labels to obtain a loss value, and optimize with the Adam optimizer at a learning rate of 0.001; the model obtained when the iterations finish is the trained malicious traffic prediction model EGRAPHSAGE.
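The loss computation can be illustrated on a few made-up predictions. This sketch covers only the cross-entropy value, not the Adam update; the probabilities, labels, and the small `eps` guard are assumptions for the example.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class for each edge."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)

# Toy predicted class probabilities for three edges and their true labels.
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 1, 0]
loss = cross_entropy(probs, labels)
```

Confident correct predictions contribute little to the loss, while the uncertain third edge contributes the most.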
Step 6 is specifically as follows: perform a classification test on the test graph obtained in step 3 with the malicious traffic prediction model EGRAPHSAGE trained in step 5, and evaluate the classification model with the confusion matrix, accuracy, recall, and F1-score.
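The evaluation metrics can be computed from a confusion matrix as in the following sketch. The helper names and the toy labels are hypothetical, and reporting the scores per class is an assumption; the patent does not say how the metrics are averaged.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i][j] counts samples of true class i predicted as class j."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def per_class_scores(cm):
    """Return (precision, recall, f1) per class; 0.0 where undefined."""
    scores = []
    for c in range(len(cm)):
        tp = cm[c][c]
        predicted = sum(row[c] for row in cm)
        actual = sum(cm[c])
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append((precision, recall, f1))
    return scores

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)
scores = per_class_scores(cm)
```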
The beneficial effects of the invention are as follows:
Before model training, the invention processes the extracted raw network traffic data and then performs feature selection with XGBoost, which effectively mitigates the effect of redundant features on model accuracy. After the important traffic features are obtained, graphs are constructed with networkx and the DGL library from the two 2-tuples (Src IP, Src Port) and (Dst IP, Dst Port), and the selected features are assigned to the edges of the graphs. This improves the accuracy of network traffic detection and provides a reliability guarantee for intrusion detection; the resulting EGRAPHSAGE-XGBoost malicious traffic detection method improves the accuracy of network traffic classification in mobile networks.
Drawings
FIG. 1 is a flow chart of a mobile network malicious traffic identification method based on a graph neural network of the present invention;
Fig. 2 is a confusion matrix diagram corresponding to the classification result of the model in embodiment 2 of the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
Example 1
The method for identifying malicious traffic in mobile networks based on a graph neural network, whose flow chart is shown in Fig. 1, is implemented in the following steps:
Step 1, collect traffic data in a 5G network, then extract features from the collected traffic data with the CICFlowMeter tool, so that each flow has 84 feature columns, including the flow ID (Flow ID, string type), source IP address (Src IP, string type), destination IP address (Dst IP, string type), source port number (Src Port, integer type), destination port number (Dst Port, integer type), protocol type (Protocol, integer type), timestamp (Timestamp, object type), flow packets per second (Flow Packet/s, numeric type), the label (Label), and so on; manually annotate the extracted Label feature, and take the annotated data as the original data;
Collecting traffic data in the 5G network is specifically as follows: capture, in a virtual machine with the tcpdump command, the traffic of normal 5G registration and Internet access as well as the traffic of four deployed abnormal scenarios, the captured traffic being stored as pcap files. The 5G network simulation environment free5GC (an open-source 5G network simulation project) is used to capture five types of network data, namely normal background traffic, DDoS abnormal traffic, location-leak abnormal traffic, SMC replay abnormal traffic, and UDP port-scan abnormal traffic;
Manual annotation means analyzing the protocols in the 5G network and manually labeling each flow in the captured pcap files according to the type of its abnormal traffic field and the time at which the abnormal scenario was deployed. There are five manually annotated traffic types, corresponding to the five captured traffic types. The annotated traffic types are encoded with a label encoding method, and the encoded value indicates the specific traffic type; for five types of data, for example, the encoded values are (0, 1, 2, 3, 4). Protocol is the protocol type used by the interaction and takes one of three values: 6 for TCP, 17 for UDP, and 0 for all other protocols;
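The three-valued Protocol encoding stated above is small enough to write out directly. The `encode_protocol` helper and the case-insensitive matching are assumptions for the example.

```python
def encode_protocol(name):
    """Map a protocol name to its number: TCP -> 6, UDP -> 17, anything else -> 0."""
    return {"TCP": 6, "UDP": 17}.get(name.upper(), 0)

codes = [encode_protocol(p) for p in ["tcp", "UDP", "SCTP"]]
```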
Step 2, preprocess the original data and discard the four features Src IP, Dst IP, Src Port, and Dst Port from each preprocessed record, then discard the Flow ID and Timestamp features. The remaining 77 feature columns other than the Label feature are used as the feature input of an XGBoost model and the Label feature as its label; the importance scores of all feature columns are read from the feature_importances_ attribute of the trained XGBoost model, and sorting these scores yields the 24 highest-scoring columns among the 77;
The Flow ID and Timestamp features are discarded because Flow ID duplicates the information in the four features Src IP, Dst IP, Src Port, and Dst Port, and because Timestamp is of object type with a different value in every record, which makes it very hard to encode for subsequent model analysis.
The preprocessing is specifically as follows:
replace all inf and NaN values in the features of the original data with 0; map all IP addresses in the original data to randomly assigned IP addresses;
All inf and NaN values are first replaced with 0 so that the model can be trained without discarding the many records that contain infinite or missing values. Because the abnormal scenarios are deployed deliberately, the attackers use only a few IP addresses with recognizable characteristics. Random IP address mapping avoids the potential problem of the source IP address inadvertently acting as a label for attack traffic. For example, source IP addresses are mapped to randomly assigned IP addresses in the range 192.16.0.1 to 192.31.0.1.
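One way to realise the random IP mapping is sketched below, using the address range from the example above. The helper name and seed are hypothetical, and mapping each distinct original address to one distinct random address is an assumption; the patent does not say whether the mapping is per address or per flow.

```python
import ipaddress
import random

def random_ip_mapping(addresses, low="192.16.0.1", high="192.31.0.1", seed=0):
    """Assign each distinct address a distinct random address in [low, high]."""
    lo, hi = int(ipaddress.IPv4Address(low)), int(ipaddress.IPv4Address(high))
    unique = sorted(set(addresses))
    rng = random.Random(seed)
    picks = rng.sample(range(lo, hi + 1), len(unique))  # sampled without replacement
    return {a: str(ipaddress.IPv4Address(p)) for a, p in zip(unique, picks)}

src_ips = ["10.0.0.5", "10.0.0.7", "10.0.0.5"]
mapping = random_ip_mapping(src_ips)
anonymised = [mapping[a] for a in src_ips]
```

A consistent per-address mapping keeps the graph topology intact, since flows that shared an endpoint before the mapping still share one afterwards.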
Step 3, combine the 24 highest-scoring features selected in step 2 with the four previously discarded features Src IP, Dst IP, Src Port, and Dst Port to obtain a new data set, and divide the data set into a training set and a test set, specifically as follows:
Step 3.1, for each record, form two 2-tuples from the four features Src IP, Dst IP, Src Port, and Dst Port, namely (Src IP, Src Port) and (Dst IP, Dst Port), and combine the 24 features selected in step 2 with the two 2-tuples, so that each flow has 26 feature columns;
Step 3.2, take all flows processed in step 3.1 as data set samples and divide them into a training set and a test set;
Step 3.3, for the training set and the test set separately, construct an undirected graph from all flows with the from_pandas_edgelist method of the networkx library, then construct a graph G from the undirected graph with the from_networkx method of the DGL library; the graph G obtained from the training set is the training graph, and the graph G obtained from the test set is the test graph;
Step 3.4, expand the elements of the training graph and of the test graph created in step 3.3 to obtain the expanded training graph and the expanded test graph.
In step 3.3, constructing an undirected graph from all flows of the training set or the test set with the from_pandas_edgelist method of the networkx library is specifically as follows:
the df parameter is the tabular data holding the 26 feature columns of the flows in the training set or the test set; it is the data to be converted into a graph, its format is DataFrame, it has 26 feature columns, and its number of rows equals the number of flows in the training set or the test set, one flow per row;
the source parameter is the (Src IP, Src Port) 2-tuple column of the table, that is, the valid column name in df for the source nodes of the undirected graph;
the target parameter is the (Dst IP, Dst Port) 2-tuple column of the table, that is, the valid column name in df for the target nodes of the undirected graph;
the edge_attr parameter is the 24 feature columns selected in step 2 together with the Label feature; these become the features of the edge between the corresponding source and target nodes, that is, the edge features;
the create_using parameter is MultiGraph();
setting these parameters yields the undirected graph;
The graph G created in step 3.3 includes the number of nodes num_nodes, the number of edges num_edges, the formats ndata_schemes of all node attributes, the formats edata_schemes of all edge attributes, the object ndata containing all nodes, and the object edata containing all edges.
The step 3.4 specifically comprises the following steps:
an attribute "h2" is added to ndata; its value is a numeric matrix whose entries are all 3, with num_nodes rows and as many columns as the number of feature columns selected by the XGBoost model in step 2; the attribute "h2" holds the features of all nodes in graph G;
edata already holds the content of the edge_attr parameter once the graph has been created with networkx and the DGL library; specifically, that content produces the edata attributes "h1" and "L", where "h1" holds the features of all edges in the graph and corresponds to the 24 feature columns selected by the XGBoost model in step 2, and "L" holds the labels of all edges in the graph, that is, of all flows, and corresponds to the Label feature;
edata_schemes already contains the attributes "S1" and "label" once the graph has been created with networkx and the DGL library, where "S1" is the shape of the feature stored on any single edge, namely (1, 24), and "label" is the format of the label of each edge; because each label is a single scalar value, "label" has no matrix shape like "S1";
ndata_schemes is initially empty; after the attribute "h2" is added to ndata, a new attribute "S2" appears in ndata_schemes, giving the shape of the feature stored on any single node, namely (1, 24). At this point graph G has been expanded. The number of edges in the graph equals the total number of flows in the initial data set; the number of nodes equals the number of distinct (Src IP, Src Port) and (Dst IP, Dst Port) 2-tuples across all flows.
Step 4, a malicious traffic prediction model EGRAPHSAGE is constructed, comprising two SAGE layers, a Dropout layer and an MLPPredictor layer connected in sequence. The node features and edge features of the processed graph are input to the SAGE layers; after the two SAGE layers, the Dropout layer reduces overfitting and improves the generalization ability of the model. The prediction is finally performed by the MLPPredictor layer.
Step 5, the malicious traffic prediction model EGRAPHSAGE constructed in step 4 is trained with the training set to obtain a classification model. Specifically, the training graph obtained in step 3, together with all node features and edge features, is input into the constructed malicious traffic prediction model EGRAPHSAGE for training, where the node features and edge features correspond respectively to the value of the attribute "h2" in ndata and the value of the attribute "h1" in edata from step 3.4;
The training process is as follows:
Step 5.1, the SAGE layer traverses all nodes of the training graph, samples the neighbor nodes of any node v in the training graph, aggregates the features of the neighbor nodes of every layer using an aggregation function, and uses the aggregated features of all neighbor nodes of a node as the new feature of that node;
The neighbor nodes of any node v are defined as the first-layer neighbor nodes of v, the neighbor nodes of the first-layer neighbor nodes as the second-layer neighbor nodes of v, and so on; any node v thus has multiple layers of neighbor nodes, and node v is assumed to have k layers of neighbor nodes. Sampling the neighbor nodes of any node v in the training graph and aggregating their features with an aggregation function then specifically comprises:
Calculating the feature $h_v^k$ of any node v after aggregating the information of its k-layer neighbor nodes:

$$h_v^k = \sigma\left(W^k \cdot \mathrm{CONCAT}\left(h_v^{k-1},\ h_{N(v)}^k\right)\right) \qquad (1)$$

where CONCAT is the concatenation function, $W^k$ is the weight matrix of layer k, and $\sigma$ is the activation function; $h_v^{k-1}$ is the feature of node v after aggregating the information of its (k-1)-layer neighbor nodes;

where $h_{N(v)}^k$ is calculated according to the following formula:

$$h_{N(v)}^k = \mathrm{AGG}_k\left(\left\{e_{uv}^{k-1},\ \forall u \in N(v),\ uv \in \varepsilon\right\}\right) \qquad (2)$$

where $\varepsilon$ is the set of all edges in the training graph; $e_{uv}^{k-1}$ is the feature of edge uv in the sampled neighborhood N(v) of node v at layer k-1, edge uv being the edge between node v and any node u in that sampled neighborhood; AGG is the aggregation function; N(v) is the sampled set of neighbor nodes of node v; and the set $\{\forall u \in N(v),\ uv \in \varepsilon\}$ represents the edges in the neighborhood N(v);

Traversing all edges in the training graph to obtain the embedded feature $z_{uv}^k$ of any edge uv, where $z_{uv}^k$ is the concatenation of the features of node u and node v after aggregating the information of their k-layer neighbor nodes, specifically:

$$z_{uv}^k = \mathrm{CONCAT}\left(h_u^k,\ h_v^k\right) \qquad (3)$$

where $h_u^k$ and $h_v^k$ are calculated according to formula (1) above;
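The aggregation and concatenation of step 5.1 can be sketched in plain Python. This is an illustration under stated assumptions rather than the method of the invention: the aggregation function is assumed to be the element-wise mean, the activation is assumed to be ReLU, the whole neighborhood is used instead of a sampled one, and all function names are hypothetical.

```python
def mean_aggregate(edge_feats):
    """Element-wise mean of a non-empty list of equal-length edge features."""
    n = len(edge_feats)
    return [sum(f[i] for f in edge_feats) / n for i in range(len(edge_feats[0]))]


def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def sage_layer(node_feats, edges, edge_feats, weights):
    """One layer: aggregate incident edge features, concatenate, transform."""
    incident = {v: [] for v in range(len(node_feats))}
    for (u, v), feat in zip(edges, edge_feats):
        incident[u].append(feat)  # the graph is undirected, so the edge
        incident[v].append(feat)  # contributes to both of its endpoints
    edge_dim = len(edge_feats[0])
    new_feats = []
    for v, h_v in enumerate(node_feats):
        h_nv = mean_aggregate(incident[v]) if incident[v] else [0.0] * edge_dim
        combined = matvec(weights, h_v + h_nv)  # list "+" is concatenation
        new_feats.append([max(0.0, c) for c in combined])  # ReLU (assumed)
    return new_feats


def edge_embeddings(node_feats, edges):
    """Embed each edge as the concatenation of its two endpoint features."""
    return [node_feats[u] + node_feats[v] for u, v in edges]
```

Stacking two calls to sage_layer and then calling edge_embeddings mirrors the two SAGE layers followed by the edge embedding described above.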
Step 5.2, the training graph with the new node features and edge features embedded in step 5.1 is input to the Dropout layer, which discards elements of the edge features in the training graph. Specifically:
for each element of the feature of each edge in the training graph, the Dropout layer randomly sets the element to zero with the specified discard probability, giving the updated edge feature, which is calculated according to formula (4),
where p is the set discard proportion, s is the shape of the edge feature, and Binominal is a binomial distribution function;
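A minimal sketch of the element-wise masking described in step 5.2, assuming that each element of an edge feature is zeroed independently with probability p. The function name dropout and the explicit rng argument are hypothetical, and no rescaling of the surviving elements is applied because the text does not mention any.

```python
import random


def dropout(features, p, rng):
    """Zero each element independently with probability p.

    Returns the masked features together with the binary mask that was used.
    """
    mask = [0.0 if rng.random() < p else 1.0 for _ in features]
    return [f * m for f, m in zip(features, mask)], mask


# Example: mask a 24-element edge feature with a discard proportion of 0.3.
masked, mask = dropout([1.0] * 24, 0.3, random.Random(0))
```

Deep learning frameworks such as PyTorch additionally scale the surviving elements by 1/(1-p) during training; whether formula (4) does the same is not stated here.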
Step 5.3, the training graph with the edge features updated by the Dropout layer is input into the MLPPredictor layer, and classification prediction is performed using formula (5) to obtain a prediction result A,
where B is a bias vector, exp is the exponential function, and sum is the summation function; the prediction result A is finally obtained;
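The exp and sum terms named in step 5.3 suggest a softmax over class scores. The sketch below assumes that the MLPPredictor layer applies a single affine map followed by a softmax; the weight matrix W, the function names and the input values are hypothetical, and only the bias vector B, exp and sum come from the text.

```python
import math


def softmax(scores):
    """Normalize scores to probabilities: exp(score) / sum(exp(scores))."""
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_predict(edge_feat, W, B):
    """Return class probabilities for one edge feature (assumed affine map)."""
    scores = [sum(w * x for w, x in zip(row, edge_feat)) + b
              for row, b in zip(W, B)]
    return softmax(scores)
```

The predicted traffic class of an edge is then the index of the largest probability.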
Step 5.4, the cross-entropy loss between the prediction result A and the true labels is calculated to obtain a loss value, which is optimized using the Adam optimizer with a learning rate of 0.001; after the iterations finish, the final model, namely the trained malicious traffic prediction model EGRAPHSAGE, is obtained.
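The loss and update of step 5.4 can be illustrated for a single scalar parameter. This is a sketch, not the training loop of the invention: only the learning rate 0.001 comes from the text, while beta1, beta2 and eps are the commonly used Adam defaults and are assumptions here, as are the function names.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy loss of one prediction against its true class index."""
    return -math.log(probs[label])


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate.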
Step 6, the test set is input into the classification model and the performance of the classification model is evaluated. Specifically, the test graph obtained in step 3 is classified using the malicious traffic prediction model EGRAPHSAGE trained in step 5, and the classification model is evaluated using the confusion matrix, the accuracy, the recall and the F1-score.
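The evaluation quantities of step 6 can be computed from true and predicted class indices as sketched below. The function names are hypothetical, and the per-class ratio of correct predictions to all predictions of that class is reported here as precision.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_metrics(cm):
    """Precision, recall, F1 and sample count for every class."""
    metrics = []
    for c in range(len(cm)):
        tp = cm[c][c]
        predicted = sum(row[c] for row in cm)  # column sum
        actual = sum(cm[c])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall,
                        "f1": f1, "support": actual})
    return metrics
```

For a five-class problem such as the one in this method, num_classes would be 5 and each row of the matrix would correspond to one traffic type.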
The CICFlowmeter tool adopted in this embodiment is a network traffic analysis tool used to monitor and analyze traffic data in network communication; it can extract and analyze various features of network traffic. This embodiment uses the Java version.
In this embodiment, a grid search method is used to tune the hyperparameters of the malicious traffic prediction model EGRAPHSAGE: the learning rate, the proportion of neurons discarded by the Dropout layer, the activation function, and the optimizer used to minimize the loss function.
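Grid search simply evaluates every combination of candidate values and keeps the best one. The sketch below is illustrative only: the function grid_search, the evaluate callback and the candidate values in the example grid are hypothetical and are not the values searched in this embodiment.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Return the parameter combination with the highest score, and that score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. validation F1 of a trained model
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values for the four hyperparameters named above.
example_grid = {
    "learning_rate": [0.01, 0.001],
    "dropout": [0.2, 0.5],
    "activation": ["relu", "tanh"],
    "optimizer": ["adam", "sgd"],
}
```

With the example grid, 2 x 2 x 2 x 2 = 16 combinations would each require one training run inside evaluate.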
Example 2
On the basis of embodiment 1, the Dropout layer randomly discards some neurons during training, which reduces the risk of overfitting. The discard proportion is set between 0.2 and 0.5, and the Dropout layer is turned off after model training finishes to ensure the accuracy of the model in testing or practical application.
Binominal is a binomial distribution function. It works by sampling from the binomial distribution to generate a binary mask with the same shape as the edge features, which is used to randomly discard neurons. Each element of the generated mask is independently 1 with probability p and 0 with probability (1-p). The outputs of some neurons are then set to 0 by multiplication with the mask, achieving random deactivation.
Example 3
On the basis of embodiment 2, in step 3.2, 70% of the data set samples are used as the training set and 30% of the data set samples are used as the test set.
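A 70/30 division can be sketched as follows. Shuffling before the cut and the fixed seed are assumptions made for the illustration; the text does not state how the samples are ordered or whether the split is stratified by traffic type.

```python
import random


def split_dataset(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples and cut them into a training set and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Every sample ends up in exactly one of the two sets, so the training graph and the test graph are built from disjoint flows.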
Example 4
On the basis of embodiment 3, this embodiment uses normal network data and four types of abnormal network data (DDoS abnormal traffic, location leakage abnormal traffic, SMC replay abnormal traffic, and UDP port scan abnormal traffic) captured in a simulated 5G network environment. The total number of flows is 50218, comprising 2924 normal flows, 25221 DDoS abnormal flows, 10832 SMC replay abnormal flows, 9969 UDP port scan abnormal flows, and 1272 location leakage flows.
Using the method of the invention, 70% of all data after feature extraction is constructed into a graph and used for training; the remaining 30% of the data is then constructed into a graph and used for testing. The test results are as follows:
Table 1 Model evaluation indices
Traffic type               Precision  Recall  F1 score  Number of samples
UDP port scanning          1.0000     1.0000  1.0000    5982
Location leakage attack    0.9948     1.0000  0.9974    764
Denial of service attack   1.0000     0.9974  0.9987    15132
Normal traffic             0.9960     0.9818  0.9888    1754
SMC replay attack          0.9909     1.0000  0.9954    6500
Table 1 evaluates the malicious traffic classification results for the five types of data. As the results in Table 1 show, the precision, recall and F1 score of the method of the invention are all close to 1. Fig. 2 shows the confusion matrix corresponding to the classification results of the model in this embodiment; as can be seen from Fig. 2, the accuracy of the model is high, which demonstrates that the invention is effective at detecting malicious traffic.
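As a consistency check on Table 1, the F1 score of each class can be recomputed as the harmonic mean of the first two metric columns. The sketch assumes that the first metric column is the per-class precision; the dictionary name table_1 and the shortened class labels are hypothetical.

```python
# (first metric column, recall, reported F1) for each class in Table 1
table_1 = {
    "UDP port scanning": (1.0000, 1.0000, 1.0000),
    "Location leakage": (0.9948, 1.0000, 0.9974),
    "Denial of service": (1.0000, 0.9974, 0.9987),
    "Normal": (0.9960, 0.9818, 0.9888),
    "SMC replay": (0.9909, 1.0000, 0.9954),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: round(f1_score(p, r), 4)
              for name, (p, r, _) in table_1.items()}
```

Under that assumption the recomputed values agree with the reported F1 scores to four decimal places.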

Claims (10)

1. A mobile network malicious traffic identification method based on a graph neural network, characterized by comprising the following steps: step 1, collecting traffic data in a 5G network, then extracting features from the collected traffic data with the CICFlowmeter tool, each traffic record comprising 84 feature columns, manually labeling the extracted label feature, and taking the labeled data as the original data; step 2, preprocessing the original data, discarding the four features Src IP, Dst IP, Src Port and Dst Port from each preprocessed original record, then training an XGBoost model on the original data with the four features discarded to obtain the importance scores of all feature columns, and selecting the 24 features with the highest scores; step 3, combining the top-24-scoring features extracted in step 2 with the four discarded features Src IP, Dst IP, Src Port and Dst Port to obtain a new data set, and dividing the data set into a training set and a test set; step 4, constructing a malicious traffic prediction model EGRAPHSAGE; step 5, training the malicious traffic prediction model EGRAPHSAGE constructed in step 4 with the training set to obtain a classification model; and step 6, inputting the test set into the classification model and evaluating the performance of the classification model.
2. The method for identifying malicious traffic of a mobile network based on a graph neural network according to claim 1, wherein collecting traffic data in the 5G network in step 1 specifically includes: capturing, through the tcpdump command in a virtual machine, the traffic data of normal 5G registration and Internet access and of four deployed abnormal scenarios, the traffic data being stored as pcap files;
The manual labeling in step 1 refers to manually labeling the traffic type, the traffic types being: normal background traffic data and abnormal sample traffic data;
the manually labeled traffic types are obtained by encoding the label feature with a label encoding method, and the encoding result indicates the specific traffic type.
3. The method for identifying malicious traffic of a mobile network based on a graph neural network according to claim 1, wherein the preprocessing in the step 2 specifically comprises:
replacing all inf values and nan values in the features of the original data with 0; and mapping all IP addresses in the features of the original data to randomly allocated IP addresses.
4. The method for identifying malicious traffic of a mobile network based on a graph neural network according to claim 1, wherein in step 2, the four features Src IP, Dst IP, Src Port and Dst Port are discarded from each preprocessed original record, then the "Flow ID" feature and the "Timestamp" feature are discarded, then the 77 feature columns other than the label feature are used as the feature input of the XGBoost model and the label feature is used as the label input of the XGBoost model, the importance scores of all feature columns are obtained through the feature_importances_ attribute of the XGBoost model, and the 24 highest-scoring feature columns among the 77 feature columns other than the label feature are obtained by sorting.
5. The method for identifying malicious traffic in a mobile network based on a graph neural network according to claim 1, wherein the step 3 is specifically:
Step 3.1, constructing the four features Src IP, Dst IP, Src Port and Dst Port of each original record into two tuples, (Src IP, Src Port) and (Dst IP, Dst Port), and combining the top 24 features of each original record extracted in step 2 with the two tuples, so that each traffic record has 26 feature columns;
Step 3.2, taking all traffic data processed in step 3.1 as data set samples, and dividing the data set samples into a training set and a test set;
Step 3.3, constructing an undirected graph from all traffic data in the training set and in the test set respectively by the from_pandas_edgelist method in the networkx library, then constructing a graph G from each undirected graph by the from_networkx method in the DGL library, wherein the graph G obtained from all traffic data in the training set is the training graph, and the graph G obtained from all traffic data in the test set is the test graph;
Step 3.4, expanding the elements of the graphs G corresponding to the training graph and the test graph created in step 3.3 respectively to obtain expanded graphs G, namely the expanded training graph and the expanded test graph.
6. The method for identifying malicious traffic of a mobile network based on a graph neural network according to claim 5, wherein in step 3.3, all traffic data in the training set and in the test set are each constructed into an undirected graph by the from_pandas_edgelist method in the networkx library, specifically:
The df attribute is the tabular data converted from the 26 feature columns extracted from the traffic data of the training set or the test set; it represents the data to be converted into a graph, in DataFrame format; the tabular data has 26 feature columns, and its number of rows equals the number of traffic records in the training set or the test set, one record per row;
The source attribute is the (Src IP, Src Port) tuple in the tabular data, and is the valid column name in df of the source node in the constructed undirected graph;
the target attribute is the (Dst IP, Dst Port) tuple in the tabular data, and is the valid column name in df of the target node in the constructed undirected graph;
The edge_attr attribute is the 24 feature columns extracted in step 2 together with the label feature, and gives the features of the edge between the corresponding source node and target node in the constructed undirected graph, namely the edge features;
the create_using attribute is MultiGraph();
setting the above attributes yields the undirected graph;
the graph G created in step 3.3 includes the number of nodes num_nodes, the number of edges num_edges, the format ndata_schemas of all node attributes, the format edata_schemas of all edge attributes, the object ndata containing all nodes, and the object edata containing all edges.
7. The method for identifying malicious traffic in a mobile network based on a graph neural network according to claim 6, wherein the step 3.4 specifically comprises:
Adding an attribute "h2" to ndata, wherein the value of the attribute "h2" is a numerical matrix whose entries are all 3, the number of rows of the matrix is num_nodes, the number of columns of the matrix equals the number of feature columns extracted by the XGBoost model in step 2, and the attribute "h2" holds the features of all nodes in the graph G;
edata already received the edge_attr attribute when the graph was created with the networkx and DGL libraries; specifically, the content of the edge_attr attribute generates the attributes "h1" and "L" of edata, where "h1" represents the features of all edges in the graph and corresponds to the 24 feature columns in edge_attr that were extracted by the XGBoost model in step 2, and "L" represents the labels of all edges in the graph, that is, of all traffic flows, and corresponds to the label feature in edge_attr;
edata_schemas already contains the attributes "S1" and "label", generated when the graph was created with the networkx and DGL libraries, where "S1" represents the shape of the feature stored on any single edge in the graph, namely (1, 24), and "label" represents the format of the label corresponding to each edge;
ndata_schemas is initially empty, and after the attribute "h2" is added to ndata, a new attribute "S2" is generated in ndata_schemas, which represents the shape of the feature stored on any single node in the graph, namely (1, 24), at which point the graph G has been constructed; the number of edges in the graph corresponds to the total number of traffic flows in the initial dataset; the number of nodes corresponds to the number of distinct (Src IP, Src Port) and (Dst IP, Dst Port) tuples across all traffic.
8. The method for identifying malicious traffic in a mobile network based on a graph neural network according to claim 7, wherein the malicious traffic prediction model EGRAPHSAGE in step 4 includes two SAGE layers, a Dropout layer and an MLPPredictor layer connected in sequence.
9. The method for identifying malicious traffic in a mobile network based on a graph neural network according to claim 8, wherein step 5 is specifically: inputting the training graph obtained in step 3, together with all node features and edge features, into the constructed malicious traffic prediction model EGRAPHSAGE for training, where the node features and edge features correspond respectively to the value of the attribute "h2" in ndata and the value of the attribute "h1" in edata from step 3.4;
The training process is as follows:
Step 5.1, the SAGE layer traverses all nodes of the training graph, samples the neighbor nodes of any node v in the training graph, aggregates the features of the neighbor nodes of every layer using an aggregation function, and uses the aggregated features of all neighbor nodes of a node as the new feature of that node;
The neighbor nodes of any node v are defined as the first-layer neighbor nodes of v, the neighbor nodes of the first-layer neighbor nodes as the second-layer neighbor nodes of v, and so on; any node v thus has multiple layers of neighbor nodes, and node v is assumed to have k layers of neighbor nodes; sampling the neighbor nodes of any node v in the training graph and aggregating their features with an aggregation function then specifically comprises:
Calculating the feature $h_v^k$ of any node v after aggregating the information of its k-layer neighbor nodes:

$$h_v^k = \sigma\left(W^k \cdot \mathrm{CONCAT}\left(h_v^{k-1},\ h_{N(v)}^k\right)\right) \qquad (1)$$

where CONCAT is the concatenation function, $W^k$ is the weight matrix of layer k, and $\sigma$ is the activation function; $h_v^{k-1}$ is the feature of node v after aggregating the information of its (k-1)-layer neighbor nodes;

where $h_{N(v)}^k$ is calculated according to the following formula:

$$h_{N(v)}^k = \mathrm{AGG}_k\left(\left\{e_{uv}^{k-1},\ \forall u \in N(v),\ uv \in \varepsilon\right\}\right) \qquad (2)$$

where $\varepsilon$ is the set of all edges in the training graph; $e_{uv}^{k-1}$ is the feature of edge uv in the sampled neighborhood N(v) of node v at layer k-1, edge uv being the edge between node v and any node u in that sampled neighborhood; AGG is the aggregation function; N(v) is the sampled set of neighbor nodes of node v; and the set $\{\forall u \in N(v),\ uv \in \varepsilon\}$ represents the edges in the neighborhood N(v);

Traversing all edges in the training graph to obtain the embedded feature $z_{uv}^k$ of any edge uv, where $z_{uv}^k$ is the concatenation of the features of node u and node v after aggregating the information of their k-layer neighbor nodes, specifically:

$$z_{uv}^k = \mathrm{CONCAT}\left(h_u^k,\ h_v^k\right) \qquad (3)$$

where $h_u^k$ and $h_v^k$ are calculated according to formula (1) above;
Step 5.2, the training graph with the new node features and edge features embedded in step 5.1 is input to the Dropout layer, which discards elements of the edge features in the training graph; specifically:
for each element of the feature of each edge in the training graph, the Dropout layer randomly sets the element to zero with the specified discard probability, giving the updated edge feature, which is calculated according to formula (4),
where p is the set discard proportion, s is the shape of the edge feature, and Binominal is a binomial distribution function;
Step 5.3, the training graph with the edge features updated by the Dropout layer is input into the MLPPredictor layer, and classification prediction is performed using formula (5) to obtain a prediction result A,
where B is a bias vector, exp is the exponential function, and sum is the summation function; the prediction result A is finally obtained;
Step 5.4, the cross-entropy loss between the prediction result A and the true labels is calculated to obtain a loss value, which is optimized using the Adam optimizer with a learning rate of 0.001; after the iterations finish, the final model, namely the trained malicious traffic prediction model EGRAPHSAGE, is obtained.
10. The method for identifying malicious traffic in a mobile network based on a graph neural network according to claim 9, wherein step 6 is specifically: the test graph obtained in step 3 is classified using the malicious traffic prediction model EGRAPHSAGE trained in step 5, and the classification model is evaluated using the confusion matrix, the accuracy, the recall and the F1-score.
CN202410088195.3A 2024-01-22 2024-01-22 Malicious traffic identification method in mobile networks based on graph neural network Pending CN117914599A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410088195.3A CN117914599A (en) 2024-01-22 2024-01-22 Malicious traffic identification method in mobile networks based on graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410088195.3A CN117914599A (en) 2024-01-22 2024-01-22 Malicious traffic identification method in mobile networks based on graph neural network

Publications (1)

Publication Number Publication Date
CN117914599A true CN117914599A (en) 2024-04-19

Family

ID=90683563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410088195.3A Pending CN117914599A (en) 2024-01-22 2024-01-22 Malicious traffic identification method in mobile networks based on graph neural network

Country Status (1)

Country Link
CN (1) CN117914599A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118174958A (en) * 2024-05-10 2024-06-11 中移(苏州)软件技术有限公司 Traffic classification method, device, electronic device, storage medium and program product
CN119135387A (en) * 2024-08-22 2024-12-13 西安理工大学 Network intrusion detection method based on graph attention mechanism combined with GCN
CN119675896A (en) * 2024-11-05 2025-03-21 广东工业大学 Network abnormal attack behavior detection method based on SDN and graph neural network


Similar Documents

Publication Publication Date Title
CN109450842B (en) Network malicious behavior recognition method based on neural network
Ye et al. A DDoS attack detection method based on SVM in software defined network
CN117914599A (en) Malicious traffic identification method in mobile networks based on graph neural network
Vlăduţu et al. Internet traffic classification based on flows' statistical properties with machine learning
CN110417729B (en) A service and application classification method and system for encrypted traffic
CN113821793B (en) Multi-stage attack scenario construction method and system based on graph convolutional neural network
CN111935063B (en) A system and method for monitoring abnormal network access behavior of terminal equipment
CN109088903A (en) A kind of exception flow of network detection method based on streaming
CN114401516A (en) 5G slice network anomaly detection method based on virtual network traffic analysis
Yang et al. Deep learning-based reverse method of binary protocol
CN115021986A (en) A method and apparatus for constructing a deployable model for identification of IoT devices
CN118694617A (en) Network data transmission monitoring system and method based on big data analysis
CN116915450A (en) Topology pruning optimization method based on multi-step network attack recognition and scene reconstruction
CN118233171A (en) A cross-domain intelligent intrusion detection method for industrial Internet
Kozik et al. Pattern extraction algorithm for NetFlow‐based botnet activities detection
CN117749477A (en) A network traffic anomaly detection method based on generative adversarial networks
CN118827211A (en) Encrypted malicious traffic detection method based on traffic interaction behavior and attention mechanism
Erdenebaatar et al. Analyzing traffic characteristics of instant messaging applications on android smartphones
CN116418565B (en) A Domain Name Detection Method Based on Attribute Heterogeneous Graph Neural Network
CN113904841A (en) Network attack detection method applied to IPv6 network environment
CN117201646A (en) An in-depth analysis method of power Internet of Things terminal messages
Tang et al. HSLF: HTTP header sequence based lsh fingerprints for application traffic classification
Kousar et al. DDoS attack detection system using Apache spark
Zou et al. An identification decision tree learning model for self-management in virtual radio access network: IDTLM
US20250106185A1 (en) Apparatus and method for intrusion detection and prevention of cyber threat intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination