CN118747340B

CN118747340B - Network data analysis system based on deep learning

Info

Publication number: CN118747340B
Application number: CN202410792325.1A
Authority: CN
Inventors: 冯洁心; 南琳芝; 白薇
Original assignee: Haojing College of Shaanxi University of Science and Technology
Current assignee: Haojing College of Shaanxi University of Science and Technology
Priority date: 2024-06-19
Filing date: 2024-06-19
Publication date: 2024-12-24
Anticipated expiration: 2044-06-19
Also published as: CN118747340A

Abstract

The present application discloses a network data analysis system based on deep learning, which relates to the field of computer network technology, including: a data flow acquisition module, which is used to acquire real-time data flow; a primary classification module, which is used to determine whether the data point to be classified is abnormal data flow according to the distance between the data point to be classified and other data points corresponding to the real-time data flow; a secondary classification module, which is used to determine whether the real-time data flow from multiple API interfaces has reached the bandwidth upper limit after the primary classification module determines that the data point to be classified is abnormal data flow, and if so, correct the distance between the data point to be classified and other data points, and determine whether the data point to be classified is normal data flow according to the corrected distance. The present application identifies normal data flow with data flow fluctuations caused by bandwidth limitations from data flow considered abnormal, thereby improving the accuracy of data flow analysis results.

Description

Network data analysis system based on deep learning

Technical Field

The application relates to the technical field of computer networks, in particular to a network data analysis system based on deep learning.

Background

Computer networks have become part of everyday life, connecting all kinds of electronic devices so that they can exchange data and information. Transmitting data over a network generates data traffic, and network attacks such as DDoS (distributed denial of service) attacks or virus infections produce obvious traffic anomalies, so identifying anomalous traffic behavior allows network abnormalities to be detected in time.

Abnormal traffic was traditionally identified manually: network experts analyzed real-time network traffic data and judged from experience whether abnormal traffic had occurred. This method relies heavily on manual experience and is unsatisfactory in both accuracy and efficiency. To address this, various automatic analysis methods have been developed, the most widely used of which are recognition methods based on deep learning. Deep learning is a machine learning method that can reach analysis conclusions more quickly and accurately through in-depth analysis of data traffic. For example, CN113378990A provides a deep-learning-based traffic data anomaly detection method that classifies data points according to their distances in the sample space and their density within a range of radii.

However, data traffic is closely tied to the network bandwidth of the computer device. A sudden increase or decrease in data traffic does not necessarily mean that the traffic is abnormal; it may instead be caused by normal traffic being constrained by bandwidth. The prior art does not take this situation into account.

Disclosure of Invention

The embodiment of the application provides a network data analysis system based on deep learning, which is used for solving the problem that the influence on data flow caused by bandwidth limitation is not considered in the prior art.

The embodiment of the application provides a network data analysis system based on deep learning, which comprises:

a data flow acquisition module, used for acquiring real-time data flow;

a primary classification module, used for establishing a feature space, representing the real-time data flow as data points to be classified in the feature space, and determining whether the data points to be classified are abnormal data flow according to the distances between the data points to be classified and other data points in the feature space;

and a secondary classification module, used for determining whether the real-time data flow from the multiple API interfaces has reached the upper bandwidth limit after the primary classification module determines that the data point to be classified is abnormal data flow and, if so, determining the weight of the real-time data flow of each API interface according to the network speeds of the multiple API interfaces at the same moment, recalculating the distance between the data point to be classified and the other data points in the feature space according to the weight, and determining whether the data point to be classified is normal data flow according to the recalculated distance.

The network data analysis system based on deep learning has the following advantages:

The system first determines whether a flow abnormality exists according to the distances between data points in the feature space. If so, it judges whether the network speed at the moment the data flow occurred had reached the upper bandwidth limit; if it had, the distance is corrected and the abnormality judgment is made again. Normal data flow whose fluctuation is caused by the bandwidth limitation is thereby screened out of the data flow initially considered abnormal, which improves the accuracy of the data flow analysis result.

Drawings

In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic diagram of a functional module of a network data analysis system based on deep learning according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Fig. 1 is a functional block diagram of a network data analysis system based on deep learning according to an embodiment of the present application. The embodiment of the application provides a network data analysis system based on deep learning, which comprises the following modules.

The data traffic acquisition module 100 is configured to acquire real-time data traffic.

The system of the present application is illustratively deployed on a computer device, which may be a desktop, notebook or tablet computer in the general sense, as well as a variety of portable smart devices. Since a computer device receives or transmits data through multiple API (application programming interface) interfaces at the same time, when acquiring real-time data traffic, the data traffic acquisition module 100 needs to monitor the condition of each API interface of the computer device, and acquire the real-time data traffic of each API interface.

In the embodiment of the present application, the real-time data traffic is multidimensional data, and after the real-time data traffic of the computer device is obtained, the data traffic obtaining module 100 represents and saves the real-time data traffic in a multidimensional array manner.

The primary classification module 110 is configured to establish a feature space, represent the real-time data traffic as data points to be classified in the feature space, and determine whether the data points to be classified are abnormal data traffic according to distances between the data points to be classified and other data points in the feature space.

Illustratively, after the data traffic acquisition module 100 acquires the real-time data traffic of each API interface, the primary classification module represents the real-time data traffic of each API interface as data points to be classified in different feature spaces, respectively.

The feature space is a multidimensional space with the same number of dimensions as the real-time data traffic. Each component of the multidimensional traffic data is mapped onto the corresponding dimension of the feature space, so that the traffic forms a data point whose coordinates give its position in the space.

In an embodiment of the present application, a plurality of data points, referred to as the other data points, already exist in the feature space before the real-time data traffic is input. These other data points represent data traffic that has been determined in advance to be normal. Because they share the same attribute, namely being normal, they are gathered within a certain range of the feature space, and the distance between any two of them is not large. If the distance between a newly added data point to be classified and the other data points is relatively small, the data point to be classified can therefore be classified as a normal data point.

Further, the primary classification module calculates the distances between the data point to be classified and a plurality of other data points, averages these distances, compares the average distance with a distance threshold, and determines that the data point to be classified is abnormal data flow when the average distance is greater than the distance threshold. Specifically, the distance threshold is the average of the distances between all the other data points. In practice, the distance between the data point to be classified and each other data point can be calculated in turn; once as many distance values as there are other data points have been obtained, their average is calculated and compared with the distance threshold. This calculation takes the positions of all other data points into account. When some of the other data points lie far from the center, including them can make the average distance less representative, so the application may instead select only a part of the other data points when calculating the average distance.
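The comparison described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the source does not name a distance metric, so Euclidean distance is assumed here, and the function names and sample points are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points (the metric is an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_threshold(other_points):
    """Average of the distances between all pairs of the other (normal) data points."""
    pairs = list(combinations(other_points, 2))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


def average_distance(point, other_points):
    """Average distance from the point to be classified to every other data point."""
    return sum(euclidean(point, q) for q in other_points) / len(other_points)


def is_abnormal(point, other_points):
    """Abnormal when the average distance exceeds the distance threshold."""
    return average_distance(point, other_points) > distance_threshold(other_points)


# Hypothetical normal traffic, clustered on the unit square.
other_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

print(is_abnormal((0.5, 0.5), other_points))  # point inside the cluster
print(is_abnormal((5.0, 5.0), other_points))  # point far from the cluster
```

With these sample points the threshold is about 1.14, so the point inside the cluster is kept as normal and the distant point is flagged as abnormal.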

Further, to determine how many other data points should be selected to obtain a more accurate average distance, the primary classification module performs a classification test on known data points in advance, and takes the number of other data points that yields the highest classification accuracy as the number to use when calculating the distance between the data point to be classified and the other data points. Specifically, data traffic prepared in advance may be divided into other data traffic and known data traffic in a certain proportion, for example 7:3. The other data traffic and the known data traffic are represented in the feature space as other data points and known data points respectively, and both have the same classification. The other data traffic is first input into the feature space to form the other data points; the known data traffic is then input one item at a time and the corresponding known data point is classified. When all known data points have been classified, the classification accuracy is calculated. The classification operation is repeated with a different number of other data points each time, which yields a classification accuracy for each candidate number, and the number with the highest classification accuracy is used when subsequently classifying data points to be classified.
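The selection test can be sketched as below. Several details are assumptions, since the source does not specify them: the selected subset is taken to be the nearest other data points, the metric is Euclidean, and the one-dimensional sample data and all function names are hypothetical. Because the known data points share the normal classification of the other data points, accuracy here is simply the share of known points judged normal.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def split_other_and_known(points, other_parts=7, known_parts=3):
    """Split prepared traffic into other points and known points, e.g. 7:3."""
    cut = len(points) * other_parts // (other_parts + known_parts)
    return points[:cut], points[cut:]


def judged_normal(point, other_points, count, threshold):
    """Average over the `count` nearest other points (nearest-first is an assumption)."""
    nearest = sorted(euclidean(point, q) for q in other_points)[:count]
    return sum(nearest) / len(nearest) <= threshold


def best_point_count(other_points, known_points, candidate_counts):
    """Return the candidate count with the highest accuracy, plus all accuracies."""
    pairs = list(combinations(other_points, 2))
    threshold = sum(euclidean(p, q) for p, q in pairs) / len(pairs)
    accuracy = {}
    for count in candidate_counts:
        correct = sum(
            judged_normal(p, other_points, count, threshold) for p in known_points
        )
        accuracy[count] = correct / len(known_points)
    # Ties resolve to the first candidate listed.
    return max(accuracy, key=accuracy.get), accuracy


# Hypothetical one-dimensional traffic features, all known to be normal.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,),
          (6.5,), (3.0,), (7.0,)]
other_points, known_points = split_other_and_known(points)
best, accuracy = best_point_count(other_points, known_points, [3, 5, 7])
print(best, accuracy)
```

In this toy data set, using all seven other points misjudges the known points near the edge of the cluster, while averaging over the three nearest points classifies every known point correctly, so three is selected.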

The secondary classification module 120 is configured to determine whether the real-time data traffic from the plurality of API interfaces reaches the bandwidth upper limit after the primary classification module determines that the data point to be classified is the abnormal data traffic, if so, determine weights of the real-time data traffic of the plurality of API interfaces according to network speeds of the plurality of API interfaces at the same time, recalculate distances between the data point to be classified and other data points in the feature space according to the weights, and determine whether the data point to be classified is the normal data traffic according to the recalculated distances.

Illustratively, the network accessed by the computer device is subject to an upper bandwidth limit, that is, the network speed cannot exceed that limit. For example, if the network accessed by the computer device has an upper bandwidth limit of 100M, the sum of the network speeds of all API interfaces in the computer device can reach at most 100M. Consequently, if the computer device is connected through multiple API interfaces at the same time, these API interfaces share the bandwidth. When a new API interface suddenly connects, or when the data transmission of some API interface finishes, the network speed of the other API interfaces suddenly drops or rises, which shows up in the data traffic as a sudden decrease or increase for some API interfaces within a certain period of time. Such an abrupt change is not necessarily due to a network attack; it may be due to the upper bandwidth limit. Therefore, when the primary classification module determines that the data point to be classified is abnormal, the secondary classification module judges it again, so as to reduce misjudgment of the data traffic as far as possible.

In an embodiment of the application, the secondary classification module detects the upper bandwidth limit of the computer device before determining whether the real-time data traffic from the plurality of API interfaces has reached it. Specifically, the secondary classification module first interrupts data transmission on all API interfaces of the computer device, then establishes an API interface pointing to a designated link and downloads a data packet of known size through it. After the download completes, the ratio of the packet size to the download duration is calculated, from which the most suitable upper bandwidth limit can be obtained. It should be appreciated that the upper bandwidth limit typically takes one of a few fixed values, such as 100M, 300M or 500M, so after the ratio is calculated, the fixed value that is closest to and greater than the ratio may be used as the upper bandwidth limit.
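The mapping from a measured download rate to a fixed bandwidth tier can be sketched as follows. The download itself is not performed here; the packet size and duration are passed in as already-measured values. The units (megabits and seconds), the function names and the error handling are assumptions made for illustration.

```python
def measured_rate(packet_size_megabits, download_seconds):
    """Ratio of the known packet size to the download duration."""
    if download_seconds <= 0:
        raise ValueError("download duration must be positive")
    return packet_size_megabits / download_seconds


def bandwidth_upper_limit(rate, tiers=(100, 300, 500)):
    """Smallest fixed tier that is greater than the measured rate."""
    larger = [tier for tier in sorted(tiers) if tier > rate]
    if not larger:
        # Assumption: a rate above every known tier is treated as an error.
        raise ValueError("measured rate exceeds every known tier")
    return larger[0]


rate = measured_rate(packet_size_megabits=800, download_seconds=10)
print(rate, bandwidth_upper_limit(rate))
```

A measured rate of 80 therefore maps to the 100M tier, and a rate of 250 would map to the 300M tier.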

Further, the upper bandwidth limit is the upper limit of the network speed, and the secondary classification module determines the network speed from the data volume and the transmission time in the real-time data traffic. Because the real-time data traffic already contains the start time, the end time and the data volume, the transmission time is the difference between the end time and the start time, and the network speed during the transmission is the ratio of the data volume to the transmission time. Summing the network speeds of all API interfaces then shows whether the network speed of the computer device has reached the upper bandwidth limit at a given moment or within a given time period.
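A minimal sketch of this speed calculation is given below. The record layout, field names and units are hypothetical, the records are assumed to cover the same time window, and the `tolerance` parameter is an added assumption: a measured total rarely equals the limit exactly, so a total within a small margin of the limit is treated as having reached it.

```python
from dataclasses import dataclass


@dataclass
class TrafficRecord:
    interface: str      # hypothetical interface label
    start_time: float   # seconds
    end_time: float     # seconds
    data_volume: float  # megabits


def network_speed(record):
    """Data volume divided by the transmission time (end time minus start time)."""
    duration = record.end_time - record.start_time
    if duration <= 0:
        raise ValueError("end time must be later than start time")
    return record.data_volume / duration


def bandwidth_reached(records, upper_limit, tolerance=0.05):
    """Sum the per-interface speeds and compare the total with the upper limit."""
    total = sum(network_speed(r) for r in records)
    return total >= upper_limit * (1.0 - tolerance)


records = [
    TrafficRecord("api_a", 0.0, 10.0, 600.0),  # 60 per second
    TrafficRecord("api_b", 0.0, 10.0, 380.0),  # 38 per second
]
print(bandwidth_reached(records, upper_limit=100))
```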

Further, after determining the network speed, the secondary classification module decomposes the upper bandwidth limit according to the number of API interfaces whose network speed is not zero at the moment the real-time data traffic occurs, obtaining a theoretical speed value for each such interface. It then calculates the difference between the network speed of each API interface and its theoretical speed value, and determines the weight according to the speed difference of each API interface. Specifically, the upper bandwidth limit may be divided equally among the API interfaces whose network speed is not zero, that is, the ratio of the upper bandwidth limit to the number of such API interfaces is taken as the theoretical speed value.
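The equal decomposition can be illustrated with a short sketch. The interface labels and speed values are hypothetical, and the difference is taken as actual speed minus theoretical speed, which is a sign convention the source does not fix.

```python
def equal_theoretical_speed(upper_limit, speeds):
    """Upper limit divided by the number of interfaces with non-zero speed."""
    active = [name for name, speed in speeds.items() if speed != 0]
    if not active:
        raise ValueError("no interface has a non-zero network speed")
    return upper_limit / len(active)


def speed_differences(upper_limit, speeds):
    """Actual speed minus theoretical speed for every active interface."""
    theoretical = equal_theoretical_speed(upper_limit, speeds)
    return {
        name: speed - theoretical
        for name, speed in speeds.items()
        if speed != 0
    }


speeds = {"api_a": 70.0, "api_b": 30.0, "api_c": 0.0}  # api_c is idle
print(equal_theoretical_speed(100, speeds))
print(speed_differences(100, speeds))
```

With a limit of 100 and two active interfaces, each theoretical share is 50, so the differences are +20 and -20; the idle interface is excluded.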

However, in general, not all API interfaces have the same network speed, so the present application sets a weight factor for each API interface. The weight factors of all API interfaces sum to 1, and after the upper bandwidth limit is decomposed, the proportion of the upper bandwidth limit taken by the theoretical speed value of an API interface equals its weight factor. The weight factors can be set manually according to the actual conditions of each API interface: the greater the maximum network speed an API interface can achieve, the larger its weight factor. After the theoretical speed values are determined, the difference between the network speed of each API interface and its theoretical speed value is calculated, and the weight for correcting the distance is determined from this difference. It should be appreciated that, since it cannot be known in advance which API interfaces will have a non-zero network speed, suitable weight values within the same numerical range can be set manually for all API interfaces, whether or not their current network speed is zero. Once a specific data transmission task has finished, the API interfaces with non-zero network speed can be identified, the sum of their weight values calculated, and the ratio of each API interface's weight value to that sum used as its weight factor.
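The normalisation of manually set weight values into weight factors can be sketched as follows. The weight values, interface labels and speeds are hypothetical, and the function names are illustrative only.

```python
def weight_factors(weight_values, speeds):
    """Normalise the weight values of the active interfaces so they sum to 1."""
    active = {
        name: value
        for name, value in weight_values.items()
        if speeds.get(name, 0) != 0
    }
    total = sum(active.values())
    if total == 0:
        raise ValueError("no active interface with a positive weight value")
    return {name: value / total for name, value in active.items()}


def theoretical_speeds(upper_limit, factors):
    """Each interface's share of the upper limit equals its weight factor."""
    return {name: factor * upper_limit for name, factor in factors.items()}


# Weight values are set for every interface in advance, active or not.
weight_values = {"api_a": 3, "api_b": 1, "api_c": 4}
speeds = {"api_a": 70.0, "api_b": 30.0, "api_c": 0.0}

factors = weight_factors(weight_values, speeds)
shares = theoretical_speeds(100, factors)
differences = {name: speeds[name] - shares[name] for name in shares}
print(factors, shares, differences)
```

Here the idle interface drops out, the remaining weight values 3 and 1 become factors 0.75 and 0.25, and the theoretical speeds under a limit of 100 are 75 and 25.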

Further, whether the network speed drops because a new API interface connects or rises because some API interface releases resources, the effect on the data traffic appears as an increase in distance. The correction therefore needs to reduce the distance, so that data traffic flagged as abnormal is brought as close as possible to the normal data traffic. In the embodiment of the application, the weight is determined using a fuzzy control strategy. Specifically, the correspondence between speed difference and weight may be preset; when the speed difference falls within a certain numerical range, the weight corresponding to that range is selected and used to reduce the distance. The weight may take a value less than 1, so that the product of the weight and the distance determined by the primary classification module is taken as the recalculated distance.
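A preset correspondence of this kind might look like the sketch below. The band boundaries and weight values are purely hypothetical, as the source gives none. Two further assumptions are made: the magnitude of the speed difference is used regardless of sign, and larger differences map to smaller weights.

```python
# Hypothetical bands: (exclusive upper bound on |speed difference|, weight).
WEIGHT_BANDS = [
    (5.0, 0.9),
    (20.0, 0.7),
    (50.0, 0.5),
]
DEFAULT_WEIGHT = 0.3  # used when the difference exceeds every band


def weight_for(speed_difference):
    """Select the preset weight for the band containing the speed difference."""
    magnitude = abs(speed_difference)
    for upper_bound, weight in WEIGHT_BANDS:
        if magnitude < upper_bound:
            return weight
    return DEFAULT_WEIGHT


def recalculated_distance(distance, speed_difference):
    """Product of the selected weight and the primary module's distance."""
    return weight_for(speed_difference) * distance


print(weight_for(3.0), weight_for(-20.0), weight_for(80.0))
print(recalculated_distance(4.0, -20.0))
```

Since every weight in the table is below 1, the recalculated distance is always smaller than the original one, which matches the stated aim of reducing the distance.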

After the distance is recalculated, the recalculated distance is compared with the distance threshold. If the recalculated distance is still greater than the distance threshold, the data flow corresponding to the data point to be classified is still abnormal data flow; otherwise it is determined to be normal data flow.
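Putting the two stages together, the overall decision can be sketched as a single function. It takes already-computed inputs (the average distance, the distance threshold, whether the upper bandwidth limit was reached, and the correction weight); the function name and string labels are hypothetical.

```python
def classify(average_distance, threshold, bandwidth_reached, weight):
    """Two-stage decision: primary distance test, then bandwidth-aware recheck."""
    # Primary classification: within the threshold means normal.
    if average_distance <= threshold:
        return "normal"
    # Secondary classification only applies when the limit was reached.
    if not bandwidth_reached:
        return "abnormal"
    recalculated = weight * average_distance
    return "abnormal" if recalculated > threshold else "normal"


print(classify(3.0, 2.0, bandwidth_reached=False, weight=0.5))
print(classify(3.0, 2.0, bandwidth_reached=True, weight=0.5))
print(classify(10.0, 2.0, bandwidth_reached=True, weight=0.5))
```

The second call shows the case the system targets: a point flagged by the primary test is reclassified as normal once the bandwidth limit is taken into account, while the third stays abnormal because even the reduced distance exceeds the threshold.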

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A network data analysis system based on deep learning, characterized by comprising:

A data flow acquisition module, used to acquire real-time data flow;

A primary classification module, used to establish a feature space, represent the real-time data flow as a data point to be classified in the feature space, and determine whether the data point to be classified is abnormal data flow according to the distance between the data point to be classified and other data points in the feature space;

A secondary classification module, used to determine whether the real-time data traffic from multiple API interfaces has reached the bandwidth upper limit after the primary classification module determines that the data point to be classified is abnormal data traffic, and if so, determine the weight of the real-time data traffic of each API interface according to the network speed of the multiple API interfaces at the same time, recalculate the distance between the data point to be classified and other data points in the feature space according to the weight, and determine whether the data point to be classified is normal data traffic according to the recalculated distance;

The data flow acquisition module monitors the status of each API interface of the computer device, obtains the real-time data flow of each API interface, and then the primary classification module represents the real-time data flow of each API interface as different data points to be classified in the feature space;

The primary classification module calculates the distance between the data point to be classified and the other data points, takes the average of the distances, compares the average distance with a distance threshold, and determines that the data point to be classified is abnormal data flow when the average distance is greater than the distance threshold;

After determining the network speed, the secondary classification module decomposes the bandwidth upper limit according to the number of the API interfaces whose network speed is not zero at the time when the real-time data traffic occurs, obtains the corresponding theoretical speed value, calculates the difference between the network speed of each API interface and the theoretical speed, and determines the weight according to the speed difference of each API interface;

Each of the API interfaces is provided with a weight factor, the sum of the weight factors of all the API interfaces is 1, and after the bandwidth upper limit is decomposed, the proportion of the theoretical speed value corresponding to the API interface in the bandwidth upper limit is the corresponding weight factor.

2. A network data analysis system based on deep learning according to claim 1, characterized in that the distance threshold is the average value of the distances between all other data points.

3. A network data analysis system based on deep learning according to claim 1, characterized in that the primary classification module performs classification tests on known data points, and uses the number of other data points selected when the classification accuracy is the highest as the number of other data points when calculating the distance between the data point to be classified and the other data points.

4. A network data analysis system based on deep learning according to claim 1, characterized in that the secondary classification module detects the bandwidth upper limit of the computer device before determining whether the real-time data traffic from multiple API interfaces has reached the bandwidth upper limit.

5. A network data analysis system based on deep learning according to claim 1, characterized in that the bandwidth upper limit is the upper limit of the network speed, and the secondary classification module determines the network speed according to the data volume and transmission time in the real-time data traffic.

6. A network data analysis system based on deep learning according to claim 1, characterized in that the weight is determined based on a fuzzy control strategy.