Detailed Description
The embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The embodiments described are only some, not all, of the embodiments of the present application. All other embodiments that can be obtained by those skilled in the art based on the embodiments of the application without inventive effort are intended to be within the scope of the application.
Fig. 1 is a functional block diagram of a network data analysis system based on deep learning according to an embodiment of the present application. The embodiment of the application provides a network data analysis system based on deep learning, which comprises the following modules:
The data traffic acquisition module 100 is configured to acquire real-time data traffic.
Illustratively, the system of the present application is deployed on a computer device, which may be a desktop, notebook or tablet computer in the general sense, or any of a variety of portable smart devices. Since a computer device receives or transmits data through multiple API (application programming interface) interfaces at the same time, when acquiring real-time data traffic, the data traffic acquisition module 100 needs to monitor the status of each API interface of the computer device and acquire the real-time data traffic of each API interface.
In the embodiment of the present application, the real-time data traffic is multidimensional data. After the real-time data traffic of the computer device is obtained, the data traffic acquisition module 100 represents and saves the real-time data traffic as a multidimensional array.
The primary classification module 110 is configured to establish a feature space, represent the real-time data traffic as data points to be classified in the feature space, and determine whether the data points to be classified are abnormal data traffic according to distances between the data points to be classified and other data points in the feature space.
Illustratively, after the data traffic acquisition module 100 acquires the real-time data traffic of each API interface, the primary classification module represents the real-time data traffic of each API interface as data points to be classified in different feature spaces, respectively.
The feature space is a multidimensional space with the same number of dimensions as the real-time data traffic. Each component of the multidimensional real-time data traffic is mapped to the corresponding dimension of the feature space, so that the data traffic forms a data point whose position is given by its coordinates.
In an embodiment of the present application, a plurality of data points, i.e., the other data points, already exist in the feature space before the real-time data traffic is input into the feature space. These other data points are data traffic that has been determined to be normal in advance. Because they share the same attribute, namely the normal attribute, they are gathered within a certain range of the feature space, and the distance between any two of them is not very large. If the distance between the newly added data point to be classified and any other data point is relatively small, the data point to be classified can be classified as a normal data point.
Further, the primary classification module calculates the distances between the data point to be classified and a plurality of other data points, averages these distances, compares the average distance with a distance threshold, and determines that the data point to be classified is abnormal data traffic when the average distance is greater than the distance threshold. Specifically, the distance threshold is the average of the distances between all other data points. In practical application, the distance between the data point to be classified and each other data point can be calculated in sequence; once as many distance values as there are other data points have been obtained, their average is calculated and compared with the distance threshold. This calculation method takes the positions of all other data points into account. However, when some of the other data points lie far from the center of the cluster, including them may make the average distance less representative, so the application may instead select only a part of the other data points to calculate the average distance.
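As a minimal sketch of the primary classification described above, the decision could be written in Python as follows. It assumes Euclidean distance as the metric (the description does not name one) and assumes that selecting "a part of the other data points" means taking the k nearest ones. The function names and the parameter `k` are illustrative and do not appear in the application.

```python
import math
from itertools import combinations


def distance_threshold(normal_points):
    """Mean of the pairwise distances between all known-normal points."""
    pairs = list(combinations(normal_points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def average_distance(point, normal_points, k=None):
    """Mean distance from `point` to the normal points.

    With k=None every normal point is used. Otherwise only the k nearest
    are used (an assumption about how the subset is chosen).
    """
    dists = sorted(math.dist(point, q) for q in normal_points)
    if k is not None:
        dists = dists[:k]
    return sum(dists) / len(dists)


def is_abnormal(point, normal_points, threshold, k=None):
    """A point is abnormal when its average distance exceeds the threshold."""
    return average_distance(point, normal_points, k) > threshold
```

For example, with four normal points at the corners of a unit square, a point at the center of the square falls below the threshold, while a point at (5, 5) exceeds it and is flagged as abnormal.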
Further, to determine how many other data points should be selected to obtain a more accurate average distance, the primary classification module performs a classification test on known data points in advance, and uses the number of other data points that yields the highest classification accuracy as the number of other data points used when calculating the distance between the data point to be classified and the other data points. Specifically, the data traffic prepared in advance may be divided into other data traffic and known data traffic according to a certain proportion, for example 7:3. The other data traffic and the known data traffic are represented as other data points and known data points in the feature space, and the other data points and the known data points have the same classification. The other data traffic is first input into the feature space to form the other data points; then, each time one item of known data traffic is input, the corresponding known data point is classified, and once all known data points have been classified, the current classification accuracy is calculated. The classification operation is then repeated with a different number of other data points selected in each repetition, which yields a classification accuracy for each candidate number. The number of other data points with the highest classification accuracy is used when subsequently classifying data points to be classified.
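The selection procedure above might be sketched as follows. This is an illustration under assumptions rather than the application's implementation: Euclidean distance is assumed, the selected subset is assumed to be the k nearest other data points, and the names `split_7_3`, `accuracy_for_k` and `select_k` are hypothetical. Because the known data points share the normal classification, a known point counts as correctly classified when it is not flagged as abnormal.

```python
import math


def split_7_3(samples):
    """Split prepared traffic 7:3 into other (reference) and known points."""
    cut = len(samples) * 7 // 10
    return samples[:cut], samples[cut:]


def mean_knn_distance(point, reference_points, k):
    """Mean distance from `point` to its k nearest reference points."""
    dists = sorted(math.dist(point, q) for q in reference_points)[:k]
    return sum(dists) / len(dists)


def accuracy_for_k(reference_points, known_points, threshold, k):
    """Fraction of known (normal) points that are classified as normal."""
    correct = sum(
        1
        for p in known_points
        if mean_knn_distance(p, reference_points, k) <= threshold
    )
    return correct / len(known_points)


def select_k(reference_points, known_points, threshold, candidates):
    """Return (k, accuracy) for the candidate with the highest accuracy.

    Ties are resolved in favour of the earliest candidate.
    """
    best_k, best_acc = None, -1.0
    for k in candidates:
        acc = accuracy_for_k(reference_points, known_points, threshold, k)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```

In this sketch a single distant reference point can push the all-points average above the threshold, which is the situation in which a smaller number of other data points gives the higher accuracy.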
The secondary classification module 120 is configured to determine, after the primary classification module determines that the data point to be classified is abnormal data traffic, whether the real-time data traffic from the plurality of API interfaces reaches the bandwidth upper limit. If so, the secondary classification module determines weights for the real-time data traffic of the plurality of API interfaces according to the network speeds of the plurality of API interfaces at the same moment, recalculates the distances between the data point to be classified and the other data points in the feature space according to the weights, and determines whether the data point to be classified is normal data traffic according to the recalculated distances.
Illustratively, the network accessed by the computer device is subject to a bandwidth upper limit, that is, the network speed of the computer device cannot exceed the bandwidth upper limit. For example, if the network accessed by the computer device has a bandwidth upper limit of 100M, the sum of the network speeds of all API interfaces in the computer device can reach at most 100M and cannot exceed it. Consequently, if the computer device is connected to multiple API interfaces at the same time, these API interfaces share the bandwidth. When a new API interface is suddenly accessed, or when the data transmission of a certain API interface finishes, the network speed of the other API interfaces suddenly decreases or increases, which appears in the data traffic as a sudden decrease or increase of the data traffic of some API interfaces within a certain period of time. Such an abrupt change is not necessarily due to a network attack; it may be due to the bandwidth upper limit. Therefore, when the primary classification module determines that the data point to be classified is abnormal, the secondary classification module is used to judge again, so that misjudgment of the data traffic is reduced as much as possible.
In an embodiment of the application, the secondary classification module detects the bandwidth upper limit of the computer device before determining whether the real-time data traffic from the plurality of API interfaces reaches the bandwidth upper limit. Specifically, when detecting the bandwidth upper limit, the secondary classification module first interrupts the data transmission of all API interfaces of the computer device, then establishes an API interface pointing to a designated link and downloads a data packet of known size through that API interface. After the download is completed, it calculates the ratio of the size of the data packet to the download duration, from which the most suitable bandwidth upper limit can be obtained. It should be appreciated that the bandwidth upper limit typically takes one of a few fixed values, such as 100M, 300M, 500M, etc., so after calculating the ratio, the fixed value that is greater than the ratio and closest to it may be used as the bandwidth upper limit.
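The final step of this detection, mapping the measured ratio to a fixed tier, could look like the sketch below. The download itself is omitted. The function name, the default tier tuple (taken from the example values 100M, 300M, 500M) and the choice to return `None` when the measurement exceeds every tier are assumptions, as is the requirement that the packet size and the tiers use the same unit.

```python
def estimate_bandwidth_limit(packet_size, duration_s, tiers=(100, 300, 500)):
    """Pick the smallest tier that is greater than the measured speed.

    `packet_size` must be expressed in the unit the tiers use per second.
    Returns None when the measured speed is not below any tier.
    """
    if duration_s <= 0:
        raise ValueError("download duration must be positive")
    measured = packet_size / duration_s
    larger = [t for t in sorted(tiers) if t > measured]
    return larger[0] if larger else None
```

For instance, a packet of size 920 downloaded in 10 seconds gives a ratio of 92, so the 100 tier is selected; a ratio of 250 selects the 300 tier.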
Further, the bandwidth upper limit is the upper limit of the network speed, and the secondary classification module determines the network speed according to the data amount and the transmission time in the real-time data traffic. Because the real-time data traffic already contains the start time, the end time and the data amount, the difference between the end time and the start time is the transmission time of the data, and the network speed during the transmission can be obtained by calculating the ratio of the data amount to the transmission time. Summing the network speeds of all API interfaces then shows whether the network speed of the computer device has reached the bandwidth upper limit at a given moment or within a given time period.
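The speed calculation and the check against the bandwidth upper limit can be sketched as follows. The record layout (`TrafficRecord` and its field names) is hypothetical; the description only states that the traffic contains a start time, an end time and a data amount. Treating a non-positive duration as zero speed is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class TrafficRecord:
    interface: str      # identifier of the API interface
    start_time: float   # seconds
    end_time: float     # seconds
    data_amount: float  # same unit as the bandwidth limit, per second


def network_speed(record):
    """Data amount divided by transmission time (end minus start)."""
    duration = record.end_time - record.start_time
    if duration <= 0:
        return 0.0
    return record.data_amount / duration


def reaches_limit(records, bandwidth_limit):
    """True when the summed speed of all interfaces reaches the limit."""
    total = sum(network_speed(r) for r in records)
    return total >= bandwidth_limit
```

In practice a measured sum may sit slightly below the nominal limit, so an implementation might compare against a tolerance band instead of the exact value; the description does not specify this.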
Further, after determining the network speed, the secondary classification module divides the bandwidth upper limit according to the number of API interfaces whose network speed is non-zero at the moment the real-time data traffic occurs, so as to obtain a corresponding theoretical speed value, calculates the difference between the network speed of each API interface and its theoretical speed value, and determines the weight according to the speed difference of each API interface. Specifically, the bandwidth upper limit may be divided equally among the API interfaces whose network speed is not zero, that is, the ratio of the bandwidth upper limit to the number of API interfaces whose network speed is not zero is taken as the theoretical speed value.
However, in general, not all API interfaces have the same network speed, so the present application sets a weight factor for each API interface. The sum of the weight factors of all API interfaces is 1, and after the bandwidth upper limit is divided, the proportion of the bandwidth upper limit taken by the theoretical speed value of an API interface equals its weight factor. The weight factors can be set manually according to the actual conditions of the API interfaces: the larger the maximum network speed an API interface can reach, the larger its weight factor. After the theoretical speed value is determined, the difference between the network speed of each API interface and its theoretical speed value is calculated, and the weight for correcting the distance can be determined according to this difference. It should be appreciated that, since it cannot be known in advance which API interfaces will have a non-zero network speed, appropriate weight values within the same numerical range can be set manually for all API interfaces, whether or not their current network speed is zero. After a specific data transmission task is finished, the API interfaces with non-zero network speed can be determined, the sum of their weight values is calculated, and the ratio of the weight value of each such API interface to this sum is used as its weight factor.
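The division of the bandwidth upper limit described in the two preceding paragraphs can be sketched as below, covering both the equal split and the split by normalized weight values. The function name and the dictionary-based inputs are illustrative assumptions.

```python
def speed_differences(bandwidth_limit, speeds, weight_values=None):
    """Return {interface: (weight_factor, theoretical_speed, difference)}.

    Only interfaces with non-zero speed take part. With weight_values=None
    the limit is split equally; otherwise each active interface's weight
    value is normalized by the sum over the active interfaces, so that
    the weight factors sum to 1.
    """
    active = [name for name, speed in speeds.items() if speed != 0]
    if not active:
        return {}
    if weight_values is None:
        factors = {name: 1.0 / len(active) for name in active}
    else:
        total = sum(weight_values[name] for name in active)
        factors = {name: weight_values[name] / total for name in active}
    result = {}
    for name in active:
        theoretical = bandwidth_limit * factors[name]
        result[name] = (factors[name], theoretical, speeds[name] - theoretical)
    return result
```

With a limit of 100, speeds of 70 and 30 on two active interfaces, and preset weight values of 3 and 1, the weight factors become 0.75 and 0.25, the theoretical speeds 75 and 25, and the differences -5 and 5. An idle third interface is ignored regardless of its preset weight value.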
Further, regardless of whether the network speed decreases because a new API interface is accessed or increases because a certain API interface releases resources, the effect on the data traffic is an increase in distance. The correction therefore needs to reduce the distance, so that data traffic flagged as abnormal for this reason is brought as close as possible to the normal data traffic. In the embodiment of the application, the weight is determined by adopting a fuzzy control strategy. Specifically, the correspondence between the speed difference and the weight may be preset; when the speed difference falls within a certain numerical range, the weight corresponding to that range is selected and used to reduce the distance. The weight may take a value less than 1, so that the product of the weight and the distance determined by the primary classification module is taken as the recalculated distance.
After the distance is recalculated, the recalculated distance is compared with the distance threshold. If the recalculated distance is still greater than the distance threshold, the data traffic corresponding to the data point to be classified is still abnormal data traffic; otherwise, the data point to be classified is determined to be normal data traffic.
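The correction and re-comparison might be sketched as follows. The lookup table is a hypothetical preset: the application states only that ranges of the speed difference map to weights below 1, so the range boundaries, the weight values, the use of the absolute difference and the choice that larger differences receive smaller weights are all assumptions made for illustration.

```python
# Hypothetical preset: (exclusive upper bound of |speed difference|, weight).
WEIGHT_TABLE = [
    (10.0, 0.9),
    (30.0, 0.7),
    (float("inf"), 0.5),
]


def weight_for(speed_difference, table=WEIGHT_TABLE):
    """Look up the weight for the range containing |speed_difference|."""
    magnitude = abs(speed_difference)
    for upper_bound, weight in table:
        if magnitude < upper_bound:
            return weight
    return table[-1][1]


def reclassify(average_distance, speed_difference, threshold):
    """Return (recalculated_distance, still_abnormal).

    The recalculated distance is the weight times the distance from the
    primary classification; the point stays abnormal only if it is still
    greater than the threshold.
    """
    corrected = weight_for(speed_difference) * average_distance
    return corrected, corrected > threshold
```

For example, with a threshold of 1.5, an average distance of 2.0 and a speed difference of -20, the weight 0.7 yields a recalculated distance of about 1.4, so the point is reclassified as normal. An average distance of 4.0 with a small difference of 5 is only reduced to about 3.6 and remains abnormal.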
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.