1. Introduction
Human society has entered the era of big data, driven by the rapid uptake of the mobile Internet and the digital revolution. In recent years in particular, with the widespread adoption of cloud technology, 5G, and the Internet of Things (IoT), the number of Internet users has grown, and more and more smart devices, such as smartphones and smart home devices, are connected to the Internet [
1]. Additionally, various applications and services, such as social media and online videos [
2], have become available to users. All of these factors have contributed to the tremendous growth in Internet traffic in the big data era [
3].
Network security and performance are challenged by massive amounts of network traffic. The rise in traffic volume has led to an increase in the amount of sensitive data being transmitted, which in turn has necessitated the use of security measures such as encryption technology [
4]. However, an increase in traffic also means an increase in potential attacks such as spam and malware [
5], which can cause network failure. With the widespread adoption of encryption protocols such as SSL/TLS, it has become increasingly challenging to identify malicious activities hidden within massive network traffic. Moreover, in the past couple of years, the impact of the COVID-19 pandemic has led to a significant increase in online activities, resulting in a higher frequency of VPN and Tor tunneling encryption technologies being used. The need for automated traffic analysis tools and techniques has increased in response to the complex network environment [
6].
Traffic classification, as one of the key functions of an automated network intrusion detection system, is essential for maintaining the security and stability of a network environment. Traffic classification enables network administrators to control specific network traffic by identifying its sources and destinations, which can prevent the inappropriate use of the network and the leakage of sensitive information [
7]. Furthermore, this technology can assist in identifying potential security risks and in blocking or restricting malicious traffic [
8,
9], thereby ensuring the integrity and dependability of the network.
Internet traffic identification techniques can be broadly categorized into four types: port-based, packet-payload-based, behavior-based, and flow-based [
10]. The port-based approach [
11] used to be one of the common measures of traffic identification. This method of identifying applications by looking up the corresponding port number in the list of port numbers published by the Internet Assigned Numbers Authority (IANA) is gradually losing its usefulness in scenarios of incomplete port assignments or dynamic port allocation. Packet-payload-based traffic classification techniques [
12] gather payload features from network packets and match them with existing attribute identification databases to classify the traffic. However, the payload may contain limited information, making it challenging to accurately classify all forms of traffic. Additionally, the utilization of encryption hinders the examination of network packet contents. Behavior-based network traffic identification [
13] examines device or user behavior to identify and categorize network traffic. This method offers a more in-depth understanding of network traffic than port-based or packet-payload-based identification. However, it faces challenges in managing the complexities of various endpoints and users and requires more computational power. Flow-based traffic classification techniques [
14] concentrate more on the network traffic itself. Currently, the common approach is to utilize feature engineering, which involves preprocessing and organizing raw data, extracting various features to meet specific task objectives, and utilizing specially designed algorithms for recognition. The two main steps in flow-based traffic classification techniques involve feature selection and extraction, followed by model designing and training [
15]. As encrypted traffic becomes increasingly prevalent, many approaches are now utilizing statistical features to train classifiers for classification using machine learning and deep learning. Statistical features, such as flow length, average packet length, flow start and end times, etc., are often calculated for the entire data stream [
16]. Relationships between multiple packets, such as minimum interpacket delay and packet up/down correlation properties are also considered [
16]. The design and construction of these statistical features require expertise in related fields. Deep learning approaches [
17,
18], on the other hand, use the concept of representation learning to eliminate the need for manual feature engineering. They automatically extract features that are closer to the actual traffic from the encrypted raw data, allowing for better differentiation of traffic classes. Therefore, deep-learning-based feature extraction techniques have been extensively deployed for the identification of traffic in networks.
Numerous studies have been conducted on deep-learning-based traffic identification, and the identification accuracy of these systems is satisfactory. However, as the network environment has changed so rapidly, more and more flaws in the current traffic identification systems have come to light. Currently, the biggest issue is that network traffic identification is carried out in closed sets, meaning that all potential classes in a classification task are known at the time of training. Existing detection systems cannot correctly identify new types of traffic when they arrive. These new unknown types of traffic are often misclassified as known traffic categories, resulting in a high rate of false positives in the detection system. Many unknown attacks, such as zero-day attacks and new variants of malware, generate malicious traffic that can exploit the vulnerability to evade detection by traffic monitoring systems, posing a serious threat to network security. Therefore, the unknown traffic identification problem, also known as the open-world traffic identification or zero-day application identification problem [
19], needs to be studied.
The process of identifying unknown traffic in a detection system can generally be divided into three stages. Firstly, the system separates known traffic from unknown traffic, thus achieving an accurate classification of known traffic. Secondly, the system detects new classes from the separated unknown traffic, labels the recognized new classes, and adds them to the known classes. Finally, there is a phase of incremental learning, where the previous model is updated based on the updated known class dataset. Due to the unlabeled characteristics of unknown traffic, most current research in this area tends to use unsupervised machine learning models for the recognition of unknown traffic.
Existing unknown traffic recognition algorithms have two main limitations. First, in terms of traffic feature selection, existing methods usually require complex and time-consuming feature engineering. Feature selection depends heavily on the experience of domain experts, and the selected features tend to be of a single type, resulting in low identification efficiency. Second, the entire system needs to be retrained every time new traffic data are collected, which requires a lot of time and effort and gives the system poor real-time performance and utility. As a result, these methods are not well suited for real-time applications and are primarily used for offline data classification, making it difficult to meet the demands of modern intelligent network supervision.
In this study, we combined the benefits of automatic feature extraction and deep learning to design and implement an algorithm for classifying unknown traffic. This algorithm is tailored to meet the requirements of current network identification tasks and not only identifies known traffic but also distinguishes unknown traffic in real time with high identification accuracy. The main contributions of this paper are as follows:
We propose an intelligent feature processing method that uses a multiple-channel parallel neural network to extract temporal and spatial features from raw network traffic data, combined with feature fusion based on the mRMR algorithm, which achieves a good balance between accuracy and time consumption in traffic detection.
We present a clustering approach based on the density ratio to distinguish between known and unknown traffic features and construct new classes for unknown traffic. The method can dynamically expand the clustering results by adding new data, thus improving efficiency and handling noise.
We improve an incremental learning multi-class SVM classifier that autonomously learns based on the features of known traffic and detected unknown traffic, without the need to train the classifier from scratch each time.
We establish an incremental learning model for unknown traffic identification and validate its performance on the public datasets ISCX-VPN-Tor and NSL-KDD and on a self-collected dataset, SelfDataset.
The remainder of this paper is structured as follows: The study background and related studies are given in
Section 2, with an emphasis on the commonly used techniques for identifying unknown network activity. Our suggested framework for an incremental unknown traffic detection model is detailed in
Section 3. The setup and results of the experiment are described and discussed in
Section 4. Finally, the entire paper is summarized in
Section 5.
3. Design of Framework
This paper suggests a framework that combines a data preprocessing algorithm, a feature extraction and fusion algorithm, unknown traffic recognition, and an incremental learning algorithm to achieve the identification of unknown types of traffic in a closed set.
Figure 1 depicts the model architecture suggested in this paper.
3.1. Data Preprocessing
The flow–image transformation method [
42], which was used in our previous study for raw traffic features, is utilized in the data preprocessing stage. Data preprocessing converts raw traffic data from PCAP files to the IDX file format through three steps: traffic splitting, traffic cleaning, and image conversion. We use SplitCap to split the traffic into flows, grouping packets that share the same five-tuple, so that each small PCAP file contains a single TCP or UDP session. To lessen the subsequent workload, the traffic cleaning step removes packets that are not useful for traffic classification, such as short sessions with insufficient payload size and some auxiliary packets, such as Domain Name System (DNS) packets for host name resolution. Because MAC and IP addresses cannot be used as training features for traffic classification and can interfere with the feature extraction process of the neural network, which can result in model overfitting, traffic anonymization is also carried out in this step. Each cleaned PCAP file is cropped to 1024 ($32 \times 32$) bytes, each of which is represented as a pixel, and is then converted into a single-channel grayscale image of size $32 \times 32$. These images reflect the natural characteristics of the traffic, and the images of different traffic types can be easily distinguished from one another. To provide data to the deep learning model, the file containing all pixel sequences is transformed into an IDX format file. Through this preprocessing, we convert the raw traffic data into images, which allows advanced image processing techniques to be applied to traffic processing.
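The cropping and reshaping step can be sketched as follows. This is a minimal illustration rather than the authors' toolchain: it assumes that the 1024 bytes are laid out row by row as a 32 × 32 image and that flows shorter than 1024 bytes are zero-padded (padding is not stated in the text). `flow_to_image` is a hypothetical helper name.

```python
from typing import List

IMAGE_SIDE = 32                       # assumed side length: 32 * 32 = 1024 bytes
FLOW_BYTES = IMAGE_SIDE * IMAGE_SIDE  # number of bytes kept per flow


def flow_to_image(payload: bytes) -> List[List[int]]:
    """Crop or zero-pad a cleaned flow to 1024 bytes and reshape it to 32 x 32.

    Each byte (0-255) becomes one grayscale pixel of a single-channel image.
    Zero-padding of short flows is an assumption of this sketch.
    """
    fixed = payload[:FLOW_BYTES].ljust(FLOW_BYTES, b"\x00")
    return [
        list(fixed[row * IMAGE_SIDE:(row + 1) * IMAGE_SIDE])
        for row in range(IMAGE_SIDE)
    ]


if __name__ == "__main__":
    image = flow_to_image(bytes(range(256)) * 5)  # 1280 bytes, cropped to 1024
    print(len(image), len(image[0]))
```

The nested list can be handed to any image or array library; serialization to the IDX format is left out of the sketch.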
3.2. Feature Extraction and Fusion
In this study, we extract features from the preprocessed traffic using three pretrained models. To extract spatial information, we use two CNN models, AlexNet and VGG16, and an LSTM architecture is used to learn temporal features. The extracted deep features are reduced using the minimum redundancy maximum relevance (mRMR) algorithm, which keeps the top 100 deep features for each neural network architecture. The features selected by mRMR are then combined and input into the classifier.
3.2.1. Spatial and Temporal Feature Extraction
CNNs, as a biologically inspired model, have a classical structure consisting of convolutional, pooling, and fully connected layers, allowing them to learn sophisticated representations of the input data and extract significant features from them, reflecting the essential patterns in the data. This is why they are frequently utilized and achieve outstanding results in demanding tasks such as object detection [
43] and semantic segmentation [
44]. AlexNet and VGG16 have both achieved state-of-the-art results in image classification tasks and have subsequently become popular benchmarks for evaluating new CNN architectures. The two different CNN networks were employed in this study with the network structure shown in
Figure 2, and they were used to automatically extract the spatial features of the original traffic from the input images.
AlexNet [
45] is a seminal deep learning model, introduced in 2012, that earned the top place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark contest for image classification, in that year. AlexNet uses a relatively simple convolutional and pooling layer structure, which makes the structure and parameters of the network easy to understand and interpret, and the deep network structure in turn makes it excellent for image feature extraction. We trained an AlexNet model without a local response normalization layer in this study. It had 5 convolutional layers and 3 fully connected layers with an input image size of
. Each convolutional layer convolves the input with the convolutional kernel and then maps the output features via the activation function. In our model, the filter size of the convolutional layer is
pixels, and the step size is 1. The ReLU function is the activation function of the convolutional layer. The first, second, and fifth convolutional layers are connected to the maxpooling layer afterward, which can effectively reduce the number of parameters and complexity of network operation. The filter size of the max pooling layer was designed to be
pixels with a stride of 3. The third and fourth convolutional layers are then directly connected. As for the fully connected layers, the first two have 4096 neurons each, and the dropout technique is added to avoid overfitting during training. The third fully connected layer has 1000 neurons, corresponding to the extraction of 1000 feature vectors
, where each feature vector has 256 dimensions. Each convolutional layer and fully connected layer is followed by a batch normalization layer to solve the problem of unstable output in deep neural networks.
VGG16 [
46] is another popular CNN architecture, consisting of 13 convolutional layers, 5 pooling layers, and 3 fully connected layers. This deep structure allows VGG16 to better capture details and complex features in images, thus improving the accuracy of feature extraction. Like AlexNet, VGG16 also has 5 sets of convolutions, but each set contains more convolutional layers and all convolutional layers use
$3 \times 3$ convolution kernels. Each of the first two convolutional sets contains two convolutional layers, and each of the third, fourth, and fifth sets contains three. The difference from AlexNet is that a
$2 \times 2$ max pooling is performed after each set of convolutions in VGG16. By continuously using multiple stacked small-sized kernel filters instead of large-sized filters, VGG16 is able to learn more complex features but, at the same time, the multiple nonlinear layers increase the depth of the network and the number of parameters. Therefore, to alleviate the issue of excessive parameter count, VGG16 eliminates the batch normalization (BN) layers after the convolutional layers. Additionally, the parameters are shared between different layers, which reduces the number of parameters in the network, makes the model more compact and efficient, and helps prevent the problem of overfitting. As for the fully connected layers used for deep feature extraction, they are configured the same as in AlexNet. VGG16 gradually reduces the size of the feature map by stacking convolutional and pooling layers several times, thus producing a smooth feature representation. This representation is useful for subsequent classification tasks to better distinguish between the different classes of traffic images. The output of the VGG16 network channel is also 1000 256-dimensional feature vectors.
AlexNet and VGG16 are used in parallel and complement each other to extract sufficient spatial traffic features from images for classification. In addition to single-packet features, the temporal features between preceding and subsequent packets contribute significantly to the classification of traffic. Because CNNs lack a memory function, they cannot capture the temporal features contained in the contextual relationships between packets and flows in traffic data. RNNs perform exceptionally well in handling these features. Long short-term memory (LSTM) [
47] is a variation of RNN that has a time-sequential structure. It retains the advantage of RNN in capturing contextually correlated information while effectively overcoming the gradient vanishing and exploding problems. LSTM can retain the information of time sequences through internal memory units and gating mechanisms; its output at a given time depends not only on the input at the current time but also on the input and state at the previous time. In this study, we used LSTM structures to extract these time-dimensional features and combined them with the space-dimensional features extracted by AlexNet and VGG16 as metric features to better process traffic data. The temporal features extracted by the LSTM are also 1000 256-dimensional vectors, which are defined as follows:
3.2.2. Feature Selection and Fusion
Many features are extracted from the deep neural networks, and they may contain many redundant elements. Our application focuses mainly on encrypted traffic, which carries more redundant information than normal traffic. This redundancy affects the accuracy of subsequent classifiers, so it is necessary to perform feature selection on the extracted deep features.
We adopted the minimum redundancy maximum relevance (mRMR) algorithm [
48] to select, from a large number of features, those that have the least redundancy and the most impact on the target. It evaluates the importance of each feature by calculating its correlation with the target and its independence from the other features. Finally, based on the evaluation results, the features are selected from high to low until the required number of features is reached. In other words, this method balances relevance and redundancy when ranking the feature set. The specific calculation procedure for feature selection using the mRMR algorithm is as follows:
In this study, we used mutual information as a criterion to measure the redundancy between features and the correlation between features and class variables, which was calculated using Equation (5), where $X$ and $Y$ are two different types of variables; $p(x)$ and $p(y)$ are the respective marginal probability distribution functions; and $p(x, y)$ is their joint probability distribution function.
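For reference, the standard definition of mutual information in terms of the marginal and joint distributions can be written as follows. The notation $p(x)$, $p(y)$, and $p(x, y)$ is assumed here, and the authors' Equation (5) may instead use a discrete sum or a different logarithm base.

```latex
I(X; Y) = \iint p(x, y) \, \log \frac{p(x, y)}{p(x)\, p(y)} \, \mathrm{d}x \, \mathrm{d}y
```

For discrete variables, the double integral is replaced by a double sum over the values of $X$ and $Y$. In both cases $I(X; Y) = 0$ when the two variables are independent, since then $p(x, y) = p(x)\,p(y)$.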
Our goal is to find a feature subset $S$ of the deep feature set $F$. First, we initialize a set $S$. For each feature $f_i$ in $F$, we calculate its correlation $W$ with the target category $c$ using the mutual information $I(f_i; c)$, as shown in Equation (6). The redundancy is calculated using the average of the mutual information between features, as shown in Equation (7).
We want the correlation $W$ between features and classes to be as large as possible and the redundancy $R$ between features to be as small as possible. Therefore, the features can be ranked by a simple combination of the two conditions, as indicated in Equation (8). The features that satisfy this criterion are selected and stored in the set $S$.
We repeat the above steps until the set $S$ contains the required number of elements or until it contains every element of $F$.
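The greedy selection loop described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: features are taken to be discretized (so mutual information can be estimated from counts), and relevance and redundancy are combined as the difference $W - R$, which is one common mRMR criterion; the exact combination in Equation (8) is not reproduced here. The names `mutual_information` and `mrmr_select` are hypothetical.

```python
from collections import Counter
from math import log
from typing import List, Sequence


def mutual_information(x: Sequence[int], y: Sequence[int]) -> float:
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def mrmr_select(features: List[Sequence[int]],
                target: Sequence[int], k: int) -> List[int]:
    """Greedily pick up to k feature indices, maximising W - R at each step.

    W is the mutual information between a candidate and the target;
    R is its average mutual information with the already selected features.
    The difference form W - R is an assumption of this sketch.
    """
    relevance = [mutual_information(f, target) for f in features]
    selected: List[int] = []
    remaining = set(range(len(features)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            if not selected:
                return relevance[i]
            redundancy = sum(
                mutual_information(features[i], features[j]) for j in selected
            ) / len(selected)
            return relevance[i] - redundancy

        best = max(sorted(remaining), key=score)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    a = [0, 0, 0, 0, 1, 1, 1, 1]
    b = [0, 0, 1, 1, 0, 0, 1, 1]
    target = [2 * u + v for u, v in zip(a, b)]
    # The duplicate of `a` is skipped in favour of the complementary feature `b`.
    print(mrmr_select([a, list(a), b], target, k=2))
```

In the toy example, the second copy of `a` is as relevant as `b` but fully redundant with the first pick, so the loop prefers `b`.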
In our proposed architecture, 1000 features of each deep neural network channel are extracted as the input to the mRMR algorithm, and the number is reduced to 100 after filtering. The selected feature representations are shown in Equations (9)–(11).
These selected features are next merged and stored in the feature database. The fused feature vector is a simple concatenation of the selected feature vectors of the three parallel channels, AlexNet, VGG16, and LSTM, and is represented as follows:
3.3. Unknown Traffic Recognition
A trained SVM multi-classifier is used to recognize the extracted and fused features from the previous step. We first construct the SVM classifier with $N$ classes. If a feature belongs to one of the $N$ known traffic classes in the training set, it is classified into that traffic class; otherwise, it is classified as unknown traffic and enters the unknown traffic identification process. The purpose of this process is to separate known traffic from unknown traffic to improve the accuracy of known traffic identification.
Unlike the feature distribution pattern of known traffic, unknown traffic has certain similarities among similar features, but its spatial distribution is relatively random, and the shape of clusters is irregular, so it is obviously not feasible to classify unknown features using fixed distance alone. In order to better adapt to the characteristics of the distribution of unknown traffic features, we designed a density-ratio-based clustering that can classify unknown traffic in real time. The execution process of the algorithm is shown in Algorithm 1.
In our scenario, clustering methods that require prior setting of the number of clusters are not applicable because the unknown features do not have labels, and the number of categories for the unknown traffic is unknown. Density-based clustering algorithms can identify clusters of any shape and size in datasets with noise, but due to the use of a global density threshold, it is often difficult to identify all clusters in datasets with large density variations. We modified density-based clustering algorithms using the density ratio, which calculates the ratio of the density of a core point and of its neighborhood. In the algorithm proposed in this section, we divide the unknown features with sufficient density ratio into an unknown class. The feature points located in a very small neighborhood
of a feature
are defined as
, those in a larger neighborhood
are calculated as
in the same way, and the ratio of the number of the two is the density ratio
.
Algorithm 1: Unknown traffic classification
When the density ratio of a feature is greater than the set density threshold, a new unknown class is created for it. The feature points within the neighborhood of that feature are screened with the same density ratio criterion, and those that meet it are added to the newly created unknown class. That is to say, an unknown class is the largest set of features connected by the density ratio.
If the neighborhood density ratio of the unknown feature does not reach the specified threshold, the distance is compared with the average feature of each known traffic category. If it is less than the distance threshold T, it is judged to be an unknown feature of a known class, and the corresponding class label is output; otherwise, it is output as a noise point N.
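The decision logic described in the two paragraphs above can be sketched as follows. This is a simplified batch illustration, not a reproduction of Algorithm 1: it assumes Euclidean distance, counts each point as its own neighbor, and adds a `min_pts` guard (not part of the description) so that isolated points, whose density ratio is trivially 1, cannot seed a class. All function and parameter names are hypothetical, and `known_means` is assumed to be non-empty.

```python
from typing import Dict, List, Optional, Sequence

Point = Sequence[float]


def euclid(a: Point, b: Point) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5


def neighbours(points: List[Point], i: int, radius: float) -> List[int]:
    """Indices of all points (including i itself) within `radius` of point i."""
    return [j for j, p in enumerate(points) if euclid(points[i], p) <= radius]


def classify_unknown(points: List[Point], known_means: Dict[str, Point],
                     eps: float, eta: float, ratio_threshold: float,
                     dist_threshold: float, min_pts: int = 3) -> List[str]:
    """Label each feature as a new unknown class, a known class, or noise."""
    n = len(points)
    small = [neighbours(points, i, eps) for i in range(n)]   # small neighborhood
    large = [neighbours(points, i, eta) for i in range(n)]   # larger neighborhood
    dense = [
        len(small[i]) >= min_pts
        and len(small[i]) / len(large[i]) > ratio_threshold  # density ratio test
        for i in range(n)
    ]
    labels: List[Optional[str]] = [None] * n
    created = 0
    for seed in range(n):
        if not dense[seed] or labels[seed] is not None:
            continue
        name = f"unknown_{created}"  # create a new unknown class for the seed
        created += 1
        labels[seed] = name
        frontier = [seed]
        while frontier:  # grow the class through density-ratio-connected points
            current = frontier.pop()
            for j in small[current]:
                if dense[j] and labels[j] is None:
                    labels[j] = name
                    frontier.append(j)
    for i in range(n):
        if labels[i] is None:  # fall back to the known class means
            nearest = min(known_means,
                          key=lambda c: euclid(points[i], known_means[c]))
            close = euclid(points[i], known_means[nearest]) < dist_threshold
            labels[i] = nearest if close else "noise"
    return labels


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.2, 5.0), (10.0, -10.0)]
    print(classify_unknown(pts, {"chat": (5.0, 5.0)}, 0.5, 2.0, 0.5, 1.0))
```

The sketch processes a fixed batch; the real-time behavior described in the text would require updating the neighborhoods incrementally as new features arrive.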
3.4. Incremental SVM Classifier
The features of the unknown traffic detected by the algorithm are also stored in the feature database, which poses higher demands on our classifier. It needs to gradually learn from the updated feature library, becoming a new multi-classifier capable of recognizing a wider range of application types.
The essence of SVM is the estimation of the optimal decision function. In
N-class classification problems, the goal is to find the optimal
hyperplanes. We first consider the classification of known traffic using SVM. The mathematical description of the known feature database
is shown in Equation (
13), which contains
n features, each with a label that belongs to one of the
N traffic classes. Equation (
14) demonstrates the optimization objective function of a multi-class SVM based on the Lagrange equation, where
is a one-versus-rest hyperplane vector,
b is bias,
is a slack variable,
denotes the kernel function, and
denotes a penalty factor.
In the current state
s, the weight vector
and bias vector
of the hyperplane are described in Equations (15) and (16), respectively.
When the state comes to
, it indicates that the features of a new traffic category are added to the feature database; at this time, the feature database is updated to
, where the number of features changes to
. After the feature library is updated, the features of the existing categories in the previous state may increase, so the hyperplane of the classifier needs to be adjusted. Moreover, a new classification plane should be added in order to learn the features of the new category. The weight vector and bias parameter of the hyperplanes in the state
are updated as given in Equations (17)–(20).
In summary, the optimization function of the incremental multi-class SVM is improved, as shown in Equation (
21), where
is the weight of the old classification plane, and
is the knowledge weight of the old feature base. The classifier is able to learn new knowledge based on the classification of known types of traffic using the existing model.
4. Evaluation
4.1. Datasets for Evaluation
In real networks, new applications frequently produce unknown flows, particularly when encryption is involved. We used two publicly accessible datasets, ISCXVPN2016 and ISCXTor2016, and extracted part of their data to construct the experimental dataset ISCX-VPN-Tor, which we used to evaluate the performance of the proposed approach in different encrypted traffic scenarios. In the ISCXVPN2016 dataset, each class of traffic is encapsulated by regular encryption and VPN protocols. The traffic in the ISCXTor2016 dataset is encrypted by Tor technology. Both datasets were published by the University of New Brunswick, whose researchers created accounts to run a number of representative applications (such as Facebook, Skype, Spotify, and Gmail) on virtual machine workstations that were connected to the Internet through a gateway virtual machine, which in turn routed all traffic through the Tor network or a VPN tunnel. Flows were generated from PCAP files captured at the gateway and labeled according to the applications executing on the workstations. The traffic in each dataset includes six application categories, namely Email, Chat, Streaming, File Transfer, VoIP, and P2P. Therefore, a total of 12 application categories of traffic are included in the experimental dataset, as detailed in
Table 1. Dataset ISCX-VPN-Tor represents real-world VPN tunnel and Tor network traffic to some extent.
Another intrusion detection dataset, NSL-KDD, was used to validate the detection performance of the proposed method for unknown attacks. NSL-KDD contains over 96,000 flows, including normal traffic and 22 attack patterns grouped into four categories, namely DoS attacks, probe attacks, U2R attacks, and R2L attacks, as detailed in
Table 2. This dataset was synthesized by a traffic generator and is therefore not fully representative of the traffic in a real environment, but it has become one of the relatively authoritative intrusion detection datasets in the field of network security due to its suitability for studying the characteristics of various attack classes. The per-category flow counts of NSL-KDD show that the dataset is imbalanced: normal samples account for most of the traffic, DoS dominates the attack traffic, and U2R attacks account for only a small percentage, which is a problem we needed to consider.
In addition, we constructed a self-collected dataset, SelfDataset, by capturing traffic generated by commonly used scientific software as normal traffic and malicious traffic from malicious samples running in a sandbox. These malicious samples were also captured in real-world scenarios. A demo of the SelfDataset was used for conducting experiments, with a size of 4.18 GB, as shown in
Table 3.
Some applications in these datasets were chosen as the unknown classes, and the rest of each dataset formed the known classes. We used 80% of the known samples to train the model and 10% of the known samples as the validation set to adjust the parameters of the model through the deviation from ground truth and improve its prediction performance. The remaining 10% of the known samples and all unknown samples were used as the test set to evaluate the proposed method.
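The split described above can be sketched as follows. The sample layout `(flow_id, label)`, the function name `open_set_split`, and the fixed random seed are assumptions made for illustration; the text does not state how the samples were shuffled or whether the split was stratified by class.

```python
import random
from typing import Dict, List, Set, Tuple

Sample = Tuple[str, str]  # (flow identifier, class label); hypothetical layout


def open_set_split(samples: List[Sample], unknown_classes: Set[str],
                   seed: int = 0) -> Dict[str, List[Sample]]:
    """Split known samples 80/10/10 and put every unknown sample in the test set."""
    known = [s for s in samples if s[1] not in unknown_classes]
    unknown = [s for s in samples if s[1] in unknown_classes]
    random.Random(seed).shuffle(known)     # unstratified shuffle (assumption)
    n_train = len(known) * 8 // 10         # 80% of known samples for training
    n_val = len(known) // 10               # 10% of known samples for validation
    return {
        "train": known[:n_train],
        "val": known[n_train:n_train + n_val],
        "test": known[n_train + n_val:] + unknown,  # rest of known + all unknown
    }


if __name__ == "__main__":
    data = [(f"flow{i}", "Chat") for i in range(100)]
    data += [(f"new{i}", "P2P") for i in range(20)]
    parts = open_set_split(data, unknown_classes={"P2P"})
    print({name: len(part) for name, part in parts.items()})
```

Because unknown classes never appear in the training or validation sets, the test set is the only place where the model encounters them, which is what makes the evaluation open-set.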
4.2. Experiment Settings
The testing and comparative experiments of the model were run with the support of an NVIDIA Tesla T4 GPU on an Ubuntu 20.04 LTS operating system and were implemented using Python 3.6.5. The main modules used in this experiment were TensorFlow 1.14.0 and sklearn 1.0.2.
The workflow of the model is as follows: Python captures and parses the network traffic. The parsed traffic is collected by Kafka and sent to the pre-trained feature extraction and fusion module for processing. The results are then stored in a MySQL server as a feature database. The SVM classifier retrieves the traffic features from the MySQL database for classification. Unknown traffic features are clustered using the density-ratio-based clustering algorithm to form new categories, and labeled data are stored in the MySQL database. During this process, the classifier dynamically adjusts and updates. The classification results of the traffic can be visualized through a frontend interface. The pretraining of the model is completed on a GPU server, and the online model is deployed in Docker using TensorFlow Serving.
Pretrained deep models are often adapted to a new task by means of transfer learning. The parameter values of the pretrained neural networks used in the experiments are given in
Table 4. Each neural network had an input image size of
. The models were optimized using stochastic gradient descent (SGD) with a momentum of 0.9, a decay of $10^{-4}$, a mini-batch size of 32, and a learning rate of 0.001. The pretraining process of the model was able to be completed with high accuracy. However, the validation process of the pretrained model showed an oscillating loss, indicating that the model could not fully converge. That is, the pretrained model could not fully learn the local features of the traffic.
Figure 3 shows how the loss function changes when the model is pretraining. During the training process, the loss function gradually decreased until it reached a slower convergence speed after 80 steps. After 200 steps, it tended to converge. This demonstrates that the model has a good training speed.
In order to measure the performance of the models, accuracy (ACC), true positive rate (TPR), false positive rate (FPR), FTF, and F1 score (F1) metrics derived from the confusion matrix were used; their formulations are given in Equations (22)–(26). In the experiment, unknown traffic flows were treated as positive samples. Thus, TP (true positive) and FN (false negative) in the equations are the numbers of unknown flows that were correctly determined to be unknown and incorrectly determined to be known, respectively; and TN (true negative) and FP (false positive) are the numbers of known flows that were correctly identified as known and incorrectly identified as unknown, respectively. As can be seen from the equations, ACC characterizes the proportion of all flows that are correctly classified, TPR indicates the proportion of all unknown flows that are correctly predicted as unknown, and FPR indicates the proportion of all known flows that are incorrectly predicted as unknown. FTF takes both TPR and FPR into account, and F1 is a metric that combines the precision and recall of a model. The Kappa factor is also a measure of model consistency based on the confusion matrix; it is calculated as shown in Equation (27), where $p_o$ represents the overall classification accuracy and $p_e$ refers to the expected agreement rate. To evaluate the performance of different classification algorithms, we also plotted ROC curves and compared their AUC values.
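As a rough illustration of how these metrics follow from the confusion matrix, the sketch below computes ACC, TPR, FPR, F1, and the Kappa coefficient from the four counts. It assumes the standard textbook definitions with unknown traffic as the positive class; the function name `confusion_metrics` is hypothetical, and FTF is omitted because its formula is not reproduced in this section.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute ACC, TPR, FPR, F1 and Cohen's kappa from confusion-matrix counts.

    Assumed convention: positive = unknown traffic, negative = known traffic.
    The counts must be such that no denominator below is zero.
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total              # share of all flows classified correctly
    tpr = tp / (tp + fn)                 # unknown flows correctly flagged as unknown
    fpr = fp / (fp + tn)                 # known flows wrongly flagged as unknown
    f1 = 2 * tp / (2 * tp + fp + fn)     # harmonic mean of precision and recall

    # Cohen's kappa: observed agreement p_o against chance agreement p_e.
    p_o = acc
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (p_o - p_e) / (1 - p_e)

    return {"acc": acc, "tpr": tpr, "fpr": fpr, "f1": f1, "kappa": kappa}


if __name__ == "__main__":
    # Illustrative counts only, not results from the experiments.
    print(confusion_metrics(tp=40, fp=10, tn=45, fn=5))
```

With these illustrative counts, the observed agreement is 0.85 and the chance agreement is 0.5, so the Kappa coefficient is 0.7.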
4.3. Selection of Feature Number and Threshold
The number of features extracted per neural network channel in the feature extraction and fusion module and the threshold value in the unknown traffic classification algorithm are significant parameters that affect the performance of our model. In this section, we discuss how to select the values of the two parameters.
We first considered the effect of the number of features selected by the mRMR algorithm on recognition performance; the results are shown in
Figure 4. We set the number of features extracted from each neural network channel to be the same, so the number of features after fusion is a multiple of 3. As can be seen from
Figure 4, accuracy is best when each neural network extracts and selects the top 100 features with the mRMR algorithm, that is, when the number of fused features is 300. When the number of fused features exceeds 360, detection accuracy gradually decreases. We attribute this to feature redundancy in encrypted traffic: an excessive number of features may interfere with traffic classification.
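The per-channel selection and fusion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it uses the greedy "relevance minus mean redundancy" form of mRMR, estimates mutual information from counts on discrete features (continuous network outputs would first need discretization or a different estimator), and the names `mutual_information`, `mrmr_select`, and `fuse` are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr_select(columns, labels, k):
    """Greedily pick k feature indices with high relevance and low redundancy."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected, remaining = [], list(range(len(columns)))

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(
            mutual_information(columns[j], columns[s]) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def fuse(channel_columns, labels, k_per_channel):
    """Select k features per channel and concatenate them into one feature list."""
    fused = []
    for columns in channel_columns:
        fused.extend(columns[j] for j in mrmr_select(columns, labels, k_per_channel))
    return fused


if __name__ == "__main__":
    a = [0, 0, 0, 0, 1, 1, 1, 1]
    b = [0, 0, 1, 1, 0, 0, 1, 1]
    labels = [2 * ai + bi for ai, bi in zip(a, b)]
    channel = [a, list(a), b]  # the second feature duplicates the first
    print(mrmr_select(channel, labels, 2))
    print(len(fuse([channel, channel, channel], labels, 2)))
```

In the toy data, the duplicated feature is skipped in favour of the complementary one, which is the redundancy effect described above. With three channels and the same number of features selected per channel, the fused feature count is always a multiple of 3.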
The threshold value in the unknown traffic detection algorithm also has an important impact on model performance. As for the threshold selection problem, we conducted the following experiments.
We calculated the distribution of the distances between the features extracted from all classes of traffic and the average known features, as shown in
Figure 5, where the red bars represent known features, the blue bars represent unknown features, the vertical axis shows the number of features, and the horizontal axis indicates the Euclidean distance from the average known features. The known features lie close to the average, whereas most of the unknown features lie farther away, and the two kinds of features are best separated when the distance is between 1.0 and 1.5. Therefore, we chose a threshold value of 1.3.
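The thresholding rule implied by this experiment can be sketched as follows. The sketch assumes a single centroid computed over all known features and treats a feature as unknown when its Euclidean distance from that centroid exceeds the threshold; the helper names and the toy vectors are hypothetical, and the full detection algorithm may involve further steps.

```python
from math import dist  # available in Python 3.8+


def mean_vector(vectors):
    """Component-wise average of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def is_unknown(feature, known_mean, threshold=1.3):
    """Flag a feature as unknown if it lies farther than `threshold` from the known mean."""
    return dist(feature, known_mean) > threshold


if __name__ == "__main__":
    # Toy two-dimensional features, for illustration only.
    known = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]]
    centre = mean_vector(known)
    print(is_unknown([0.3, 0.1], centre))  # close to the known mean
    print(is_unknown([1.5, 1.1], centre))  # far from the known mean
```

Raising the threshold reduces the number of known flows flagged as unknown at the cost of missing more unknown flows, which is why the value is chosen from the region where the two distributions separate.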
4.4. Necessity of Multiple-Channel and Feature Fusion
In order to demonstrate the effectiveness of the multiple-channel architecture and the necessity of the feature fusion method, we conducted 10 experiments on the extracted test set using different architectures. The compared architectures included single-channel AlexNet, single-channel VGG16, single-channel LSTM, and three-channel parallel network structures, where each single-channel network is directly connected to the classifier without feature selection and fusion. The multiple-channel structure is further divided into two types: with and without the feature fusion algorithm. The average test accuracy and average test time results for the five different approaches are shown in
Figure 6.
The experimental results showed that, compared with the multi-channel structure, the three single-channel architectures consume less time, but their test accuracy is also much lower, below 0.75. This is because the single-channel structures are simpler and extract only a limited set of effective features. The single-channel structures produced varying results. AlexNet consumes less time; VGG16, with a deeper network, can extract more features, thereby improving accuracy; and LSTM does not have efficient feature extraction capabilities like CNNs, so its accuracy is low when used alone. The multiple-channel structure can combine the features extracted by different deep neural networks, including spatial and temporal features, which significantly improves accuracy, but it takes more time than the single-channel structures. Since the features extracted by different channels may have different degrees of relevance to the classification task, and there may be redundant information between them, a large number of features can affect the time performance and accuracy of the overall model. Therefore, we considered selecting and fusing the features from both correlation and redundancy perspectives. After adding the feature selection and fusion algorithms, the average test accuracy of the multi-channel structure improved by more than 10%, reaching 0.942, while the time consumption increased by only 0.87 s. Considering both accuracy and time consumption, we conclude that using a multiple-channel structure together with a feature fusion algorithm is necessary.
4.5. Comparison Experiments
To demonstrate the effectiveness and robustness of the proposed method, experiments were conducted under the five scenarios detailed in
Table 5.
Scenario A: All unknown traffic in the test dataset has a completely different type of application service than the known traffic in the training dataset.
Scenario B: The unknown traffic in the test dataset has similar application service types as the known traffic in the training dataset.
Scenario C: All unknown attacks in the test dataset have completely different attack mechanisms than the known attacks in the training dataset.
Scenario D: The unknown attacks in the test dataset have similar attack patterns to the known attacks in the training dataset.
Scenario E: All unknown normal and malicious traffic in the test dataset has a completely different type of application service than the known traffic in the training dataset.
Figure 7 and
Figure 8 show the normalized confusion matrices of the proposed MI-UTR model for the classification results of known and unknown traffic in each of the five scenarios. The vertical axis of each confusion matrix is the true label of the traffic, and the horizontal axis is the label predicted by the classifier, where a darker color indicates a higher probability of correct classification. We calculated the accuracy, misclassification rate, and Kappa coefficient for each scenario based on the confusion matrix, and these values are indicated below each matrix.
In addition to our proposed MI-UTR model, the other methods used for comparison were as follows:
CNN-GR [
49] employs a standard CNN architecture consisting of three convolutional layers and one fully connected layer to identify unknown traffic through the gradient of the first backpropagation.
Open-CNN [
30] model adds an OpenMax layer to the standard CNN model with two convolutional layers and two fully connected layers. The OpenMax layer improves upon the SoftMax layer by adding an extra class “unknown” to the activation vectors of SoftMax to detect the unknown attack.
HMCD [
50] uses the WGAN-GP generation algorithm to enhance the data and a hybrid neural network based on CNN and LSTM to extract the hierarchical spatial–temporal features of the traffic, which can effectively detect unknown HTTP-based malicious communication traffic.
We employed a simple CNN architecture as a benchmark for comparison with the aforementioned unknown traffic classification algorithms. The comparison results are shown in
Table 6, where the optimal results are in bold font.
Overall, our proposed MI-UTR achieved the highest accuracy and outperformed all the other methods under four different scenarios, reflecting high accuracy both in classifying the known classes and in rejecting the unknown classes. This result demonstrates that MI-UTR classifies network traffic in the open world better than the other three methods. As for the trade-off between TPR and FPR, MI-UTR achieved better FTFs than the other methods. Although the FPRs of HMCD and CNN-GR were lower than those of the other two methods, their TPRs were lower as well; that is, HMCD and CNN-GR both performed poorly at recognizing the unknown samples. Additionally, the TPRs of Open-CNN were extremely high, but its FPRs were also high, resulting in low FTFs, which illustrates that Open-CNN misclassifies numerous known samples when rejecting unknown samples. Together, the results demonstrate that MI-UTR performs best at identifying unknown samples while misclassifying as few known samples as possible.
To comprehensively compare the performance of these classifiers in different scenarios, we plotted their respective ROC curves and calculated the AUC values, as shown in
Figure 9.
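For reference, the AUC of a detector can be computed directly from its scores without plotting the curve. The sketch below uses the rank interpretation of AUC, namely the probability that a randomly chosen positive (unknown) sample receives a higher score than a randomly chosen negative (known) sample, with ties counted as one half. The function name `roc_auc` and the example scores are hypothetical; what each classifier uses as its "unknown" score is an assumption that depends on the method.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores.

    `labels` uses 1 for positive (unknown traffic) and 0 for negative (known traffic).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Count pairs where the positive outranks the negative; ties contribute 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Illustrative scores only: higher means "more likely unknown".
    print(roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

The pairwise form costs time proportional to the product of the two class sizes, so for large test sets a sort-based computation or a library routine would be preferable.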
In
Section 4.4, we discussed the accuracy and time consumption of multi-channel and single-channel architectures. To demonstrate the superiority of our proposed method over other unknown traffic detection algorithms, it is important to compare not only model accuracy but also time consumption among the different models. From
Figure 10, it can be observed that although the three datasets used have different sizes, the rate at which each algorithm processes traffic remains relatively constant. CNN-GR and Open-CNN are both modified versions of the basic CNN architecture that enable the recognition of unknown traffic. As the overall structural changes are not significant, there is little difference in time consumption among CNN, CNN-GR, and Open-CNN. Because it uses generative algorithms to augment data, HMCD has more complex parameters and structures, resulting in significantly higher time consumption than the other algorithms. Our proposed MI-UTR algorithm exhibits slightly higher time consumption than the CNN architectures in Scenario A and Scenario B; in the other scenarios, their time consumption is similar. Overall, our method achieves higher accuracy at a relatively small time cost, demonstrating its superiority in terms of accuracy and efficiency.
4.6. The Percentage of Unknown Traffic Classes
In order to further demonstrate that the proposed method also performs well with different numbers of unknown classes, the following experiments were performed. We introduced the concept of openness [51] to characterize the ratio of unknown traffic to known traffic in the testing dataset. Openness is mathematically defined as:
$$\text{openness} = 1 - \sqrt{\frac{2 \times N_{TA}}{N_{TG} + N_{TE}}}$$
where $N_{TA}$ is the number of the known classes used in the training dataset, $N_{TE}$ is the number of all traffic classes in the testing dataset, and $N_{TG}$ is the number of traffic classes to be recognized. In this experiment, $N_{TG}$ was equal to $N_{TE}$ because of the need to classify all traffic classes in the testing dataset.
As can be seen from the above equation, openness reflects the amount of knowledge available for training the detection model. An openness of 0 indicates a completely known classification problem, while a high openness means that we have less knowledge of what is known for training, and distinguishing known traffic from unknown traffic is more difficult. To investigate the effect of openness on model performance, we conducted experiments under different openness settings.
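To make the openness values concrete, the following sketch computes openness from class counts. It assumes the definition commonly used in the open-set recognition literature, one minus the square root of twice the number of training classes divided by the sum of the testing and target class counts; this formula, the function name `openness`, and the parameter names are assumptions for illustration.

```python
from math import sqrt


def openness(n_train, n_test, n_target=None):
    """Openness of a recognition task from class counts.

    n_train:  number of known classes used in training
    n_test:   number of all classes present in testing
    n_target: number of classes to be recognized (defaults to n_test)
    """
    if n_target is None:
        n_target = n_test
    return 1.0 - sqrt(2.0 * n_train / (n_test + n_target))


if __name__ == "__main__":
    print(openness(5, 5))    # every test class is known
    print(openness(4, 16))   # only a quarter of the test classes are known
```

Under this definition, training on all test classes gives an openness of 0, while training on 4 of 16 test classes gives an openness of 0.5.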
Figure 11,
Figure 12 and
Figure 13 show the performance of the unknown traffic classification methods mentioned in the previous section under different openness values, measured by ACC, FTF, and F1.
From these figures, it can be seen that our proposed method significantly outperformed the other three methods at high openness. Specifically, the accuracy tended to decrease as the openness increased in all four scenarios, but the accuracy of our proposed model decreased more slowly and even increased slightly in scenario C. In other words, while the accuracy of the four methods did not differ much under low openness, the accuracy of our proposed model under higher openness was much higher than that of the other methods, especially for the detection of unknown attacks in scenario C and scenario D. For the FTF metric, our method achieved the best performance at all openness values in the four scenarios, which means that the method is able to balance the true positive rate of unknown traffic against the false positive rate of known traffic. As for the F1 score, the scores of Open-CNN and our proposed MI-UTR under scenario A were relatively close for different openness values; Open-CNN performed better at low openness in scenario B, but its performance decreased rapidly with increasing openness; the CNN-GR model in scenario C experienced a similar problem. The MI-UTR model in scenario D maintained an advantage at all openness values. In summary, our proposed MI-UTR model is robust across various scenarios and openness values.
5. Conclusions and Further Work
A novel strategy for detecting unknown traffic using incremental learning was developed in this study. The approach implements an mRMR-based multiple-channel parallel neural network to select the best features from both the temporal and spatial dimensions, which are then fed to the classifier to address the redundancy problem of encrypted traffic features. By learning from the feature database, the classifier can immediately classify the features of the known traffic into specific application categories.
Unknown traffic features are either added to a newly created unknown traffic class or classified as new features of the known traffic category or noise through the clustering algorithm based on the density ratio. The clustered unknown traffic features are incrementally updated into the feature database. The results of experiments on several publicly available datasets demonstrated that our model significantly outperforms existing approaches in a number of application scenarios, including classification of encrypted traffic in VPN tunnel and Tor applications, as well as intrusion detection for unknown traffic classification tasks.
In this section, we also want to highlight some of the limitations of our existing work as well as future work. In the experimental environment, our model was capable of processing traffic at a speed of several hundred megabits per second, indicating its good traffic analysis capabilities in medium-sized networks. Furthermore, due to its incremental learning feature, it exhibits strong online processing capabilities and scalability. However, in larger-scale networks, the processing speed and memory requirements of the model may become a bottleneck. To address this, one approach is to incorporate a sliding window mechanism to control the traffic rate. Additionally, further methods to improve speed include deploying a distributed database, employing higher-performance hardware such as high-speed parsers, and implementing the model in high-performance languages such as C++.
We employed a class-incremental learning (CIL) SVM classifier for active learning to update the existing traffic classifiers in this study. A key criterion for CIL is to strike a balance between stability and plasticity: the model should be stable enough to retain knowledge of previous classes while remaining plastic enough to learn new concepts from new classes. The experimental results showed that the misclassification rate and false positive rate for known traffic are both very low, indicating that the classifier sufficiently learned the known traffic categories. This suggests that the CIL classifier exhibits strong stability in learning known traffic classes. Regarding the learning of unknown traffic, the misclassification rate was also relatively low in the experiments, which may be attributable to the effectiveness of the robust traffic representation. However, it is important to consider that the traffic categories in these datasets are limited to fewer than 20 classes, whereas in real-world scenarios, the number of unknown traffic categories is likely to be much larger. When dealing with a larger number of traffic categories, the error rate of the CIL SVM classifier may increase. To mitigate this risk, strategies such as rehearsal techniques, regularization methods, and episodic memory can be employed to preserve and consolidate knowledge of previously learned classes. Further validation will be conducted when deploying the model in a real production environment, and in the future, balancing stability and plasticity will be one of the optimization directions for the incremental classifier.
Another important limitation is the generalizability of our method. Due to the variability of the content and behavior of Internet services, encrypted traffic changes significantly over time, and known application categories may present new patterns and features. Our model may classify the new features of these known traffic categories as unknown traffic, leading to a high false positive rate. Therefore, there is a need for more adaptive and dynamic methods to keep up with the evolving nature of encrypted traffic.
Furthermore, the robustness of our approach to adversarial attacks and evasion techniques is an area that requires further investigation. Attackers are constantly evolving their tactics to evade detection, and it is essential to develop defences that can effectively detect and mitigate these advanced evasion techniques. We plan to further investigate the model in the future in an effort to strengthen defences against attacks that use poisoned samples.
In conclusion, although our method has demonstrated promising results, there are significant limitations that need to be addressed. Future research efforts should focus on developing adaptive and dynamic methods, tackling scalability challenges, and improving the robustness of the classification process against adversarial attacks. By addressing these limitations, we can further advance the field of traffic classification and contribute to the development of more effective and efficient solutions for network security.