Open Access Article

PREIUD: An Industrial Control Protocols Reverse Engineering Tool Based on Unsupervised Learning and Deep Neural Network Methods

Bowei Ning ^1,2, Xuejun Zong ^1,2,*, Kan He ^1,2 and Lian Lian ^1,2

^1 College of Information Engineering, Shenyang University of Chemical Technology, Shenyang 110142, China

^2 Key Laboratory of Information Security for Petrochemical Industry in Liaoning Province, Shenyang 110142, China

^* Author to whom correspondence should be addressed.

Symmetry 2023, 15(3), 706; https://doi.org/10.3390/sym15030706

Submission received: 10 January 2023 / Revised: 7 March 2023 / Accepted: 9 March 2023 / Published: 11 March 2023

(This article belongs to the Section Computer)


Figure 1. The system architecture of PRE. Statistics for output types are reflected in the word cloud.
Figure 2. The processing steps of PREIUD.
Figure 3. An example of the BVE algorithm’s voting process.
Figure 4. The overall architecture of the BiLSTM-AM-CRF model.
Figure 5. The PLC simulation platform used in the experiment.
Figure 6. Accuracy of feature extraction.
Figure 7. Conciseness of feature extraction.
Figure 8. Coverage of feature extraction.
Figure 9. Performance comparison of three protocol reverse tools.


Abstract

The security of industrial control systems relies on the communication and data exchange capabilities provided by industrial control protocols, which can be complex, and may even use encryption. Reverse engineering these protocols has become an important topic in industrial security research. In this paper, we present PREIUD, a reverse engineering tool for industrial control protocols, based on unsupervised learning and deep neural network methods. The reverse process is divided into stages. First, we use the bootstrap voting expert algorithm to infer the keyword segment boundaries of the protocols, considering the symmetry properties. Then, we employ a bidirectional long short-term memory conditional random field with an attention mechanism to classify the protocols and extract their format and semantic features. We manually constructed data sample sets for six commonly used industrial protocols, and used them to train and test our model, comparing its performance to two advanced protocol reverse tools, MSERA and Discoverer. Our results showed that PREIUD achieved an average accuracy improvement of 7.4% compared to MSERA, and 15.4% compared to Discoverer, while also maintaining a balance between computational conciseness and efficiency. Our approach represents a significant advancement in the field of industrial control protocol reverse engineering, and we believe it has practical implications for securing industrial control systems.

Keywords:

industrial control protocols; protocol reverse engineering; unsupervised learning; deep neural network

1. Introduction

With the rapid development of the industrial internet, computer networks are increasingly employed in industrial control systems (ICS). While new technologies have significantly improved production efficiency and operational convenience, they have also created opportunities for cyber attackers (commonly known as hackers). According to figures from the U.S. Industrial Control Systems Cyber Emergency Response Team (ICS-CERT) in 2021, as many as 637 industrial information security vulnerabilities were identified, including up to 449 attacks on industrial control systems in the petrochemical, energy, and critical manufacturing industries [1]. These attacks, which were initially carried out between countries, are becoming more frequent and increasingly common among hackers. Sometimes the purpose is to obtain a ransom, but there are also more harmful outcomes: forced shutdowns; leaks of asset information; destruction of facilities; and even nuclear accidents. Therefore, the demand for Industrial Security Protection (ISP) technology for ICS and equipment is growing daily.

An Industrial Control Protocol (ICP) is the “highway” of communication between the various components of an industrial control system. The protocol information, composed of binary numbers, defines the “traffic rules” between industrial equipment. Because of the privacy and profitability concerns of industrial control manufacturers, the communication protocols used by common PLCs (such as S7 Communication, GE SRTP, and Omron Fins) are not publicly available, which has promoted the development of protocol reverse technology for inferring ICP specifications.

Protocol reverse engineering (PRE) is an effective means of obtaining unknown protocol format specifications and inferring protocol syntax and semantics [2]. The format includes the keyword, length, and data type of each protocol field, and the semantics specify the specific meaning and content constraints of each field, taking into account the symmetry properties of the protocol. This information plays a crucial role in intrusion detection systems [3], vulnerability mining [4], and malware detection for protocol security [5]. For example, in order to fuzz a protocol, it is necessary to build a model of the protocol specification based on the reverse results of multiple messages. In addition, the design of an intrusion detection system requires field separation and data extraction from protocol messages, which in turn requires transforming the natural language description model into a specific implementation. An efficient representation of protocol reverse results can improve the effectiveness of this process, and can help to advance research in the field of industrial control protocol security, by considering the symmetry concept. While existing PRE methods are highly effective in the reverse analysis of traditional internet protocols, no efficient reverse engineering tool has been developed for industrial protocols. An ICP differs from an internet protocol because its structure is relatively flat and it has no delimiters [6]. The protocol fields contain many function codes corresponding to the specific functions and control actions of the ICS. Without prior knowledge, reverse tools designed for internet protocols cannot achieve desirable performance in deriving ICPs. Therefore, developing practical industrial protocol reverse tools has potential for broad application and high research significance.

For ICP reverse engineering, the critical challenges at this stage of research are the coverage and generality of industrial data sources, and the extraction methods and semantic inference of protocol features [7]. Various established solutions have been successfully applied to reversing unknown protocols in order to address these problems. The data streams are typically processed by combining bioinformatics, statistical analysis, or machine learning methods, which perform clustering analysis on messages [8]. However, deep learning is not widely used in protocol reverse engineering. In recent years, unsupervised deep learning has developed rapidly, and it has been widely used in natural language inference, sentiment analysis prediction, and text classification, with outstanding results. Therefore, given the shortcomings of the existing technology, this paper proposes an automatic reverse tool for industrial control system protocols, based on deep learning: PREIUD. Specifically, by examining the changing characteristics of the protocol fields, we propose an unsupervised learning field segmentation method that reflects the field sequence characteristics of various protocols, which increases the success rate of protocol field recognition. A bidirectional long short-term memory conditional random field with an attention mechanism (BiLSTM-AM-CRF) is then used to classify protocols, infer their format and semantic features, and output labels for protocol states. In this way, we propose a deep-learning-based method for reversing ICPs, addressing the issue that the majority of network protocol reverse tools on the market are unable to parse industrial protocols reliably and efficiently. The following are the primary contributions of our work:

We propose a novel unsupervised learning approach, namely the bootstrap voting expert algorithm, to address the challenge of protocol feature extraction. Our experimental results demonstrated that this method outperforms several commonly used unsupervised feature extraction algorithms in accurately inferring field boundaries.
The proposed tool for industrial control protocol reverse engineering, PREIUD, leverages a deep neural network model to facilitate protocol format and semantic inference. The model combines an attention mechanism with a bidirectional long short-term memory conditional random field (BiLSTM-AM-CRF), which enables the learning of potential dependencies between protocol fields, and enhances the accuracy of ICP reverse engineering. Notably, we have introduced the concept of sequence tagging into the field of protocol reverse engineering, which further enhances the accuracy and interpretability of the reverse results.
In contrast to most prior investigations of protocol reverse engineering, we generated sample datasets for our experiments, by utilizing traffic that had been collected from an offensive and defensive exercise platform that was based on actual industrial scenarios. The platform comprised control systems from various leading manufacturers, and a diverse range of industrial control protocols. In contrast to approaches that rely solely on static datasets or simulated traffic, this novel approach provided a more authentic representation of industrial traffic characteristics. It also surmounted the challenge of low test coverage, thereby enhancing the credibility of experimental findings and the generalizability of the model.
We employed a multidimensional quantitative evaluation method, based on fuzzy comprehensive evaluation, to compare the performance of two advanced protocol reverse tools (MSERA, Discoverer) against PREIUD. The experimental results indicated that PREIUD is more effective and practical for the reverse analysis of industrial control protocols.

The remainder of this work is structured as follows: Section 2 discusses the current state of protocol reverse engineering and its limitations; Section 3 describes the system design process and the technical aspects of protocol reverse engineering; Section 4 gives the PREIUD experimental and evaluation results; Section 5 highlights limitations as well as future work; Section 6 is a summary of the full study.

2. Related Work

In previous studies, traditional protocol reverse engineering has relied on manual analysis. In large-scale industrial control scenarios that require the use of PLC combinations from multiple manufacturers, many general-purpose and private protocols are mixed, and manual analysis of protocols has significant limitations. In order to improve reverse efficiency, research in recent years has mainly revolved around automatic protocol reverse engineering.

According to the input data source, automatic protocol reverse engineering methods infer either from execution traces or from network traces [9]. PRE based on execution traces usually needs access to the executable file that generates the communication protocol, in order to analyze the execution process of the protocol parsing program, which is difficult to achieve in a relatively fixed industrial control system environment. PRE based on network traces uses network traffic from the real environment as input; the reverse analysis is faster, more versatile, and easier to operate, and it can efficiently process message samples of various protocol categories. We therefore focused our research on network-trace-based PRE.

At present, research on PRE revolves around several methods, such as bioinformatics algorithms, statistical analysis, data mining, and natural language processing. The approximate process of PRE for ICP is shown in Figure 1. For internet protocols, Meng F [10] proposed a binary protocol reverse method combining hierarchical clustering and probabilistic alignment. The algorithm mainly parsed the frame format of the protocol at the byte level, but its inference of field segments and keywords was only moderate.

Kleber S [11] proposed the NEMETYL reverse tool, which combines Hirschberg alignment and the DBSCAN clustering algorithm, and infers binary protocols by comparing the similarity of consecutive fields. This method inferred protocol message boundaries better than previous sequence alignment algorithms; however, the data source contained few protocols, and its robustness on more unknown protocols still needed to be verified. Yang C [12] proposed a binary protocol reverse method based on deep learning, using the LSTM-FCN model to infer field change features, together with a field sequence encoding method. Nevertheless, the input only contained TCP and IPv4 protocol information, and the trained model was not universal. Wang Y [13] proposed an unknown protocol analysis method based on a convolutional neural network (CNN), which converts the protocol data into images for deep learning analysis. The accuracy of this method was better than that of the clustering algorithm; however, the data processing took too long to meet the standard of real-time analysis, and only theoretical solutions were given, without further development of reverse tools. Kiechle V [14] combined modular deep learning methods to design PREUNN, a reverse tool for text protocols, which uses an auto-encoder for feature extraction, long short-term memory for feature reverse engineering and state recognition, and a particle swarm algorithm for clustering; however, due to the black-box nature of neural networks, its performance could not be compared to other preprocessing methods.

For the development and application of reverse tools on industrial control protocols, Wang R [15] used a progressive multi-sequence clustering algorithm, generated V-gram separated message sequences, and extracted keywords through XGBoost, using data collected from a SIEMENS ICS simulation platform; however, due to the low coverage of the input samples, the application scope of this model was limited to Siemens PLCs, and the detection accuracy was average. IPART [16] adopted the extended voting expert (GVE) method to infer the protocol field boundaries and the format, and its data set included Modbus, IEC 104, and Ethernet. When this unsupervised learning method was used alone, it could easily be filtered when inferring variable-length protocol fields, which reduced the reliability of field boundary inference, and the low coverage of protocol samples also affected its generality.

Various techniques and tools can be used to determine the format of a protocol: Table 1 summarizes the methods and features of a selection of representative tools. Deep learning methods with high accuracy and efficiency can help to maintain broad coverage of protocol samples; however, further research is still needed to optimize the reverse scheme for ICPs. In order to address the challenges mentioned above, this study combined unsupervised learning and deep neural network methods. Specifically, it used bootstrap voting experts to efficiently segment protocol fields, and to generate input sequences of features. It also employed a network structure that combined a BiLSTM module and an attention mechanism. These neural network models were able to independently learn relevant features directly from sequences. The BiLSTM module extracted temporal features, while the attention mechanism supplemented this by learning relevant information from various representation subspaces. This resulted in a more comprehensive feature representation that took multiple perspectives into account. Finally, the protocol sequence was identified and annotated using conditional random fields (CRF), to obtain complete format and semantic information.

3. Materials and Methods

In this section, we describe the principle and architecture of the proposed PRE, which consists of multiple steps that are processed sequentially. These steps include preprocessing of industrial protocol data, protocol field segmentation and feature extraction, inference of protocol format and semantics, and generation of tag sequences. The processing steps of the proposed PRE are shown in Figure 2.

3.1. Data Collection and Preprocessing

For data collection and preprocessing, we use Wireshark, a widely used network protocol analyzer, to capture and analyze industrial traffic. The traffic generated in an industrial environment often contains multiple protocols. To separate the target protocol data from other data when the target protocol is unidentified, characteristics of the target protocol payload must be used to distinguish it. We use the Wireshark protocol filtering function to filter and classify the protocol information of the captured industrial traffic, deleting unnecessary communication protocol information, and organizing the target protocol into a separate packet capture (PCAP) file for further analysis.

For an ICP at the application layer, such as Siemens’ S7comm protocol, the analysis should remove invalid, duplicated, and retransmitted packets, filter out the underlying communication protocols (such as TCP/IP), and extract the payload (application layer data). For data processing, we transform the binary data into hexadecimal form. The messages are then split into various samples based on packet length.
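The payload cleaning described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the application-layer payloads have already been exported from the capture as raw byte strings, the function name `prepare_samples` is hypothetical, and dropping empty and exactly repeated payloads is a simplification of removing invalid, duplicated, and retransmitted packets.

```python
from collections import defaultdict

def prepare_samples(payloads):
    """Deduplicate payloads, convert them to hex, and group them by length.

    Assumption: `payloads` is an iterable of application-layer byte strings
    that were already separated from the TCP/IP headers.
    """
    seen = set()
    groups = defaultdict(list)
    for payload in payloads:
        if not payload or payload in seen:
            continue  # skip empty (invalid) and repeated payloads
        seen.add(payload)
        groups[len(payload)].append(payload.hex())
    return dict(groups)

# Toy payloads (made up for illustration only).
samples = prepare_samples([
    bytes.fromhex("0300001611e0"),
    bytes.fromhex("0300001611e0"),  # repeated copy, dropped
    bytes.fromhex("32010000"),
    b"",                            # empty payload, dropped
])
```

Grouping by length keeps messages of the same size together, which mirrors the splitting into samples by packet length described in the text.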

3.2. Protocol Field Segmentation and Feature Extraction

ICP fields are fundamental elements that enable industrial automation and control systems to operate, as they allow devices to communicate with one another in a standardized manner. Different ICPs may have varying requirements for the types of information that can be included in their fields, and may also have different rules for formatting and transmitting that information. When reversing network protocols, the encoded content typically consists of public characters such as ASCII code, making it easy to determine the keywords in the protocol format. However, ICPs are mostly binary protocols that use custom characters, making reverse analysis more challenging than for network protocols. Industrial control manufacturers design their own character codes, which are oriented towards data structures. In the reverse analysis of such protocols, it is difficult to determine the boundaries between fields without prior knowledge, such as the delimiters used in network protocols. Therefore, before analyzing the format and semantics of ICPs, it is necessary to perform protocol field segmentation and feature extraction. The result of field segmentation has a significant impact on subsequent semantic reasoning: if sequences of fields representing different meanings and functions are erroneously combined, the inference of the protocol format can be severely biased.

To avoid the reverse failure of the protocol caused by the wrong segmentation of fields, the field segmentation method before the inference of the protocol format should perform feature extraction according to the characteristics of ICPs. Key fields in protocols, which are at fixed positions and have fixed lengths and special meanings—including the protocol ID, serial number, parameter length, timestamp, function code, and constant parameters—are referred to as protocol features [17]. ICPs vary, in terms of length, location, encoding form, and encryption method, for these key fields; therefore, the reasoning of these key fields is a crucial factor in the success of protocol reverse engineering.

We choose the unsupervised learning Voting Expert (VE) algorithm [18] to extract the required fields of the target protocol. The algorithm is good at dealing with long fields without spaces and identifiers, and autonomously determines where to split the sequence, by calculating statistics on sub-sequences or grammars. VE focuses on the frequency of occurrences of field sequences and the uncertainty of characters following the sequence: essentially, it creates a sliding window on the text, to determine field boundaries by voting. If a field occurs frequently, and the subsequent bytes are highly variable, the field is likely to be a statement with a fixed meaning.

VE generally uses two voting experts, and cuts the text at the positions with a local maximum number of votes. The first voting expert uses the word’s internal entropy $H_1$, which means that if a word always appears as a unit in a binary message, it should be retained as a whole. $H_1$ is defined in (1):

$$H_1 = -\log_2 P(\omega) \tag{1}$$

where $P(\omega)$ represents the probability of occurrence of the sub-sequence $\omega$ in the binary message. The lower $H_1$ is, the more frequently $\omega$ occurs, and the more likely $\omega$ is to appear as a whole. The other voting expert uses the outer boundary entropy, $H_2$, of the word, which means that if the content following the word varies greatly, a boundary should be added between the word and the subsequent content. The definition of $H_2$ is shown in (2):

$$H_2 = -\sum_{c \in C} P(c \mid \omega) \log_2 P(c \mid \omega) \tag{2}$$

where $C$ represents the set of all possible variables after the sub-sequence $\omega$, and $P(c \mid \omega)$ represents the probability that the variable $c$ occurs after the sequence $\omega$. The larger $H_2$ is, the more the content changes after $\omega$, and the more likely the point after $\omega$ is a word boundary. To realize the effective division of words, a sliding window of size $n$ is used, and a search tree with a depth of $n + 1$ is generated, in which all possible character combinations in the data set are stored.
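As a rough illustration of (1) and (2), the two entropies can be estimated from sub-sequence counts. The sketch below uses plain counters instead of the search tree described above, and it estimates $P(\omega)$ among sub-sequences of the same length; both choices, and all function names, are assumptions made for illustration rather than details taken from the paper.

```python
import math
from collections import Counter, defaultdict

def ngram_stats(messages, max_len):
    """Count every sub-sequence of up to max_len bytes and the byte following it."""
    counts = Counter()
    followers = defaultdict(Counter)
    for msg in messages:
        for i in range(len(msg)):
            for n in range(1, min(max_len, len(msg) - i) + 1):
                w = msg[i:i + n]
                counts[w] += 1
                if i + n < len(msg):
                    followers[w][msg[i + n]] += 1
    return counts, followers

def internal_entropy(w, counts):
    """H1 = -log2 P(w), with P(w) estimated among same-length sub-sequences (assumption)."""
    if counts[w] == 0:
        return math.inf  # unseen sub-sequences are treated as maximally unlikely
    total = sum(c for s, c in counts.items() if len(s) == len(w))
    return -math.log2(counts[w] / total)

def boundary_entropy(w, followers):
    """H2 = -sum_c P(c|w) log2 P(c|w) over the bytes c observed right after w."""
    nxt = followers.get(w, {})
    total = sum(nxt.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in nxt.values())

# Toy messages: the prefix 03 00 is constant, the third byte varies.
counts, followers = ngram_stats([bytes.fromhex("0300aa"), bytes.fromhex("0300bb")], max_len=2)
h1 = internal_entropy(bytes.fromhex("0300"), counts)
h2 = boundary_entropy(bytes.fromhex("0300"), followers)
```

In the toy data, the sub-sequence `03 00` is followed by two different bytes, so its boundary entropy is higher than that of the single byte `03`, which is always followed by `00`.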

The voting expert algorithm is divided into two stages: voting and judgment. In the voting stage, a window of size $n$ slides over the sequence and, according to the search tree, the two experts vote within each window according to (3) and (4). Each point $a$ receives a voting score $V(a)$, as shown in (5), where $\mathbb{1}(\cdot)$ is the indicator function. In the judgment stage, we determine the boundary positions by rule: if the score of point $a$ exceeds the scores of its adjacent points, or $V(a)$ is greater than the threshold set by the system, $a$ is determined to be a field boundary. Finally, the public fields of the protocol are obtained.

$$a_i^1 = \underset{a_i^1 = i + j}{\arg\min} \left( H_1(\omega_{i,\,i+j}) + H_1(\omega_{i+j+1,\,i+n}) \right) \tag{3}$$

$$a_i^2 = \underset{a_i^2 = i + j}{\arg\max} \; H_2(\omega_{i,\,i+j}) \tag{4}$$

$$V(a) = \sum_i \left( \mathbb{1}(a = a_i^1) + \mathbb{1}(a = a_i^2) \right) \tag{5}$$
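The voting and judgment stages can be sketched as below. This is a simplified illustration under stated assumptions: the entropy functions `h1` and `h2` are passed in as callables (here replaced by toy stand-ins), a split position `a` means "cut before index `a`", and the function name `vote_boundaries` is hypothetical.

```python
from collections import Counter

def vote_boundaries(seq, n, h1, h2, threshold):
    """Slide a window of size n over seq and let two experts vote on split points."""
    votes = Counter()
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        # Expert 1: split that minimizes the summed internal entropy of both parts, as in (3).
        j1 = min(range(1, n), key=lambda j: h1(window[:j]) + h1(window[j:]))
        # Expert 2: split after the prefix with maximal boundary entropy, as in (4).
        j2 = max(range(1, n), key=lambda j: h2(window[:j]))
        votes[i + j1] += 1
        votes[i + j2] += 1
    # Judgment: keep local maxima of V(a) and points whose score exceeds the threshold.
    boundaries = []
    for a in sorted(votes):
        is_peak = votes[a] > votes.get(a - 1, 0) and votes[a] > votes.get(a + 1, 0)
        if is_peak or votes[a] > threshold:
            boundaries.append(a)
    return dict(votes), boundaries

# Toy stand-ins for the entropies (assumptions, not estimated from data).
toy_h1 = lambda w: 0.0 if w in (b"ab", b"cd") else 2.0
toy_h2 = lambda w: 1.0 if w == b"ab" else 0.0
votes, boundaries = vote_boundaries(b"abcd", n=4, h1=toy_h1, h2=toy_h2, threshold=3)
```

With these stand-ins, both experts agree that the window `abcd` should be cut between `ab` and `cd`, so position 2 collects two votes and is reported as a boundary.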

However, the VE algorithm does not process industrial protocol information as well as it processes natural language information. For natural language, the reasoning algorithm has prior knowledge of the language, which limits the sub-sequences that can occur during processing. For protocol information, far more sub-sequences can occur: sparser data streams lead to more incorrect byte combinations, the data streams are more difficult to compress, and the computation takes up more memory and reduces work efficiency. In particular, this sparsity problem leads to an explosion of node space in the algorithm design. Therefore, it is necessary to optimize and improve the algorithm based on VE.

Bootstrap Voting Expert (BVE) [19] is an extension of the voting expert algorithm for the unsupervised segmentation and extraction of sequences. BVE repeats the process of splitting a sequence, with each split incorporating knowledge gained from previous splits. First, after performing forward voting, BVE reverses the sequence of protocol information, creates a new sliding window, and votes again from back to front. The segmentation points obtained in the reverse direction are intersected with the forward result, to obtain a set of high-precision segmentation points on which the forward and reverse directions agree. Subsequently, BVE continuously lowers the voting threshold through an iterative process of repeated cuts, and stores the knowledge obtained from the previous cuts in the knowledge tree. At the same time, a third voting expert (the knowledge expert) is added during the voting process: its statistics are provided by the knowledge tree, placing sliding windows at the most frequently occurring fields in the previous iteration. We regard the sequence before segmentation as the end of a keyword, corresponding to the internal entropy $H_1(\#\,\mathrm{after})$ of the latter segment. The segmented sequence is regarded as the beginning of a keyword, corresponding to the internal entropy $H_1(\mathrm{before}\,\#)$ of the previous segment. The knowledge expert votes for the location with the minimum sum of these two internal entropies, as shown in (6).

$$\underset{\mathrm{before},\,\mathrm{after}}{\arg\min} \left( H_1(\mathrm{before}\,\#) + H_1(\#\,\mathrm{after}) \right) \tag{6}$$

Each iteration of BVE generates a candidate split, until the voting threshold is reduced to a minimum value. Finally, the boundaries of the protocol fields are jointly selected by three voting experts, as shown in Figure 3. After BVE processing, we get the protocol’s feature sequence, $X = (x_1, x_2, \ldots, x_n)$, which becomes the input of the protocol format and semantic inference model.
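The forward and backward intersection step of BVE can be illustrated with a small sketch. It covers only that step (not the knowledge tree or the threshold schedule), and it assumes a boundary detector `find_boundaries` that returns split positions for a sequence; the detector used here is a toy stand-in and all names are hypothetical.

```python
def intersect_forward_backward(seq, find_boundaries):
    """Keep only split points found both on seq and on its reversal."""
    n = len(seq)
    forward = set(find_boundaries(seq))
    # A split before index a in the reversed sequence maps to a split before n - a.
    backward = {n - a for a in find_boundaries(seq[::-1])}
    return sorted(forward & backward)

def split_at(seq, boundaries):
    """Cut seq into fields at the given split positions."""
    cuts = [0, *boundaries, len(seq)]
    return [seq[start:end] for start, end in zip(cuts, cuts[1:])]

# Toy detector (assumption): cut wherever two neighbouring bytes differ.
def change_points(s):
    return [i for i in range(1, len(s)) if s[i] != s[i - 1]]

# Toy asymmetric detector (assumption): cut only where the byte value increases.
def rising_points(s):
    return [i for i in range(1, len(s)) if s[i] > s[i - 1]]

agreed = intersect_forward_backward(b"aabbb", change_points)
fields = split_at(b"aabbb", agreed)
rejected = intersect_forward_backward(b"aabbb", rising_points)
```

The second detector finds a split only in the forward direction, so the intersection discards it; this is the sense in which agreement between both directions filters out less reliable segmentation points.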

3.3. Protocol Format and Semantic Inference Model

The format and semantic inference of the protocol employs a bidirectional long short-term memory conditional random field model, which is a neural network that is effective for modeling sequential data. The network consists of long short-term memory (LSTM) cells that can store and access information over a prolonged period, and conditional random fields (CRF) that model the dependencies between different elements of the sequence. The attention mechanism allows the network to selectively focus on different parts of the input sequence, enabling it to attend to the most relevant information, and to make more accurate predictions.

The model is trained by providing it with sequences of protocol messages or a sequence of features along with the corresponding labels. As training progresses, the weights of the network are adjusted, to improve the accuracy of the predictions. The output of the model is a probability distribution over the possible labels for each element of the input sequence. The most likely label is then chosen for each element, forming a complete, labeled sequence. This labeled sequence can be used to infer the format and semantic information of the input protocol. The overall architecture of our proposed model is depicted in Figure 4. The model comprises five layers: an embedding layer; a BiLSTM layer; an attention layer; a dense layer; and a CRF layer. Each part of our proposed model will be described in more detail in the following sections.

3.3.1. Embedding Layer

When we process data with a neural network model, we need to convert the input sentence into vector data. Previous studies have shown that one-hot vectors, which are simple and sparse, do not represent the correlations between sequences well, and are prone to the curse of dimensionality [20]. Therefore, we use the word embedding model word2vec to process the sequence of features. This embedding method, which is based on distributed vector representation, can learn the semantic representation of words from their context in the text, and express words as low-dimensional dense vectors. These dense vectors are called word embedding vectors, and can be used to represent the semantic information of words [21]. To enhance the embedded feature representation, we integrate a feature repository containing key field semantics and offsets (i.e., the position of the specified field in the protocol). Unlike traditional feature embedding methods, which only represent feature labels as a single vector, our method combines each sequence and its feature information, to obtain more hidden protocol states, and to simultaneously satisfy multiple reasoning requirements for an ICP specification. Semantic feature embedding vectors and offset embedding vectors are both represented by word2vec. Formally, we map the input feature sequence $X$ into a distributed feature representation on the embedding layer. The vector matrix $X_i$ output by the embedding layer is shown in (7):

$$X_i = \left[ X_{l_t}(x_i) \oplus X_{o_t}(\mathrm{num}(x_i)) \right] \tag{7}$$

where $X_{l_t}$ and $X_{o_t}$ denote semantic feature label embedding vectors and offset label embedding vectors, respectively, and $\oplus$ is the concatenation operation.
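The concatenation in (7) can be sketched as follows. The paper obtains both embeddings with word2vec; the sketch below replaces them with random lookup tables purely to show how a label embedding and an offset embedding are joined, so the tables, dimensions, and function names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
LABEL_DIM, OFFSET_DIM = 8, 4  # hypothetical embedding sizes

def build_table(keys, dim):
    """Random stand-in for a trained embedding table (NOT word2vec)."""
    return {key: rng.normal(size=dim) for key in keys}

def embed_sequence(fields, label_table, offset_table):
    """Concatenate each field's label embedding with the embedding of its offset."""
    rows = [
        np.concatenate([label_table[field], offset_table[offset]])
        for offset, field in enumerate(fields)
    ]
    return np.stack(rows)

# Toy segmented message: three hex-encoded fields.
fields = ["0300", "0016", "11e0"]
label_table = build_table(set(fields), LABEL_DIM)
offset_table = build_table(range(len(fields)), OFFSET_DIM)
X = embed_sequence(fields, label_table, offset_table)  # shape (3, 12)
```

Each row of `X` carries both what the field is and where it sits in the message, which is the extra positional information the offset embedding is meant to provide.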

3.3.2. BiLSTM Layer

The BiLSTM layer is a variant of the long short-term memory (LSTM) layer, which is a type of recurrent neural network (RNN) that can capture long-term dependencies in sequential data [22]. LSTMs are designed to address the issue of long-term dependencies by incorporating a memory cell, and by using several gates to control the input, output, and forgetting of information from the current and previous states. The hidden layer of an LSTM consists of memory cells that have four components: an input gate $i_t$, to regulate the flow of the input signal; an output gate $o_t$, to control the strength of the signal passed to the next unit; a cyclically linked cell, $c_t$; and a forget gate $f_t$, to control how much of the previous cell state is forgotten.

One advantage of using a BiLSTM layer is that it can capture both short-term and long-term dependencies in the data, because the LSTM cells in the layer have a “memory” that allows them to store and access information from earlier in the sequence. By processing the data in both the forward and backward directions, a BiLSTM layer can capture even more context, and make more accurate predictions. The BiLSTM layer takes the protocol message as the input sequence, and produces an output sequence vector, $h = (h_1, h_2, \ldots, h_n)$, that represents the sequence at each time step in the input sequence.

The input vector matrix $FS_i$ contains $n$ embedding vectors, each represented as a $d$-dimensional vector $g_t$, combining the corresponding feature label vector $X_{l_t}$ and offset label vector $X_{o_t}$. For each position $t$, the LSTM computes the current hidden state $h_t$, using the input vector $g_t$ and the previous state $h_{t-1}$. We used the following implementation:

$$i_t = \sigma(W_i g_t + U_i h_{t-1} + b_i) \tag{8}$$

$$f_t = \sigma(W_f g_t + U_f h_{t-1} + b_f) \tag{9}$$

$$o_t = \sigma(W_o g_t + U_o h_{t-1} + b_o) \tag{10}$$

$$u_t = \tanh(W_u g_t + U_u h_{t-1} + b_u) \tag{11}$$

$$c_t = i_t \otimes u_t + f_t \otimes c_{t-1} \tag{12}$$

$$h_t = o_t \otimes \tanh(c_t) \tag{13}$$

where $W_i, U_i$, $W_f, U_f$, and $W_o, U_o$ represent the weight matrices of the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$, respectively; the parameters $b_i$, $b_f$, and $b_o$ are the bias vectors of the input gate, forget gate, and output gate, respectively; the parameters $W_u$, $U_u$, and $b_u$ are the weight matrices and bias vector of the new memory content $u_t$; $\sigma$ is the sigmoid function; $\tanh$ is the hyperbolic tangent function; and the operator $\otimes$ denotes element-wise multiplication.
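A single LSTM step following (8) to (13) can be written directly in NumPy. This is a minimal sketch for checking the equations, not the trained model: the parameter dictionary layout, the random initialization, and the function names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(d_in, d_hid, seed=0):
    """Random weights W_*, U_* and zero biases b_* for the gates i, f, o and candidate u."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in "ifou":
        params[f"W_{name}"] = rng.normal(scale=0.1, size=(d_hid, d_in))
        params[f"U_{name}"] = rng.normal(scale=0.1, size=(d_hid, d_hid))
        params[f"b_{name}"] = np.zeros(d_hid)
    return params

def lstm_step(g_t, h_prev, c_prev, p):
    """One time step: returns the new hidden state h_t and cell state c_t."""
    i_t = sigmoid(p["W_i"] @ g_t + p["U_i"] @ h_prev + p["b_i"])  # (8)
    f_t = sigmoid(p["W_f"] @ g_t + p["U_f"] @ h_prev + p["b_f"])  # (9)
    o_t = sigmoid(p["W_o"] @ g_t + p["U_o"] @ h_prev + p["b_o"])  # (10)
    u_t = np.tanh(p["W_u"] @ g_t + p["U_u"] @ h_prev + p["b_u"])  # (11)
    c_t = i_t * u_t + f_t * c_prev                                # (12)
    h_t = o_t * np.tanh(c_t)                                      # (13)
    return h_t, c_t

params = init_params(d_in=3, d_hid=2)
h_t, c_t = lstm_step(np.ones(3), np.zeros(2), np.zeros(2), params)
```

A quick sanity check is to zero all parameters: every gate then evaluates to 0.5 and the candidate $u_t$ to 0, so the cell state is simply halved at each step.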

The BiLSTM layer runs one LSTM over the protocol message from the first field to the last, and a second LSTM from the last field to the first, so that the representation of each position reflects both the preceding and the following context. The hidden state of the BiLSTM at position t combines the two directional states, as follows:

$$h_{t} = [\overrightarrow{h_{t}} \oplus \overleftarrow{h_{t}}] \tag{14}$$

where $\overrightarrow{h_{t}}$ and $\overleftarrow{h_{t}}$ are the hidden states of, respectively, the forward and backward LSTM at position t.
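The combination in Equation (14) can be sketched as follows, reading $\oplus$ as vector concatenation (the usual convention for BiLSTM models, and an assumption here). The helper names are hypothetical, and the recurrent cell is replaced by a toy running-sum step so that the forward and backward context at each position is easy to see.

```python
def run_direction(inputs, step, h0):
    """Apply a recurrent step function left to right and collect every state."""
    states, h = [], h0
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states

def bidirectional(inputs, fwd_step, bwd_step, h0):
    """Concatenate forward and backward states position by position (Eq. 14)."""
    fwd = run_direction(inputs, fwd_step, h0)
    # Run the backward pass on the reversed input, then restore the original order.
    bwd = run_direction(inputs[::-1], bwd_step, h0)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]  # list '+' is concatenation

def running_sum(x, h):
    """Toy stand-in for an LSTM cell: the state is the sum of inputs seen so far."""
    return [h[0] + x]

if __name__ == "__main__":
    print(bidirectional([1, 2, 3], running_sum, running_sum, [0]))
```

In the toy example, the first component of each output is the sum of the inputs up to that position and the second is the sum of the inputs from that position onwards, which mirrors how $h_{t}$ carries both past and future context.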

3.3.3. Attention Layer

The role of the attention layer is to put more weight on the features of critical fields, so that important semantic information can be filtered out of complex protocol information more effectively, and the impact of non-key fields on protocol semantic reasoning is reduced. The attention mechanism maps a query and a set of key-value pairs to an output vector representation [23]. To apply the attention mechanism, the output vector matrix E from the BiLSTM layer is projected into three input matrices of dimension d, which serve as the queries Q, keys K, and values V. Using the correlation between queries and keys, the attention function calculates weights for the values and obtains a mixed vector representation, as shown in Equation (15).

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V \tag{15}$$

where T represents the transpose operation of a matrix, d is the number of hidden units of our network, and $\sqrt{d}$ is a scaling factor that keeps the magnitude of the dot products under control before the softmax is applied.
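As an illustration of Equation (15), the sketch below computes scaled dot-product attention on small nested lists. It is a simplified example rather than our training code: the function names are hypothetical, and d is taken to be the length of each key vector.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])  # assumption: d equals the key dimension
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d) for k_row in K]
              for q_row in Q]
    weights = [softmax(row) for row in scores]
    output = [[sum(w * v_row[j] for w, v_row in zip(w_row, V)) for j in range(len(V[0]))]
              for w_row in weights]
    return output, weights

if __name__ == "__main__":
    out, w = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
    print(out, w)
```

A query that is orthogonal to every key yields uniform weights, so the output is the plain average of the value vectors; a query aligned with one key shifts the weight towards the corresponding value.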

Our implementation differs from popular attention mechanisms in that it uses a bidirectional attention network [24]: it focuses on perceiving the context of the critical fields of the protocol, and assists in the prediction of the protocol syntax by providing complementary information. We define the forward hidden state sequence $Ef_{i}$ and the backward hidden state sequence $Eb_{j}$, and use the similarity of the forward and backward hidden sequences to measure the weight of the feature words. We use the softmax function to get the normalized weight $\alpha_{ij}$ of each key field, as expressed in Equation (16). The attention-weighted sum yields the context vector of the response, and we then calculate the bidirectional keyword-weighted sums based on the weights, as shown in Equations (17) and (18).

$$\alpha_{ij} = \frac{\exp(Ef_{i} * Eb_{j})}{\sum_{i, j = 1}^{M, N} \exp(Ef_{i} * Eb_{j})} \tag{16}$$

$$C_{f} = \sum_{i = 1}^{M} (\alpha_{i} * Ef_{i}) \tag{17}$$

$$C_{b} = \sum_{j = 1}^{N} (\alpha_{j} * Eb_{j}) \tag{18}$$

where $C_{f}$ is the output of the forward attention layer, and $C_{b}$ is the output of the backward attention layer. BiLSTM-AM obtains annotations for a given feature sequence by linking $C_{f}$ and $C_{b}$; then, in the dense layer, the concatenation of $C_{f}$, $C_{b}$, and the BiLSTM output $h_{t}$ is multiplied by a weight matrix $W_{z}$ and passed through a tanh function, producing the vector $Z_{t}$ that represents each field and forms the output of the attention layer:

$$Z_{t} = \tanh(W_{z} [C_{f} \oplus C_{b}; h_{t}]) \tag{19}$$
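The bidirectional weighting of Equations (16)–(19) can be sketched as below. Two points are assumptions of this sketch and are not stated in the equations: the product $Ef_{i} * Eb_{j}$ is read as a dot product, and the per-position weights $\alpha_{i}$ and $\alpha_{j}$ are taken to be the row and column sums of $\alpha_{ij}$. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pair_weights(Ef, Eb):
    """Eq. (16): softmax over all forward/backward pairs (i, j)."""
    scores = [[dot(f, b) for b in Eb] for f in Ef]
    m = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(s - m) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def context_vectors(Ef, Eb):
    """Eqs. (17)-(18), with alpha_i and alpha_j assumed to be marginals of alpha_ij."""
    alpha = pair_weights(Ef, Eb)
    alpha_i = [sum(row) for row in alpha]
    alpha_j = [sum(alpha[i][j] for i in range(len(Ef))) for j in range(len(Eb))]
    C_f = [sum(a * f[k] for a, f in zip(alpha_i, Ef)) for k in range(len(Ef[0]))]
    C_b = [sum(a * b[k] for a, b in zip(alpha_j, Eb)) for k in range(len(Eb[0]))]
    return C_f, C_b, alpha

def field_vector(W_z, C_f, C_b, h_t):
    """Eq. (19): tanh of a dense projection of the concatenated vectors."""
    x = C_f + C_b + h_t  # list concatenation
    return [math.tanh(dot(row, x)) for row in W_z]

if __name__ == "__main__":
    C_f, C_b, alpha = context_vectors([[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
    print(C_f, C_b, field_vector([[0, 0, 1, 0, 0, 0]], C_f, C_b, [0.0, 0.0]))
```

Other readings of $\alpha_{i}$ and $\alpha_{j}$ are possible; the marginal interpretation is chosen here only because it keeps each set of weights normalized.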

3.3.4. CRF Layer

Conditional random fields (CRFs) are a type of probabilistic graphical model that can be used for tasks such as sequence labeling, segmentation, and structured prediction [25]. One of the key advantages of CRFs is that they capture dependencies between the output variables in the sequence, which is useful for tasks such as protocol semantic labeling and state generation, where the label of a given field depends on the labels of the preceding and following fields.

The CRF layer takes the sequence $Z = (z_{1}, z_{2}, \ldots, z_{n})$ as input, and predicts the most probable output sequence $Y = (y_{1}, y_{2}, \ldots, y_{n})$. The score of the tag sequence, $S(X, Y)$, is defined as follows:

$$S(X, Y) = \sum_{i = 1}^{n} P_{i, y_{i}} + \sum_{i = 0}^{n} T_{y_{i}, y_{i + 1}} \tag{20}$$

where $P_{i, y_{i}}$ is the score of tag $y_{i}$ for the ith field in the sequence, and $T_{y_{i}, y_{i + 1}}$ is the entry of the transition score matrix T that gives the score of a transition from tag $y_{i}$ to tag $y_{i + 1}$. Then, the log-probability of the ground-truth label sequence Y is defined as

$$\ln(P(Y \mid X)) = S(X, Y) - \ln\left(\sum_{\tilde{y} \in Y_{x}} e^{S(X, \tilde{y})}\right) \tag{21}$$

where $\tilde{y}$ denotes an arbitrary label sequence in the set $Y_{x}$ of candidate label sequences. The maximum log-likelihood can be calculated over the entire training set, using dynamic programming, as shown in Equation (21). The Viterbi algorithm [26] can then be used to find the optimal label sequence for any input sequence, by maximizing the score, as shown in Equation (22), allowing us to efficiently use past and future tags to predict the current tag and obtain the best label sequence for the ICPs. While decoding, we predict the output sequence $y^{*}$ that obtains the maximum score:

$$y^{*} = \underset{\tilde{y} \in Y_{x}}{\operatorname{argmax}}\; S(X, \tilde{y}) \tag{22}$$
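The three CRF quantities, the score of Equation (20), the log-probability of Equation (21), and the decoding of Equation (22), can be illustrated with the following plain-Python sketch. It assumes that $y_{0}$ and $y_{n+1}$ are dedicated start and end tags stored as two extra rows and columns of T; this convention, the function names, and the toy scores are assumptions for illustration and are not taken from our implementation.

```python
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(P, T, y, start, end):
    """Eq. (20): emission scores plus transition scores, including start/end."""
    path = [start] + list(y) + [end]
    emit = sum(P[i][tag] for i, tag in enumerate(y))
    trans = sum(T[a][b] for a, b in zip(path, path[1:]))
    return emit + trans

def log_partition(P, T, k, start, end):
    """Log of the sum over all tag sequences, computed by the forward algorithm."""
    alpha = [T[start][t] + P[0][t] for t in range(k)]
    for i in range(1, len(P)):
        alpha = [logsumexp([alpha[s] + T[s][t] for s in range(k)]) + P[i][t]
                 for t in range(k)]
    return logsumexp([alpha[t] + T[t][end] for t in range(k)])

def log_likelihood(P, T, y, k, start, end):
    """Eq. (21): score of y minus the log-partition term."""
    return sequence_score(P, T, y, start, end) - log_partition(P, T, k, start, end)

def viterbi(P, T, k, start, end):
    """Eq. (22): highest-scoring tag sequence via dynamic programming."""
    delta = [T[start][t] + P[0][t] for t in range(k)]
    back = []
    for i in range(1, len(P)):
        new_delta, ptr = [], []
        for t in range(k):
            best = max(range(k), key=lambda s: delta[s] + T[s][t])
            ptr.append(best)
            new_delta.append(delta[best] + T[best][t] + P[i][t])
        delta = new_delta
        back.append(ptr)
    last = max(range(k), key=lambda t: delta[t] + T[t][end])
    path = [last]
    for ptr in reversed(back):  # follow back-pointers from the last position
        path.append(ptr[path[-1]])
    return path[::-1]

if __name__ == "__main__":
    # Two real tags (0, 1); index 2 is the start tag and index 3 the end tag.
    P = [[1.0, 0.2], [0.1, 0.9], [0.8, 0.3]]
    T = [[0.5, -0.2, 0.0, 0.1], [-0.3, 0.4, 0.0, 0.2],
         [0.0, 0.1, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
    best = viterbi(P, T, 2, 2, 3)
    print(best, log_likelihood(P, T, best, 2, 2, 3))
```

Both the forward recursion and the Viterbi recursion cost on the order of $n k^{2}$ operations for n fields and k tags, which is what makes training and decoding tractable compared with enumerating all $k^{n}$ label sequences.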

4. Experiment and Evaluation

In this section, we present the experiments performed to test the performance of the proposed industrial protocol reverse tool, PREIUD. The experiments were divided into two parts: (1) pre-implementation and correctness verification on the manually constructed ICPs dataset; (2) comparison of the performance to recent advanced protocol reverse tools or solutions.

4.1. Datasets

We collected our data from an offensive and defensive exercise platform that simulated the entire process of liquefied natural gas distribution and transportation in the Liaoning Provincial Key Laboratory of Petrochemical Industry Information Security [27]. The platform used real physical devices to create a realistic industrial environment, and included controllers from various major manufacturers, such as Siemens, Rockwell, Mitsubishi, Omron, and Emerson. We experimented with various protocols, including proprietary protocols (such as Siemens’ S7 Communication and Omron’s FINS) and general industrial protocols (such as Modbus/TCP, Ethernet/IP, DNP3, and IEC104). These protocols are commonly used in various industrial and power control systems; the DNP3 and IEC104 traffic came from third-party applications and packets. The programmable logic controller (PLC) equipment of the simulation platform used in the experiment is depicted in Figure 5. A summary of the protocol sources, types, and message flows used in the dataset is provided in Table 2.

4.2. Evaluation of Feature Extraction

In this work, we aimed to train voting experts to accurately identify key fields of different ICPs. The most critical question was whether the algorithm could accurately vote for key fields, and segment the protocol sequence. During the voting and judgment of the BVE algorithm, too many key fields could be acquired. The factors that produced this phenomenon included an insufficient number of protocol samples used for training, a high repetition rate, and a short window length. An excess of key fields would dilute the importance of protocol features in subsequent models. At the same time, if the sequence was segmented too finely, the protocol classification model could not obtain usable input from the feature extraction stage, which affected the effectiveness of the entire reverse scheme. We therefore conducted experiments on the influence of the sliding window size and voting threshold used in the BVE algorithm on key field extraction. We found that the sliding window size had little impact on the segmentation results in our experiments, where 20% of messages were extracted from each protocol: this was due to the sparsity of binary protocols. The effect of the threshold parameter was more direct: increasing the threshold effectively improved the accuracy of feature extraction, but at the same time reduced the confidence interval. In order to achieve a more balanced result in the segmentation of fields of multiple ICPs, we set the sliding window size to 4 and the threshold to 3.

After determining the parameter settings of the BVE algorithm, we compared its performance to several representative unsupervised feature extraction methods. We chose the traditional VE algorithm [18], the latent Dirichlet allocation (LDA) model [28], and the n-gram model [29] as baselines. The two-sample Kolmogorov–Smirnov test (KS test) [30], an efficient and general nonparametric method for comparing two samples, was performed on these four feature extraction methods.
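For readers unfamiliar with the two-sample KS test, the statistic is the largest vertical distance between the empirical distribution functions of the two samples. The sketch below computes only this statistic (not the p-value) in plain Python; the function name and the example samples, which stand in for field lengths produced by two extraction methods, are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs, checked at every observed value."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)  # fraction of a <= x
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)  # fraction of b <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d

if __name__ == "__main__":
    # Hypothetical field lengths (in bytes) inferred by two methods.
    print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A statistic of 0 means the two empirical distributions coincide, and a statistic of 1 means the samples do not overlap at all.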

For performance testing, we mainly considered the following three indicators:

Accuracy of feature extraction: we selected the first s key fields inferred, to compare to the prior knowledge of related protocols, so as to judge whether each method could accurately extract the features of the target protocol. No points were awarded if a specific keyword was omitted or a fixed field was wrongly split.
Conciseness of feature extraction: we counted the ratio of the number of extracted top s feature words to the number of all key fields of the corresponding protocol. An overly conservative segmentation strategy produced redundant protocol features; a lower ratio, therefore, reduced redundancy in the protocol format.
Coverage of feature extraction: we counted the proportion of the first s key fields covering the entire protocol information. Higher coverage could reflect the comprehensiveness of feature extraction.

The performance of the four feature extraction methods was compared. Figure 6 shows that the critical field recognition accuracy of the VE algorithm and the LDA model was significantly better than that of the n-gram model. The VE algorithm’s voting mechanism helped to improve the identification of keyword boundaries, and it identified key fields more accurately than the equal-length field division method used by the n-gram model. In theory, the LDA model was better at extracting subject information from sequences, as it was able to mine sequence topics in short processing fields. However, its recognition accuracy decreased when confronted with a wide variety of keywords and protocol information consisting mostly of long paragraphs. The BVE-based method, which included prior knowledge, had more stable recognition accuracy for the three protocols, which was 10% higher than that of the standard VE algorithm. Figure 7 shows the conciseness of feature extraction. When the n-gram method was used to divide fields, the strategy was too conservative, leading to the phenomenon of protocol format explosion. While the LDA model had a high recognition accuracy, its parameter k was generally too large, leading to a negative correlation between recognition accuracy and conciseness of feature extraction. The BVE-based method had less redundancy, with a ratio of less than 2:1 in the tests for the three protocols. In terms of coverage, as shown in Figure 8, the message coverage rate of the BVE method on the test set reached over 80%, which was higher than that of the other models and algorithms.

From a macro perspective, the VE-based method outperformed the n-gram model in feature extraction performance, and was more stable than the LDA model. The results also showed that the voting mechanism of VE was not very sensitive to different protocol fields, and was not prone to extreme field segmentation results. The BVE algorithm, combined with prior knowledge, raised the upper limit of VE performance, and overcame the difficulty that VE had in identifying low-frequency critical fields in short sequence protocols. The above performance indicators were acceptable, considering that the test set data came from a mixture of natural industrial and network environments reproduced by the simulation platform. The feature extraction results were also affected by the size and source of the data, in addition to the protocol type.

4.3. Evaluation of Format and Semantic Inference Model

4.3.1. Model Training Setting

Our experimental code was written in Python 3.6, with TensorFlow 1.13 and Keras as the deep learning framework. The server used to train the model was equipped with a 4.1 GHz CPU and an NVIDIA GeForce RTX 3060 (12 GB). Approximately 60% of the industrial control protocol dataset was used as the training set for the model, and 20% was used as the test set. For the training set, we used the Adam optimization algorithm [31], sampling tasks to train on in proportion to the number of training examples for each task. The model was trained with a learning rate of $10^{-3}$ and a mini-batch size of 64. Training was stopped when the learning rate dropped below $10^{-5}$ or when performance did not improve after five validation checks, to prevent overfitting. We also saved intermediate models for all epochs, and selected the best model as the baseline for evaluating performance. To prevent the gradient from exploding, we also used gradient clipping. The detailed parameter configurations for our approach can be found in Table 3.
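The stopping rule and the gradient clipping described above can be expressed framework-independently as in the sketch below. It is a schematic illustration rather than the Keras configuration we used: the function names are hypothetical, validation performance is assumed to be a score where higher is better, and clipping by global L2 norm is one common variant chosen here as an assumption (the clipping type and threshold are not specified in the text).

```python
import math

def should_stop(lr, val_history, min_lr=1e-5, patience=5):
    """Stop if the learning rate fell below min_lr, or if none of the last
    `patience` validation scores improved on the best earlier score."""
    if lr < min_lr:
        return True
    if len(val_history) <= patience:
        return False  # not enough checks yet to judge stagnation
    best_before = max(val_history[:-patience])
    return max(val_history[-patience:]) <= best_before

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

if __name__ == "__main__":
    print(should_stop(1e-3, [0.80, 0.70, 0.70, 0.70, 0.70, 0.70]))
    print(clip_by_global_norm([3.0, 4.0], 1.0))
```

The stopping rule presupposes a learning-rate schedule that lowers the rate during training; the schedule itself is outside the scope of this sketch.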

4.3.2. Results of Format and Semantic Inference

When evaluating the reverse effect of the model on ICP format and semantic inference, we compared the results output by PREIUD with the protocol priors obtained by the network protocol analyzer Wireshark in Table 4.

For the Modbus/TCP protocol, used in a variety of industrial control equipment, we obtained many request messages for reading and writing coil states and registers, reading discrete values, and diagnosing, in the process of PLC communication data capture based on industrial scenarios. The Modbus/TCP protocol model could accurately infer the format of the application layer. Nevertheless, for the data part, the model output results differed from the analysis result of Wireshark. PREIUD splits data into several different combinations of critical fields. Considering that the one-way communication data between PLCs in the actual industrial scene remains unchanged, the arbitrary division of data does not change the correctness of semantic reasoning.

At the same time, we also collected a small number of unknown functional protocol fields that Wireshark could not fully parse. Their distinguishing feature was that the analyzer could not recognize the function code, and the content after the function code was mostly empty characters, such as x000000. Combined with the analysis of the industrial environment, such unknown function fields may represent the abnormal response sent by the PLC to the host in the slave state. PREIUD either classifies such fields into a particular set of function codes, or creates a branching tree of new function codes through features such as data length. Compared to the network traffic analyzer based on matching rules, PREIUD pays more attention to the context structure and relationship of the protocol, and has a better ability to handle the protocol information of abnormal function codes and error codes.

4.3.3. Evaluation and Comparison Analysis

In order to verify that PREIUD performed well on both general and private ICPs, we compared its performance to two advanced, open-source protocol reverse tools under the same conditions. We adopted, and improved on, a multi-dimensional quantitative evaluation method based on the fuzzy comprehensive evaluation method [32]. We calculated performance metrics along the following dimensions: precision; recall; F1-score; conciseness; and efficiency. We evaluated each factor according to the following criteria:

Precision: described the rate at which samples of a particular type were correctly predicted. A higher precision score meant that the model was making fewer false positives.
Recall: described the rate at which samples of a particular type were correctly identified. A higher recall score meant that the model was better at identifying positive samples.
F1-score: this metric combined precision and recall into a single measure. A higher F1-score meant that the model performed well in both precision and recall.
Conciseness: described the complexity of the model. A simpler model was more interpretable and easier to understand. We counted the ratio of the number of protocol states inferred by PRE to the sample protocol types: the higher the ratio, the higher the redundancy. A ratio of less than 1 indicated that the semantic inference was incomplete, and that the item could not be scored.
Efficiency: described the speed and scalability of the model. A more efficient model could handle larger datasets and make predictions faster. We counted the time interval required to complete the reverse work of the protocol for a fixed-size data stream (1kb): the shorter the time interval, the higher the efficiency.
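The first three criteria follow the standard definitions based on true positives (TP), false positives (FP), and false negatives (FN) for a given protocol type. A minimal sketch, with a hypothetical function name and hypothetical labels, is:

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall, and F1 for one protocol type `label`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    # Hypothetical ground truth and predictions for four messages.
    truth = ["modbus", "modbus", "s7", "s7"]
    pred = ["modbus", "s7", "s7", "s7"]
    print(precision_recall_f1(truth, pred, "s7"))
```

In this toy case one Modbus message is mislabeled as S7, which lowers the precision for S7 and the recall for Modbus while leaving the other two values at 1.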

We created the following factors to facilitate quantitative analysis and data visualization:

$S = \{S_{1}, S_{2}, \ldots, S_{5}\} = \{\text{Precision}, \text{Recall}, \text{F1-score}, \text{Conciseness}, \text{Efficiency}\}$.

Each performance index was weighted according to its importance, and the weight vector is shown in Table 5.

As most of the works on reverse engineering of ICPs did not disclose their source code and datasets, we selected MSERA [33] and Discoverer [34], two comprehensive tools that had achieved results in the field of protocol reverse engineering, as the comparative PRE tools. Both tools could analyze binary and text protocols. For ICPs, especially private ICPs with encryption protection, we adjusted the sensitive parameters of the tools for different protocols, and obtained the optimal performance range of PRE, which was used as the performance evaluation index of the above items. As the two tools only considered the communication messages from one end, to avoid the data source’s influence on the evaluation results, we clustered the protocol fields of the client and server sides separately, and then performed a reverse analysis. The analysis results are shown in Table 6. Finally, we calculated the total evaluation value according to the maximum membership principle. The results are shown in Table 7, and Figure 9 provides an intuitive performance comparison chart.
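A fuzzy comprehensive evaluation of this kind can be sketched as follows: a weight vector over the five factors is combined with a membership matrix (one row per factor, one column per grade), and the grade with the largest combined membership is selected. The weighted-average operator, the three grades, and every number below are assumptions for illustration; they are not the weights of Table 5 or the results of Tables 6 and 7.

```python
def fuzzy_evaluate(weights, membership):
    """Combine factor weights with a membership matrix and pick the grade
    with the highest combined membership (maximum membership principle)."""
    n_grades = len(membership[0])
    combined = [sum(w * row[g] for w, row in zip(weights, membership))
                for g in range(n_grades)]
    best_grade = max(range(n_grades), key=lambda g: combined[g])
    return combined, best_grade

if __name__ == "__main__":
    # Hypothetical weights for Precision, Recall, F1-score, Conciseness, Efficiency.
    weights = [0.30, 0.20, 0.20, 0.15, 0.15]
    # Hypothetical memberships in the grades (good, fair, poor) for each factor.
    membership = [
        [0.7, 0.2, 0.1],
        [0.6, 0.3, 0.1],
        [0.6, 0.3, 0.1],
        [0.2, 0.5, 0.3],
        [0.1, 0.4, 0.5],
    ]
    print(fuzzy_evaluate(weights, membership))
```

In this hypothetical case, strong precision, recall, and F1-score outweigh weaker conciseness and efficiency, so the first grade receives the largest combined membership.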

Through the test data, we found that the inference accuracy of PREIUD was higher than that of MSERA and Discoverer, both for public and private ICPs. PREIUD’s format inference accuracy and coverage in all protocols were much higher than those of MSERA and Discoverer. MSERA improved the ICP analysis based on the internet protocol reverse tool Netzob [35]. Both MSERA and Discoverer used clustering algorithms to extract features, and inferred protocol types using sequence alignment [36]. MSERA used heuristic methods to mine fields with special semantics in ICPs. The effect was still not as good as the combination of unsupervised learning and neural network models, when analyzing data collected in industrial scenarios. In several ICPs with a small amount of data, the advantage of PREIUD in accuracy decreased slightly. This reflected the characteristics of the deep learning model: the larger the amount of training data, and the more complex the target protocol, the stronger the learning ability of the model, and the better the results that could be obtained. The performance of PREIUD was poorer in terms of conciseness and efficiency. The reason was that the neural network based on the attention mechanism needed more training data to grasp the context information of the protocol, which improved the format inference and semantic analysis capabilities, but prolonged the data processing time, which affected the efficiency of the protocol reverse analysis. The mature approach, based on clustering and multi-sequence alignment, was more concise and efficient. It can be seen that the neural network model still has considerable research space and development potential in the application of protocol reverse analysis.

5. Discussion

Through the above experiments and evaluations, we verified the efficiency of the unsupervised learning bootstrap voting expert algorithm for boundary segmentation and for feature extraction of crucial fields of the protocol on six commonly used industrial control protocols. For the semantic reasoning and recognition classification of industrial protocol fields, this paper adopted, for the first time, the bidirectional long short-term memory neural network model with an attention mechanism. This method has achieved remarkable results in text classification problems in natural language processing. By adding bidirectional attention weights to perceive the context content of protocol fields, we made up for the defect of BiLSTM in understanding the meaning of fields, strengthened the ability of protocol format reasoning, and met the basic requirements of potential behavior prediction of industrial protocol information. In the performance evaluation test, it was verified that the method used by PREIUD performed better than the classical protocol reverse tools, which use clustering algorithms and sequence alignment, in the essential performance indicators of protocol reverse engineering.

Advantages: There are currently no fully labeled public dataset samples in intrusion detection for the reverse technology of industrial protocols. Private industrial protocols are more complex and diverse in format and content than general industrial protocols. The data samples used in this paper were obtained in various ways, combining accurate industrial scene simulation platform capture and third-party libraries and applications, to generate data, overcoming the problem of incomplete coverage of PRE domain datasets. The protocol feature extraction and semantic reasoning methods of unsupervised learning were adopted, to avoid the inconsistency of the evaluation indicators of the reverse result caused by the non-homologous sample label data.

Limitations: PREIUD’s ability to format inference and to predict protocol behavior is significantly compromised when faced with private industrial protocols protected by data encryption. For encrypted information, the segmentation and feature extraction of plaintext and ciphertext regions are the key directions for PREIUD improvement. On the other hand, the addition of a deep neural network and attention mechanism reduces the processing speed of PRE. Reverse processing for large amounts of protocol information is less efficient than traditional clustering and sequence alignment methods. The next step is to simplify the neural network model while ensuring the accuracy of recognition and classification, and reducing the information redundancy generated by the redundant front and rear hidden layers.

Future work: Given the lack of conciseness and efficiency of the current PREIUD, in future research we will continue to make lightweight improvements to the protocol format inference and classification models, and explore the performance of other emerging language processing models, such as BERT [37] and XLNet [38], on protocol analysis, to enhance our method. Meanwhile, we will expand the scope of the target protocol, and move towards the reverse engineering of unknown private protocols outside industrial control systems, such as drone control protocols and in-vehicle IoT protocols. In addition, we will try to deploy PREIUD on embedded devices, build a full-flow protocol reverse analysis system for industrial scenarios, and perform real-time capture-and-reverse analysis of dynamic traffic in industrial scenarios. Furthermore, we will generate fake but credible data through reinforcement learning, and will build a fuzzing framework for industrial protocols based on reverse tools [39], to solve practical industrial control systems security problems, such as anomaly detection and vulnerability mining.

6. Conclusions

This paper presents PREIUD, a reverse engineering industrial control protocol tool based on unsupervised learning and deep neural network methods. The proposed tool is designed to reverse engineer industrial control protocols. It consists of multiple steps that are sequentially carried out. The first step is to infer the boundaries of key fields in the protocols, using the bootstrap voting expert (BVE) algorithm. Then, a bidirectional long short-term memory conditional random field, with an attention mechanism (BiLSTM-AM-CRF), is used to classify the protocols, infer their format and semantic features, and generate labels for the protocol states and key elements. We combined industrial traffic captured from a simulation platform of actual physical devices with protocol information generated by open-source applications, to construct a comprehensive and diverse sample set of reverse protocol data. The effectiveness of PREIUD was demonstrated by the reverse analysis results of six industrial protocols, and the performance was compared to the advanced protocol reverse tools, MSERA and Discoverer. The results show that PREIUD performs better in feature extraction and semantic inference.

In future work, we will aim to enhance the efficiency of protocol reverse engineering, while maintaining accuracy, by implementing a lightweight model architecture. Additionally, we will investigate the applicability of large-scale language processing models to the domain of ICP reverse engineering, and we will broaden the range of target protocols, to construct a comprehensive protocol reverse analysis system for industrial settings. This will enable us to establish a fuzz testing framework for industrial protocols, which could help to identify latent high-risk vulnerabilities in industrial control systems, and to address the fundamental challenges of industrial information security.

7. Patents

An efficient industrial control protocol analysis method based on deep learning

Inventor(s): Bowei Ning, Xuejun Zong and Kan He

Assignee: Zhigang Zhang

Patent number: CN2022102012027

Date of grant: 3 March 2022

Content: The method involves training a neural network on datasets collected by industrial control protocols and their hardware-in-the-loop platforms, and then using the trained network to accurately classify new protocols. The method has been shown to outperform existing methods, in terms of accuracy and speed, and has the potential to significantly improve the efficiency of industrial control systems.

Author Contributions

Software, B.N. and X.Z.; validation, B.N.; conceptualization, X.Z.; methodology, X.Z., K.H. and L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the LiaoNing Revitalization Talents Program, under Grant number XLYC2002085.

Data Availability Statement

Data is available upon request to the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

ICS-CERT 2021 Annual Vulnerability Coordination Report. Available online: https://www.cisa.gov/uscert/ics/alerts (accessed on 14 January 2022).
Narayan, J.; Shukla, S.K.; Clancy, T.C. A survey of automatic protocol reverse engineering tools. ACM Comput. Surv. (CSUR) 2015, 48, 1–26.
Aldallal, A. Toward Efficient Intrusion Detection System Using Hybrid Deep Learning Approach. Symmetry 2022, 14, 1916.
Luo, J.Z.; Shan, C.; Cai, J.; Liu, Y. IoT Application-Layer Protocol Vulnerability Detection using Reverse Engineering. Symmetry 2018, 10, 561.
Alomari, E.S.; Nuiaa, R.R.; Alyasseri, Z.A.A.; Mohammed, H.J.; Sani, N.S.; Esa, M.I.; Musawi, B.A. Malware Detection Using Deep Learning and Correlation-Based Feature Selection. Symmetry 2023, 15, 123.
Galloway, B.; Hancke, G.P. Introduction to industrial control networks. IEEE Commun. Surv. Tutor. 2012, 15, 860–880.
Sija, B.D.; Goo, Y.H.; Shim, K.S.; Hasanova, H.; Kim, M.S. A survey of automatic protocol reverse engineering approaches, methods, and tools on the inputs and outputs view. Secur. Commun. Netw. 2018, 2018, 8370341.
Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification algorithms: A survey. Information 2019, 10, 150.
Xiao, M.M.; Luo, Y.P. Automatic protocol reverse engineering using grammatical inference. J. Intell. Fuzzy Syst. 2017, 32, 3585–3594.
Meng, F.; Zhang, C.; Wu, G. Protocol reverse based on hierarchical clustering and probability alignment from network traces. In Proceedings of the 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA), Shanghai, China, 9–12 March 2018; pp. 443–447.
Kleber, S.; van der Heijden, R.W.; Kargl, F. Message type identification of binary network protocols using continuous segment similarity. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; pp. 2243–2252.
Yang, C.; Fu, C.; Qian, Y.; Hong, Y.; Feng, G.; Han, L. Deep learning-based reverse method of binary protocol. In Proceedings of the International Conference on Security and Privacy in Digital Economy, Quzhou, China, 30 October–1 November 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 606–624.
Wang, Y.; Bai, B.; Hei, X.; Zhu, L.; Ji, W. An unknown protocol syntax analysis method based on convolutional neural network. Trans. Emerg. Telecommun. Technol. 2021, 32, e3922.
Kiechle, V.; Börsig, M.; Nitzsche, S.; Baumgart, I.; Becker, J. PREUNN: Protocol Reverse Engineering using Neural Networks. In Proceedings of the ICISSP, Online Streaming, 9–11 February 2022; pp. 345–356.
Wang, R.; Shi, Y.; Ding, J. Reverse Engineering of Industrial Control Protocol By XGBoost with V-gram. In Proceedings of the 2020 IEEE 6th International Conference on Computer and Communications (ICCC), Chengdu, China, 11–14 December 2020; pp. 172–176.
Wang, X.; Lv, K.; Li, B. IPART: An automatic protocol reverse engineering tool based on global voting expert for industrial protocols. Int. J. Parallel Emergent Distrib. Syst. 2020, 35, 376–395.
Zhang, Z.; Zhang, Z.; Lee, P.P.; Liu, Y.; Xie, G. ProWord: An unsupervised approach to protocol feature word extraction. In Proceedings of the IEEE INFOCOM 2014-IEEE Conference on Computer Communications, Toronto, ON, Canada, 27 April–2 May 2014; pp. 1393–1401.
Cohen, P.; Adams, N.; Heeringa, B. Voting experts: An unsupervised algorithm for segmenting sequences. Intell. Data Anal. 2007, 11, 607–625.
Hewlett, D.; Cohen, P. Bootstrap voting experts. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence, Hainan, China, 25–26 April 2009.
Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. arXiv 2013, arXiv:1310.4546.
Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 37–45.
Jang, B.; Kim, M.; Harerimana, G.; Kang, S.u.; Kim, J.W. Bi-LSTM model to increase accuracy in text classification: Combining Word2vec CNN and attention mechanism. Appl. Sci. 2020, 10, 5841.
Liu, G.; Guo, J. Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 2019, 337, 325–338.
Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; Torr, P.H. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1529–1537.
Lou, H.L. Implementing the Viterbi algorithm. IEEE Signal Process. Mag. 1995, 12, 42–52.
Zong, X.; Zhang, J.; He, K. An Offensive and Defensive Exercise Platform for Industrial Control System Network Information Security. J. Shenyang Univ. Chem. Technol. 2021, 36, 296–304.
Li, H.; Shuai, B.; Wang, J.; Tang, C. Protocol reverse engineering using LDA and association analysis. In Proceedings of the 2015 11th International Conference on Computational Intelligence and Security (CIS), Shenzhen, China, 19–20 December 2015; pp. 312–316.
Wang, Y.; Yun, X.; Shafiq, M.Z.; Wang, L.; Liu, A.X.; Zhang, Z.; Yao, D.; Zhang, Y.; Guo, L. A semantics aware approach to automated reverse engineering unknown protocols. In Proceedings of the 2012 20th IEEE International Conference on Network Protocols (ICNP), Austin, TX, USA, 30 October–2 November 2012; pp. 1–10.
Lopes, R.H.; Reid, I.; Hobson, P.R. The Two-Dimensional Kolmogorov-Smirnov Test. In Proceedings of the Xi International Workshop on Advanced Computing & Analysis Techniques in Physics Research, Amsterdam, The Netherlands, 23–27 April 2007.
Zhang, Z. Improved adam optimizer for deep neural networks. In Proceedings of the 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), Banff, AB, Canada, 4–6 June 2018; pp. 1–2.
Huang, Y.; Shu, H.; Kang, F.; Guang, Y. Protocol Reverse-Engineering Methods and Tools: A Survey. Comput. Commun. 2022, 182, 238–254.
Wang, Q.; Sun, Z.; Wang, Z.; Ye, S.; Su, Z.; Chen, H.; Hu, C. A Practical Format and Semantic Reverse Analysis Approach for Industrial Control Protocols. Secur. Commun. Netw. 2021, 2021, 6690988.
Cui, W.; Kannan, J.; Wang, H.J. Discoverer: Automatic Protocol Reverse Engineering from Network Traces. In Proceedings of the USENIX Security Symposium; USENIX Association: Berkeley, CA, USA, 2007; pp. 1–14.
Bossert, G.; Guihéry, F.; Hiet, G. Towards automated protocol reverse engineering using semantic information. In Proceedings of the 9th ACM Symposium on Information, Computer and Communications Security, Kyoto, Japan, 4–6 June 2014; pp. 51–62.
Meng, F.; Liu, Y.; Zhang, C.; Li, T.; Yue, Y. Inferring protocol state machine for binary communication protocol. In Proceedings of the 2014 IEEE Workshop on Advanced Research and Technology in Industry Applications (WARTIA), Ottawa, ON, Canada, 29–30 September 2014; pp. 870–874.
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv 2019, arXiv:1906.08237. [Google Scholar]
Hu, Z.; Shi, J.; Huang, Y.; Xiong, J.; Bu, X. GANFuzz: A GAN-based industrial network protocol fuzzing framework. In Proceedings of the 15th ACM International Conference on Computing Frontiers, Ischia, Italy, 8–10 May 2018; pp. 138–145. [Google Scholar]

Figure 1. The system architecture of PRE. Statistics for output types are reflected in the word cloud.

Figure 2. The processing steps of PREIUD.

Figure 3. An example of the BVE algorithm’s voting process.

Figure 4. The overall architecture of the BiLSTM-AM-CRF model.

Figure 5. The PLC simulation platform used in the experiment.

Figure 6. Accuracy of feature extraction.

Figure 7. Conciseness of feature extraction.

Figure 8. Coverage of feature extraction.

Figure 9. Performance comparison of the three protocol reverse engineering tools.

Table 1. An overview of the methodologies and characteristics of selected representative tools.

Tool or Author	Protocol Type	Feature Extraction	Format Inference	Features
Meng F	Binary	Hierarchical clustering	Probabilistic alignment	Efficient but moderately accurate inference
NEMETYL	Binary	DBSCAN clustering	Hirschberg alignment	Field dissimilarity considered, but limited test protocols
Yang C	Binary	Sequence coding	LSTM-FCN	Deep learning and field encoding proposed for reverse engineering
Wang R	Binary (ICPs)	Progressive multi-sequence clustering	XGBoost	Effective SIEMENS protocol reversal, but low test coverage
IPART	Binary (ICPs)	Extended voting expert	Global voting expert algorithm	Able to reverse Modbus, IEC104, and Ethernet
PREIUD	Binary (ICPs)	Bootstrap voting expert	BiLSTM-AM-CRF	Combines unsupervised learning and deep neural networks for efficient reversal of most ICPs

Table 2. Summary of the industrial control protocol data sample sets.

Protocol	Source	Type	Flow
S7comm	PLC: SIEMENS S7-300, SIEMENS S7-1200	request, response, read, write, upload, download, run, stop	42,920
Fins	PLC: OMRON CP1L	read, multiple read, transfer, write, upload, download, run, stop	5464
Modbus/TCP	PLC: Rockwell MicroLogix 1400, Emerson CPE100; Application: pymodbus	read and write registers/coils, report slave, unknown function	35,324
Ethernet/IP	PLC: AB CompactLogix L30ER, MITSUBISHI FX5U32M	send/reserved data	16,782
DNP3	Application: Gec-dnp3	File_read, file_list_directory, full_exchange	4538
IEC104	Packet: lib60870	U-format	5381

Table 3. Parameter configurations of our proposed approach.

Parameter	Value
Field embedding size	50
Feature embedding size	50
Size of LSTM hidden unit	200
Mini-batch size	64
Learning rate	0.001
Dropout rate	0.3
Time steps	100
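
The settings in Table 3 can be gathered into a single object so that a training script reads them from one place. The sketch below is illustrative only: the class name `TrainingConfig`, its field names, and the `bilstm_output_size` helper are hypothetical, and the helper assumes that the hidden size of 200 applies to each direction of the BiLSTM, which Table 3 does not state.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters copied from Table 3 (field names are hypothetical)."""
    field_embedding_size: int = 50
    feature_embedding_size: int = 50
    lstm_hidden_units: int = 200
    batch_size: int = 64
    learning_rate: float = 0.001
    dropout_rate: float = 0.3
    time_steps: int = 100

    @property
    def bilstm_output_size(self) -> int:
        # Assumption: forward and backward states are concatenated,
        # and 200 is the hidden size of each direction.
        return 2 * self.lstm_hidden_units


config = TrainingConfig()
print(asdict(config))
print("BiLSTM output size (assumed):", config.bilstm_output_size)
```

Freezing the dataclass keeps the values from being changed by accident during a run; a different experiment would construct a new instance with overridden fields.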

Table 4. An example of format and semantic tag sequence generation.

Field Sequence	00	15	00	00	00	06	ff	04	01	f4	00	64
Wireshark results	Transaction ID		Protocol ID		Length		Unit ID	Function code: read	Reference number		Word count
Semantic tag	Transaction ID		Protocol: Modbus/TCP		Length		Unit ID	Function code: read	Data
Offset tag	00	01	02	03	04	05	06	07	08	09	10	11
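
To make the alignment in Table 4 concrete, the sketch below expands the field layout of the example Modbus/TCP read request into one semantic tag and one offset tag per byte. It hard-codes the known Modbus/TCP layout purely for illustration and is not the authors' labeling pipeline or the inference model; the names `LAYOUT` and `tag_read_request` are hypothetical.

```python
import struct

# Field layout of the example read request: (semantic tag, width in bytes).
# Tag names follow the "Semantic tag" row of Table 4.
LAYOUT = [
    ("Transaction ID", 2),
    ("Protocol: Modbus/TCP", 2),
    ("Length", 2),
    ("Unit ID", 1),
    ("Function code: read", 1),
    ("Data", 4),
]


def tag_read_request(frame: bytes):
    """Return (offset tags, semantic tags, decoded values) for a 12-byte frame."""
    if len(frame) != 12:
        raise ValueError("expected a 12-byte Modbus/TCP read request")
    # Big-endian: three 16-bit header fields, two single bytes, two 16-bit data fields.
    tid, pid, length, unit, func, ref, count = struct.unpack(">HHHBBHH", frame)
    offsets = [f"{i:02d}" for i in range(len(frame))]
    # Repeat each tag once per byte so tags and bytes line up one-to-one.
    tags = [name for name, width in LAYOUT for _ in range(width)]
    values = {
        "transaction_id": tid,
        "protocol_id": pid,
        "length": length,
        "unit_id": unit,
        "function_code": func,
        "reference_number": ref,
        "word_count": count,
    }
    return offsets, tags, values


frame = bytes.fromhex("00 15 00 00 00 06 ff 04 01 f4 00 64")
offsets, tags, values = tag_read_request(frame)
for off, byte, tag in zip(offsets, frame, tags):
    print(off, f"{byte:02x}", tag)
print(values)
```

The Length field decodes to 6, which matches the six bytes that follow it (unit ID, function code, and four data bytes).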

Table 5. Assignment of the weight vector.

Evaluation Factor	Precision	Recall	F1-Score	Conciseness	Efficiency
Weight	0.3	0.25	0.25	0.1	0.1
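
A weight vector like the one in Table 5 is typically applied as a weighted sum of the evaluation factors. The sketch below assumes that form; the function name `weighted_sum` is hypothetical. It covers only the three accuracy factors, using the PREIUD totals reported in Table 6, because this section does not state how Conciseness and Efficiency are normalized. The result is therefore a partial score and is not expected to match Table 7.

```python
import math

# Weights from Table 5.
WEIGHTS = {
    "precision": 0.3,
    "recall": 0.25,
    "f1": 0.25,
    "conciseness": 0.1,
    "efficiency": 0.1,
}


def weighted_sum(weights, factors):
    """Sum weight * value over the factors supplied (assumed scoring form)."""
    unknown = set(factors) - set(weights)
    if unknown:
        raise KeyError(f"no weight defined for: {sorted(unknown)}")
    return sum(weights[name] * value for name, value in factors.items())


# The weights form a convex combination.
assert math.isclose(sum(WEIGHTS.values()), 1.0)

# PREIUD totals for the accuracy factors (Table 6, in percent).
preiud_accuracy = {"precision": 85.2, "recall": 85.1, "f1": 85.1}
partial = weighted_sum(WEIGHTS, preiud_accuracy)
print(f"accuracy part of the combined score: {partial:.2f}")
```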

Table 6. Performance comparison results of the three protocol reverse engineering tools (PRE: PREIUD; MSE: MSERA; DIS: Discoverer).

Protocol	Precision (%)			Recall (%)			F1-Score (%)			Conciseness			Efficiency
	PRE	MSE	DIS	PRE	MSE	DIS	PRE	MSE	DIS	PRE	MSE	DIS	PRE	MSE	DIS
S7comm	86.8	81.2	55.6	84.2	83.8	67.3	85.5	82.5	60.9	2.5	2	0.88	0.3	0.15	0.1
Fins	81.2	69.4	43.8	80.2	73.5	52.8	80.7	71.4	47.9	3	2.63	0.67	0.3	0.1	0.1
Modbus	91.4	82.4	83.7	92.5	85.6	86.2	91.9	84.0	84.9	1.5	1.33	1.2	0.3	0.1	0.1
Ethernet	88.2	81.1	76.8	89.6	78.2	74.5	88.9	79.6	75.6	1.33	1.5	1.33	0.3	0.1	0.1
DNP3	84.1	79.4	82.5	83.3	82.5	81.9	83.7	80.9	82.2	1.67	0.88	1.12	0.3	0.15	0.1
IEC104	79.3	73.5	76.1	80.6	78.2	80.4	79.9	75.8	78.2	2	1.5	0.3	0.15	0.1	0.1
Total(%)	85.2	77.8	69.8	85.1	80.3	73.9	85.1	79.0	71.6	0.17	0.33	0.66	0	0.5	0.83
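
The F1-Score columns in Table 6 can be cross-checked against the Precision and Recall columns. The sketch below assumes the standard definition, F1 = 2PR / (P + R), and uses the PREIUD values from the table; the names `f1_score` and `PREIUD_ROWS` are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# PREIUD rows from Table 6: protocol -> (precision %, recall %, reported F1 %).
PREIUD_ROWS = {
    "S7comm": (86.8, 84.2, 85.5),
    "Fins": (81.2, 80.2, 80.7),
    "Modbus": (91.4, 92.5, 91.9),
    "Ethernet": (88.2, 89.6, 88.9),
    "DNP3": (84.1, 83.3, 83.7),
    "IEC104": (79.3, 80.6, 79.9),
}

for protocol, (p, r, reported) in PREIUD_ROWS.items():
    computed = f1_score(p, r)
    # Reported values are rounded to one decimal place.
    status = "ok" if abs(computed - reported) < 0.051 else "mismatch"
    print(f"{protocol:9s} computed={computed:.2f} reported={reported} {status}")
```

Under this definition, each recomputed value agrees with the reported PREIUD F1-Score to within rounding.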

Table 7. Combined score of three protocol reverse engineering tools.

Tool	PREIUD	MSERA	Discoverer
Combined score	70.15	64.35	60.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

