CN103441990B

CN103441990B - The automatic estimating method of protocol state machine based on state fusion

Info

Publication number: CN103441990B
Application number: CN201310348136.7A
Authority: CN
Inventors: 洪征; 吴礼发; 郑成辉; 潘璠; 赖海光; 李华波; 周振吉
Original assignee: PLA University of Science and Technology
Current assignee: PLA University of Science and Technology
Priority date: 2013-08-09
Filing date: 2013-08-09
Publication date: 2016-03-30
Anticipated expiration: 2033-08-09
Also published as: CN103441990A

Abstract

The present invention relates to the field of network technology, and provides an automatic state machine inference method based on state fusion, including the following steps: message format extraction and message classification, session abstraction and initial state machine construction, and state fusion based on output messages . The protocol state machine automatic inference method of the present invention adopts the enhanced prefix tree converter EPTT to describe the session process of the protocol entity, focuses on the output message of the protocol, integrates the same state in the state machine, and conducts test interaction with the protocol entity Verify the feasibility of protocol state fusion, ensure the automation of protocol state machine inference, and improve the accuracy of inference results.

Description

Automatic inference method of protocol state machine based on state fusion

技术领域technical field

本发明涉及网络技术领域，具体而言涉及一种依据协议实体程序接收及发送的网络报文，自动化推断相应网络协议的协议状态机的方法。The invention relates to the field of network technology, in particular to a method for automatically inferring a protocol state machine of a corresponding network protocol based on network messages received and sent by a protocol entity program.

背景技术Background technique

网络协议是网络通信功能实现的支撑要素，也是网络安全领域的重点研究对象。入侵检测、模糊测试、协议重用、协议脆弱性分析等大量网络安全技术都以详尽的协议规范信息为基础。Network protocols are the supporting elements for the realization of network communication functions, and are also key research objects in the field of network security. A large number of network security technologies such as intrusion detection, fuzz testing, protocol reuse, and protocol vulnerability analysis are based on detailed protocol specification information.

网络中使用了大量缺乏描述文档的私有协议，这使得各类依赖于信息规范的网络安全技术在应用范围上受到极大限制。为了解决协议信息未知的问题，研究人员开始采用协议逆向的方法获取未知的协议规范。协议逆向是指在不依赖于协议描述的情况下，通过对协议实体的网络输入输出、系统行为和指令执行流程进行监控和分析，提取网络协议具体规范信息的过程。A large number of private protocols lacking description documents are used in the network, which greatly limits the application range of various network security technologies that rely on information specifications. In order to solve the problem of unknown protocol information, researchers began to use the protocol reverse method to obtain unknown protocol specifications. Protocol reverse refers to the process of extracting specific specification information of network protocols by monitoring and analyzing the network input and output, system behavior and instruction execution process of protocol entities without relying on the protocol description.

网络协议规范主要包括协议格式和协议状态机两部分。协议格式关注的是通信报文中各协议域的组成和结构。协议状态机关注的是协议系统中的协议状态数量以及协议系统在接收不同输入的情况下从一个协议状态向另外一个协议状态迁移的规则。The network protocol specification mainly includes two parts: the protocol format and the protocol state machine. The protocol format focuses on the composition and structure of each protocol field in the communication message. The protocol state machine focuses on the number of protocol states in the protocol system and the rules for the protocol system to transition from one protocol state to another protocol state when it receives different inputs.

传统的协议逆向采用人工方式，过程冗长耗时，准确度依赖于分析人员的技术水平和实践经验。随着网络规模的扩大和协议种类的增多，对逆向分析准确度和时效性的要求越来越高，传统人工方式的协议逆向分析已不能满足实际应用的需要。协议自动逆向可以显著减少人工分析，提高私有协议的分析效率，获得了越来越多的重视。Traditional protocols are reversed manually, the process is tedious and time-consuming, and the accuracy depends on the technical level and practical experience of the analysts. With the expansion of the network scale and the increase of the types of protocols, the requirements for the accuracy and timeliness of the reverse analysis are getting higher and higher, and the traditional manual reverse analysis of the protocol can no longer meet the needs of practical applications. Automatic protocol reverse can significantly reduce manual analysis and improve the analysis efficiency of private protocols, which has gained more and more attention.

目前大部分协议自动逆向研究集中于协议格式的提取，分析结果中缺乏协议状态机信息，制约了协议逆向结果的实际应用。近年来，随着协议格式提取技术的相对成熟，一些研究人员开始尝试对协议状态机进行逆向分析。目前的协议状态机推断主要存在以下问题：（1）已有的状态融合方法（如Prospex系统）出于简单性的考虑，针对的状态机模型是无输出的有限状态机。这种有限状态机中，只存在报文输入，而不考虑报文输出，忽视了协议系统输入输出报文之间的内在联系。协议系统是带输出的状态迁移系统，这种简单化的处理使得状态融合得到的状态机与实际协议系统存在较大差异。（2）为了解决样本集不完备的问题，在协议状态机推断过程中往往需要不断产生新样本，并根据新样本是否隶属于协议状态机，实施进一步推断。新样本对于协议状态机而言是正例还是反例，依赖于人工判定。人工判定的处理方式一方面难以保证准确度，另一方面，这种处理方式自动化程度低，制约了逆向分析的效率。At present, most of the protocol automatic reverse research focuses on the extraction of the protocol format, and the lack of protocol state machine information in the analysis results restricts the practical application of the protocol reverse results. In recent years, with the relative maturity of the protocol format extraction technology, some researchers have begun to try to reverse the protocol state machine. The current protocol state machine inference mainly has the following problems: (1) For the sake of simplicity, the existing state fusion methods (such as the Prospex system) aim at a finite state machine with no output. In this finite state machine, there is only message input, and no message output is considered, ignoring the internal relationship between the input and output messages of the protocol system. The protocol system is a state transition system with output. This simplification makes the state machine obtained by state fusion quite different from the actual protocol system. (2) In order to solve the problem of incomplete sample set, it is often necessary to continuously generate new samples during the inference process of the protocol state machine, and further infer according to whether the new sample belongs to the protocol state machine. Whether a new sample is positive or negative for the protocol state machine depends on manual judgment. On the one hand, it is difficult to ensure the accuracy of the manual judgment processing method. On the other hand, this processing method has a low degree of automation, which restricts the efficiency of reverse analysis.

发明内容Contents of the invention

针对现有技术中存在的问题，本发明旨在提供一种基于状态融合的协议状态机自动推断方法，针对未知协议的协议状态机推断问题，在已有的报文协议格式推断技术的基础上，依据收集的报文样本构造增强前缀树转换器EPTT(ExtendedPrefixTreeTransducer)描述协议实体在会话过程涉及的输入、输出报文组成的抽象符号串，并通过与协议实体的测试性交互判定状态融合的可行性，确保了状态机推断的自动化，提高了推断结果的准确度。Aiming at the problems existing in the prior art, the present invention aims to provide a method for automatically inferring the protocol state machine based on state fusion, aiming at the problem of inferring the protocol state machine of an unknown protocol, on the basis of the existing packet protocol format inference technology According to the collected message samples, construct an enhanced prefix tree converter EPTT (ExtendedPrefixTreeTransducer) to describe the abstract symbol string composed of the input and output messages involved in the session process of the protocol entity, and determine the feasibility of state fusion through the test interaction with the protocol entity It ensures the automation of state machine inference and improves the accuracy of inference results.

为达成上述目的，本发明所采用的技术方案如下：In order to achieve the above object, the technical scheme adopted in the present invention is as follows:

一种基于状态融合的协议状态机自动推断方法，包括以下步骤：A method for automatically inferring a protocol state machine based on state fusion, comprising the following steps:

（1）报文格式提取与报文分类：获取协议实体程序相关的输入、输出报文的具体格式信息，并依据报文格式分别对输入、输出报文分类，将结构相同的报文样本归为一类，以抽象符号表示分类的类别信息；(1) Message format extraction and message classification: obtain the specific format information of the input and output messages related to the protocol entity program, and classify the input and output messages according to the message format, and group the message samples with the same structure into is a category, and the category information of the classification is represented by an abstract symbol;

（2）会话抽象与初始状态机构建：基于抽象符号表示的分类类别，以会话为单位，对网络通信行为进行抽象，将会话过程中的输入输出报文序列描述为抽象的输入输出符号串，进而依据会话样本集，构建与输入输出符号串集合一致的初始状态机；(2) Session abstraction and initial state machine construction: Based on the classification categories represented by abstract symbols, the network communication behavior is abstracted with sessions as units, and the input and output message sequences in the session process are described as abstract input and output symbol strings. Then, according to the session sample set, construct an initial state machine consistent with the set of input and output symbol strings;

（3）基于输出报文的状态融合：依据相似度高低对候选状态进行融合，并生成测试符号串，再通过自动化的测试，比较协议实体与融合后的状态机在接收到测试符号串后做出的输出响应，验证状态融合的可行性；(3) State fusion based on the output message: The candidate states are fused according to the similarity, and a test symbol string is generated, and then through an automated test, the protocol entity is compared with the fused state machine after receiving the test symbol string. output response to verify the feasibility of state fusion;

（4）重复上述步骤（3）直到状态机中不再有符合融合条件的状态；(4) Repeat the above step (3) until there are no states that meet the fusion conditions in the state machine;

前述会话抽象与初始状态机构建阶段的工作流程如下：状态机的推断以会话样本集为基础构建，将会话中输入、输出报文以其所在类别对应的抽象符号表示，从而把完整会话的输入输出报文序列转化为抽象的输入输出符号串；在此基础上，依据会话样本集，采用增强前缀树转换器EPTT的形式构造初始状态机，初始状态机中包含了所有作为会话样本的输入输出符号串；The workflow of the aforementioned session abstraction and initial state machine construction phase is as follows: the inference of the state machine is built on the basis of the session sample set, and the input and output messages in the session are represented by abstract symbols corresponding to their categories, so that the input of the complete session The output message sequence is converted into an abstract input and output symbol string; on this basis, according to the session sample set, the initial state machine is constructed in the form of an enhanced prefix tree converter EPTT, and the initial state machine contains all input and output as session samples symbol string;

前述基于输出报文的状态融合阶段的工作流程如下：在初始状态机的基础上，依据相似度的高低每次对两个相似状态进行状态融合，相似状态的选择以BlueFringe算法为基础，选择相似度最高的两个状态作为待融合的候选状态；候选状态的融合是否可行，将依据生成的测试字符串进行判定，判断两个候选状态是否能够融合，其中：测试字符串基于原始状态机中到达两个候选状态的字符串前缀和字符串后缀构建，通过交叉拼接的方式，将到达某一个候选状态的所有字符串前缀与另外一个候选状态的所有字符串后缀依次拼接，生成的字符串组成测试字符串集合，如果判定所有输出字符串均与融合后的协议状态机一致，则认为状态融合是可行的，否则判断状态融合不可行；将测试结果作为新样本加入会话样本集并扩展协议状态机，并继续尝试对其他状态进行融合。The workflow of the state fusion stage based on the output message is as follows: on the basis of the initial state machine, two similar states are fused each time according to the degree of similarity, and the selection of similar states is based on the BlueFringe algorithm. The two states with the highest degrees are the candidate states to be fused; whether the fusion of the candidate states is feasible or not will be judged based on the generated test string to determine whether the two candidate states can be fused, wherein: the test string is based on the arrival of the original state machine The string prefixes and string suffixes of the two candidate states are constructed. Through cross splicing, all the string prefixes reaching a certain candidate state are sequentially spliced with all the string suffixes of the other candidate state, and the generated strings form a test String collection, if it is judged that all the output strings are consistent with the fused protocol state machine, it is considered that the state fusion is feasible, otherwise it is judged that the state fusion is not feasible; the test results are added to the session sample set as a new sample and the protocol state machine is extended , and continue to try to fuse other states.

进一步，前述方法中，在协议状态机中选择待融合的候选状态时，以BlueFringe算法为基础，相似度的计算依据协议状态的公共输入字符串后缀，公共输入字符串后缀反映协议实体在处于两个不同协议状态时，接收到相同输入报文时的状态转换和输出响应情况，其中：输入字符串后缀是指协议实体从某一协议状态开始，接收到一系列的输入报文，这些输入报文在状态机中被表示为输入字符串后缀；公共输入字符串后缀是指两个不同状态接收到相同的一系列输入报文；相似度的计算将考虑协议状态的公共输入字符串后缀的长短，以及对于接收到相同输入时协议实体是否产生相同的输出结果，如果两个协议状态，它们公共输入字符串后缀的长度最长，且对于相同输入有相同的输出，这样的两个协议状态将优先尝试融合。Further, in the aforementioned method, when selecting a candidate state to be fused in the protocol state machine, the BlueFringe algorithm is used as the basis, and the calculation of the similarity is based on the public input string suffix of the protocol state, which reflects that the protocol entity is in two states. In different protocol states, the state transition and output response when receiving the same input message, where: the input string suffix means that the protocol entity starts from a certain protocol state and receives a series of input messages, these input messages The text is expressed as an input string suffix in the state machine; the common input string suffix means that two different states receive the same series of input messages; the calculation of the similarity will consider the length of the common input string suffix of the protocol state , and whether the protocol entity produces the same output when receiving the same input, if two protocol states have the longest length of their common input string suffix, and have the same output for the same input, such two protocol states will Prioritize fusion attempts.

进一步，前述方法中，利用测试字符串集合进行状态融合与可行性判定的过程，包括以下步骤：首先依据测试字符串和已知的报文协议格式，生成作为测试用例的输入报文序列；将输入报文序列发向协议实体程序，获取作为响应的输出报文序列；对输出报文序列进行抽象，将其表示为输出字符串序列；针对候选状态融合后的协议状态机，判断输出字符串序列中的输出字符是否都存在于协议状态机对应状态的输出符号集中：如果所有输出字符串均与融合后的协议状态机一致，则认为状态融合是可行的，否则判断状态融合不可行。Further, in the aforementioned method, the process of using the test character string set to perform state fusion and feasibility determination includes the following steps: first, according to the test character string and the known message protocol format, generate an input message sequence as a test case; The input message sequence is sent to the protocol entity program to obtain the output message sequence as a response; the output message sequence is abstracted and expressed as an output string sequence; the output string is judged for the protocol state machine after candidate state fusion Whether the output characters in the sequence exist in the output symbol set of the corresponding state of the protocol state machine: if all the output strings are consistent with the fused protocol state machine, then the state fusion is considered feasible, otherwise it is judged that the state fusion is not feasible.

由以上本发明的技术方案可知，本发明的有益效果在于将协议系统视为带输出的状态迁移系统，着眼于协议系统中输入输出报文之间的内在联系对协议状态机中的状态实施融合，通过采用增强前缀树转换器EPTT(ExtendedPrefixTreeTransducer)描述协议实体在会话过程涉及的输入、输出报文组成的抽象符号串，并通过与协议实体的测试性交互判定状态融合的可行性，有助于确保构建的协议状态机与实际协议系统高度吻合，确保了状态机推断的自动化，提高推断结果的准确度；而且针对待融合的候选状态自动化产生辅助判定的输入样本，进而通过协议实体程序的自动化运行掌握协议实体对于输入样本的输出响应，避免了人工判定的低效，提高了状态机逆向推断的准确度和整体效率。As can be seen from the technical solution of the present invention above, the beneficial effect of the present invention is that the protocol system is regarded as a state transition system with output, focusing on the internal relationship between the input and output messages in the protocol system and implementing fusion of the states in the protocol state machine , by using the enhanced prefix tree converter EPTT (ExtendedPrefixTreeTransducer) to describe the abstract symbol string composed of the input and output messages involved in the session process of the protocol entity, and through the test interaction with the protocol entity to determine the feasibility of state fusion, it will help Ensure that the constructed protocol state machine is highly consistent with the actual protocol system, ensure the automation of state machine inference, and improve the accuracy of inference results; and automatically generate auxiliary judgment input samples for the candidate states to be fused, and then through the automation of the protocol entity program The operation masters the output response of the protocol entity to the input sample, avoids the inefficiency of manual judgment, and improves the accuracy and overall efficiency of the reverse inference of the state machine.

附图说明Description of drawings

图1为本发明的自动推断方法的整体实现流程示意图。FIG. 1 is a schematic diagram of the overall implementation flow of the automatic inference method of the present invention.

图2是本发明中基于抽象字符串序列构建EPTT状态机的示例。Fig. 2 is an example of constructing an EPTT state machine based on an abstract string sequence in the present invention.

图3是本发明中对候选状态进行融合的示例。Fig. 3 is an example of fusion of candidate states in the present invention.

具体实施方式detailed description

为了更了解本发明的技术内容，特举具体实施例并配合附图说明如下。In order to better understand the technical content of the present invention, specific embodiments are given and described as follows with reference to the accompanying drawings.

如图1所示，根据本发明的较优实施例，基于状态融合的协议状态机自动推断方法，包括以下步骤：As shown in Figure 1, according to a preferred embodiment of the present invention, the automatic inference method of the protocol state machine based on state fusion includes the following steps:

（1）报文格式提取与报文分类：首先大量收集输入输出报文序列，进而采用现有的报文格式提取方法，获取协议实体程序相关的输入、输出报文的具体格式信息，在此基础上，依据报文格式分别对输入、输出报文分类，将结构相同的报文样本归为一类，以抽象符号表示分类的类别信息；(1) Message format extraction and message classification: First, a large number of input and output message sequences are collected, and then the existing message format extraction method is used to obtain the specific format information of the input and output messages related to the protocol entity program. Here Basically, according to the message format, the input and output messages are classified respectively, and the message samples with the same structure are classified into one category, and the category information of the classification is represented by abstract symbols;

（2）会话抽象与初始状态机构建：在报文分类的基础上，以会话为单位，对网络通信行为进行抽象，将会话过程中的输入输出报文序列描述为抽象的输入输出符号串，进而依据会话样本集，构建与输入输出符号串集合一致的初始状态机；(2) Conversation abstraction and initial state machine construction: On the basis of message classification, the network communication behavior is abstracted with sessions as units, and the input and output message sequences in the session process are described as abstract input and output symbol strings. Then, according to the session sample set, construct an initial state machine consistent with the set of input and output symbol strings;

（4）重复上述步骤（3）直到状态机中不再有符合融合条件的状态。(4) Repeat the above step (3) until there are no more states that meet the fusion conditions in the state machine.

其中，前述会话抽象与初始状态机构建阶段的工作流程如下：状态机的推断以会话样本集为基础构建，将会话中输入、输出报文以其所在类别对应的抽象符号表示，从而把完整会话的输入输出报文序列转化为抽象的输入输出符号串；在此基础上，依据会话样本集，采用增强前缀树转换器EPTT的形式构造初始状态机，初始状态机中包含了所有作为会话样本的输入输出符号串；Among them, the workflow of the aforementioned session abstraction and initial state machine construction phase is as follows: the inference of the state machine is built on the basis of the session sample set, and the input and output messages in the session are represented by abstract symbols corresponding to their categories, so that the complete session On this basis, according to the session sample set, the initial state machine is constructed in the form of enhanced prefix tree converter EPTT, and the initial state machine contains all session samples input and output strings;

参考图1所示的整体实现流程并结合图2、3所示，本实施例的协议状态机自动推断方法包括报文格式提取与报文分类、会话抽象与初始状态机构建以及基于输出报文的状态融合三个部分，具体的实施方式以下分别说明。With reference to the overall implementation process shown in Figure 1 and in combination with Figures 2 and 3, the automatic inference method of the protocol state machine in this embodiment includes message format extraction and message classification, session abstraction and initial state machine construction, and based on output message The state of the fusion of three parts, the specific implementation is described below.

（1）报文格式提取与报文分类(1) Message format extraction and message classification

本发明实施例首先大量收集协议实体程序网络通信产生的输入输出报文序列，并采用PI项目（ProtocolInformationProject）的报文格式提取方法获取输入输出报文的具体格式信息。在此基础上，依据报文格式分别对输入报文和输出报文进行分类，如果几个报文样本具有相同的报文结构，则将它们归为一类。对于每一类别，使用唯一的阿拉伯数字（如1、2、3）进行标识。The embodiment of the present invention firstly collects a large number of input and output message sequences generated by the network communication of the protocol entity program, and uses the message format extraction method of the PI project (Protocol Information Project) to obtain the specific format information of the input and output messages. On this basis, classify the input message and the output message according to the message format, and if several message samples have the same message structure, they will be classified into one category. For each category, use unique Arabic numerals (such as 1, 2, 3) for identification.

（2）会话抽象与初始状态机构建(2) Session abstraction and initial state machine construction

在报文分类的基础上，以会话为单位，对网络通信行为进行抽象。On the basis of packet classification, the network communication behavior is abstracted in units of sessions.

会话表示通信参与者之间进行的一次完整数据交换，能够反映在通信过程中协议状态的迁移情况。网络协议研究领域，已经有不少成熟的识别网络会话的方法。上层应用使用下层协议提供的服务。如果是基于TCP协议的网络应用，一次会话往往由TCP协议的三次握手开始，当TCP连接中断时停止；如果是基于UDP协议的网络应用，一次会话往往由通信的间隔时间来区分，如果通信双方停止通信的时间超过特定时长，则推断一次会话已经完成。A session represents a complete data exchange between communication participants, which can reflect the transition of the protocol state during the communication process. In the field of network protocol research, there are many mature methods for identifying network sessions. Upper-layer applications use services provided by lower-layer protocols. If it is a network application based on the TCP protocol, a session is often started by the three-way handshake of the TCP protocol, and stops when the TCP connection is interrupted; if it is a network application based on the UDP protocol, a session is often distinguished by the communication interval. If the time of stopping communication exceeds a certain period of time, it is inferred that a session has been completed.

在会话抽象的过程中，采用输入和输出报文类别代替具体的报文信息，并依据报文出现的时序构建字符串序列。In the process of session abstraction, the input and output message categories are used to replace the specific message information, and the string sequence is constructed according to the time sequence of the message.

例如，某一次会话被表示为字符串序列（<1，2，5>，<1，3，6>），其中<1，2，5>表示输入字符串序列，该序列中的阿拉伯数字1的意思是第一个输入报文属于输入报文类别1，数字2的意思是第二个输入报文属于输入报文类别2；<1，3，6>表示输出字符串序列，其中数字1的意思是第一个输出报文属于输出报文类别1，数字3的意思是第二个输出报文属于输出报文类别3，以此类推。该会话的含义是协议实体接收到某一个类别（输入报文类别）为1的报文时，输出了某一个类别（输出报文类别）为1的报文，同时进入一个新的协议状态；在处于新的协议状态时，协议实体接收到某一个类别（输入报文类别）为2的报文，输出了某一个类别（输出报文类别）为3的报文，进入了另外一个协议状态；在此协议状态，协议实体又接收到某一个类别（输入报文类别）为5的报文，输出了某一个类别（输出报文类别）为6的报文，再次进行了状态转换。For example, a session is expressed as a string sequence (<1, 2, 5>, <1, 3, 6>), where <1, 2, 5> represents the input string sequence, and the Arabic numeral 1 in the sequence means that the first input message belongs to the input message category 1, and the number 2 means that the second input message belongs to the input message category 2; <1, 3, 6> indicates the output string sequence, where the number 1 means that the first output message belongs to output message category 1, the number 3 means that the second output message belongs to output message category 3, and so on. The meaning of this session is that when the protocol entity receives a message with a certain category (input message category) as 1, it outputs a message with a certain category (output message category) as 1, and enters a new protocol state at the same time; When in a new protocol state, the protocol entity receives a message with a category (input message category) of 2, outputs a message with a category (output message category) of 3, and enters another protocol state ; In this protocol state, the protocol entity receives a message with a category (input message category) of 5, outputs a message with a category (output message category) of 6, and performs a state transition again.

在将输入输出报文样本进行会话抽象，转化为字符串序列集合后，开始构建初始的协议状态机。本发明采用增强前缀树转换器EPTT的形式构造状态机，其优势在于能够准确描述协议实体的状态输出信息，所构造的带输出的协议状态机与实际网络协议情况更为贴近。After the session abstraction of the input and output message samples is converted into a string sequence set, the initial protocol state machine is built. The invention adopts the form of enhanced prefix tree converter EPTT to construct the state machine, which has the advantage of being able to accurately describe the state output information of the protocol entity, and the constructed protocol state machine with output is closer to the actual network protocol situation.

EPTT形式的协议状态机被定义为6元组（Q_E，I，O，δ_E，λ_E，q_λ），其中Q_E代表状态集合，I代表输入符号集合，O代表输出符号集合，δ_E代表状态转移函数，λ_E代表输出函数，q_λ代表初始的协议状态。The protocol state machine in the form of EPTT is defined as a 6-tuple (Q _E , I, O, δ _E , λ _E , q _λ ), where Q _E represents the set of states, I represents the set of input symbols, O represents the set of output symbols, and δ _E represents the state transition function, λ _E represents the output function, and q _λ represents the initial protocol state.

构造EPTT协议状态机时，依次将会话样本加入状态机。图2为本发明中基于抽象字符串序列构建EPTT状态机的一个示例。对于某一会话的字符串序列，采用遍历的形式，将输入字符串序列和输出字符串序列结合在一起分析。例如，对于图2中第一个会话样本，输入字符串序列<1，2，5>和输出字符串序列<1，3，6>将结合在一起同步分析，反映出输入字符与输出字符的对应关系。When constructing the EPTT protocol state machine, session samples are added to the state machine in turn. FIG. 2 is an example of constructing an EPTT state machine based on an abstract string sequence in the present invention. For a string sequence of a certain session, the input string sequence and the output string sequence are combined for analysis in the form of traversal. For example, for the first conversation sample in Figure 2, the input string sequence <1, 2, 5> and the output string sequence <1, 3, 6> will be combined and analyzed synchronously, reflecting the Correspondence.

遍历的过程中将基于输入字符串序列构造前缀符号串，描述已经遍历过的输入字符。初始状态设置为状态0，协议状态机由初始状态开始接收输入。前缀符号串在初始时被设置为λ，表示前缀符号串目前为空。输入字符串序列中的字符依次加入前缀符号串。如果前缀符号串所到达的状态在原状态机中没有，则创建一个新状态，以阿拉伯数字唯一标识。如果相应的状态转移信息在原状态机中不存在，则扩充状态转移函数和输出函数；如果原状态机中已包含相应的状态转移信息，将进一步判断是否需要扩充输出信息。During the traversal process, a prefix symbol string will be constructed based on the input string sequence to describe the input characters that have been traversed. The initial state is set to state 0, and the protocol state machine starts to receive input from the initial state. The prefix symbol string is initially set to λ, indicating that the prefix symbol string is currently empty. The characters in the input string sequence are sequentially added to the prefix symbol string. If the state reached by the prefix symbol string does not exist in the original state machine, create a new state and uniquely identify it with Arabic numerals. If the corresponding state transition information does not exist in the original state machine, then expand the state transition function and output function; if the original state machine already contains the corresponding state transition information, it will further judge whether to expand the output information.

例如，对于图2中第一个会话样本，遍历过程遇到的第一个输入字符为1，形成前缀符号串λ1，需要创建一个新的协议状态，以数字1标识此状态。同时扩充状态转移函数（协议实体在处于状态0时，接收到输入字符1，转移到状态1），以及输出函数（协议实体在处于状态0时，接收到输入字符1，产生输出字符1）。遍历过程遇到的第二个输入字符为2，形成前缀符号串λ12，创建一个以数字2标识的新状态，同时扩充状态转移函数（协议实体在处于状态1时，接收到输入字符2，转移到状态2），以及输出函数（协议实体在处于状态1时，接收到输入字符2，产生输出字符3）。对于图2中第二个会话样本，由于涉及的状态转移与第一个会话样本相同，因此不会产生新状态。但是依据第二个会话样本，协议实体在处于状态1时接收到输入字符2，产生的输出字符为4。该输出信息将扩充到原协议状态机中，即协议实体在处于状态1时接收到输入字符2，产生的输出字符隶属于输出字符集合{3，4}。For example, for the first session sample in Figure 2, the first input character encountered in the traversal process is 1, forming a prefix symbol string λ1, and a new protocol state needs to be created, and the number 1 identifies this state. At the same time, expand the state transition function (when the protocol entity is in state 0, it receives input character 1 and transfers to state 1), and the output function (when the protocol entity is in state 0, it receives input character 1 and generates output character 1). The second input character encountered in the traversal process is 2, forming a prefix symbol string λ12, creating a new state identified by the number 2, and expanding the state transition function (when the protocol entity is in state 1, the input character 2 is received, transfer to state 2), and the output function (when the protocol entity is in state 1, it receives input character 2 and produces output character 3). For the second session sample in Figure 2, since the state transition involved is the same as that of the first session sample, no new state will be generated. But according to the second conversation sample, the protocol entity in state 1 receives input character 2, which produces output character 4. The output information will be extended to the original protocol state machine, that is, the protocol entity receives input character 2 when it is in state 1, and the output character generated belongs to the output character set {3, 4}.

在遍历完所有会话样本后，将得到初始的协议状态机。初始协议状态机的构建方法是将所有会话样本直接加入，没有进行任何区分和甄别，因此，构建得到的状态机结果往往包含了大量的冗余状态，需要通过融合同类状态的方法加以化简，得到的协议状态机才更具实用价值。After traversing all session samples, the initial protocol state machine will be obtained. The construction method of the initial protocol state machine is to directly add all session samples without any distinction and screening. Therefore, the state machine results obtained by construction often contain a large number of redundant states, which need to be simplified by fusing similar states. The obtained protocol state machine has more practical value.

（3）基于输出报文的状态融合(3) State fusion based on output messages

在构建完初始状态机以后，将依据相似度的高低，尝试对状态机中的相似状态进行状态融合。本实施例中相似状态的选择以BlueFringe算法为基础，但对其中相似度的计算方法进行了改进。评价两个协议状态的相似度，采用的依据是公共输入字符串后缀。输入字符串后缀指的是协议实体从某一协议状态开始，接收到一系列的输入报文，这些输入报文在状态机中被表示为输入字符串后缀。公共输入字符串后缀中的“公共”强调的是两个不同协议状态接收到相同的一系列输入报文。公共输入字符串后缀能够反映协议实体在处于两个不同协议状态时，接收到相同输入报文时的状态转换和输出响应情况。在对相似度进行计算时，将考虑协议状态的公共输入字符串后缀的长短，以及对于接收到的相同输入，协议实体是否有相同的输出结果。如果两个协议状态，它们公共输入字符串后缀的长度最长，且对于相同输入有相同的输出，这样的两个协议状态将作为候选状态优先尝试融合。After the initial state machine is constructed, it will try to perform state fusion on similar states in the state machine according to the degree of similarity. The selection of the similar state in this embodiment is based on the BlueFringe algorithm, but the calculation method of the similarity is improved. Evaluate the similarity of two protocol states, based on the public input string suffix. The input string suffix means that the protocol entity starts from a certain protocol state and receives a series of input messages, and these input messages are represented as the input string suffix in the state machine. The "common" in the public input string suffix emphasizes that two different protocol states receive the same sequence of input packets. The public input string suffix can reflect the state transition and output response of the protocol entity when it receives the same input message when it is in two different protocol states. When calculating the similarity, the length of the public input string suffix of the protocol state will be considered, and whether the protocol entities have the same output results for the same input received. If two protocol states have the longest common input string suffix and have the same output for the same input, such two protocol states will be used as candidate states to try to fuse first.

两个候选状态是否能够融合，需要进行进一步的判定。因为协议状态机是基于训练样本构建的，而训练样本难以保证全面，这使得推断产生的协议状态机与真实的协议状态机可能存在差异。为了判断两个候选状态的融合是否可行，需要进一步有针对性的生成测试字符串进行测试，通过协议实体程序对于测试字符串的表现来肯定或者否定候选状态的融合，从而保证推断的协议状态机结果与真实的协议状态机高度吻合。Whether the two candidate states can be merged requires further determination. Because the protocol state machine is constructed based on training samples, and the training samples are difficult to guarantee comprehensiveness, there may be differences between the inferred protocol state machine and the real protocol state machine. In order to judge whether the fusion of two candidate states is feasible, it is necessary to further generate a test string for testing, and to affirm or deny the fusion of candidate states through the performance of the protocol entity program for the test string, so as to ensure the inferred protocol state machine The results are highly consistent with the real protocol state machine.

参考图3，测试字符串基于候选状态融合前的状态机产生，主要依据是状态机中两个候选状态的字符串前缀和字符串后缀。测试字符串的产生通过交叉拼接的方式，将到达某一个候选状态的所有字符串前缀与另外一个候选状态的所有字符串后缀依次拼接，生成的字符串组成测试字符串集合。例如，对于图3中状态融合前的状态机，状态1和状态3的相似度最高，将优先尝试融合。在构造测试字符串时，通过分析可知，状态1的字符串前缀集合为{<1>}，状态3的字符串前缀集合{<1,7>}。由于字符串前缀<1,7>包含了<1>，在融合过程中，将着眼于状态1的后缀集合{<7,2,5,14>}和状态3的后缀集合{<2,5,14>}，生成的测试字符串集合包含2个元素{<1,2,5,14>,<1,7,7,2,5,14>}。Referring to FIG. 3 , the test string is generated based on the state machine before candidate state fusion, mainly based on the string prefix and string suffix of the two candidate states in the state machine. The test strings are generated by cross-splicing, sequentially splicing all string prefixes reaching a certain candidate state with all string suffixes in another candidate state, and the generated strings form a test string set. For example, for the state machine before state fusion in Figure 3, state 1 and state 3 have the highest similarity, and fusion will be attempted first. When constructing the test string, the analysis shows that the set of string prefixes in state 1 is {<1>}, and the set of string prefixes in state 3 is {<1,7>}. Since the string prefix <1,7> contains <1>, during the fusion process, we will focus on the suffix set {<7,2,5,14>} of state 1 and the suffix set of state 3 {<2,5 ,14>}, the generated test string set contains 2 elements {<1,2,5,14>,<1,7,7,2,5,14>}.

利用测试字符串进行测试的过程中，首先依据测试字符串和已知的报文协议格式，将测试字符串实例化为输入报文序列。将输入报文序列发向协议实体程序，获取作为响应的输出报文序列。对输出报文序列进行抽象，将其表示为输出字符串序列。针对候选状态融合后的协议状态机，判断输出字符串序列中的输出字符，是否都存在于协议状态机对应状态的输出符号集中。如果所有输出字符串均与状态机中的一致，则认为状态融合是可行的，否则判断状态融合不可行，将测试结果作为新样本加入样本集并扩展协议状态机，继续在状态机中选择其他候选状态实施融合。In the process of testing with the test character string, firstly, the test character string is instantiated as an input message sequence according to the test character string and the known message protocol format. Send the input message sequence to the protocol entity program, and obtain the output message sequence as a response. Abstract the sequence of output messages and represent it as a sequence of output strings. For the protocol state machine after candidate state fusion, it is judged whether the output characters in the output character string sequence all exist in the output symbol set of the corresponding state of the protocol state machine. If all the output strings are consistent with those in the state machine, it is considered that the state fusion is feasible, otherwise it is judged that the state fusion is not feasible, the test result is added to the sample set as a new sample and the protocol state machine is expanded, and continue to select other in the state machine Candidate states implement fusion.

状态的融合操作将反复进行，直到状态机中没有可融合的状态为止。The state fusion operation will be repeated until there are no fusionable states in the state machine.

由以上本发明的技术方案可知，本发明的基于状态融合的协议状态机自动推断方法，在已有的报文协议格式推断技术的基础上，依据收集的报文样本构造增强前缀树转换器，通过对增强前缀树转换器中的同类状态进行融合，得到精简的协议状态机。采用此方法需要获得协议实体程序，并能够根据需要运行实体程序，向其发送特定的报文序列，并观察相应的报文输出，以此作为协议状态机推断的基础。It can be known from the above technical solutions of the present invention that the automatic state machine inference method based on state fusion of the present invention, on the basis of the existing message protocol format inference technology, constructs an enhanced prefix tree converter according to the collected message samples, A simplified protocol state machine is obtained by merging the same kind of states in the enhanced prefix tree converter. Adopting this method needs to obtain the protocol entity program, and be able to run the entity program as needed, send a specific message sequence to it, and observe the corresponding message output, as the basis for inferring the protocol state machine.

综上所述，本发明的基于状态融合的协议状态机自动推断方法将协议系统视为带输出的状态迁移系统，着眼于协议系统中输入输出报文之间的内在联系对协议状态机中的状态实施融合，通过采用增强前缀树转换器EPTT描述协议实体在会话过程涉及的输入、输出报文组成的抽象符号串，并通过与协议实体的测试性交互判定状态融合的可行性，有助于确保构建的协议状态机与实际协议系统高度吻合，确保了状态机推断的自动化，提高推断结果的准确度；而且针对待融合的候选状态自动化产生辅助判定的输入样本，进而通过协议实体程序的自动化运行掌握协议实体对于输入样本的输出响应，避免了人工判定的低效，提高了状态机逆向推断的准确度和整体效率。In summary, the protocol state machine automatic inference method based on state fusion of the present invention regards the protocol system as a state transition system with output, and focuses on the internal relationship between input and output messages in the protocol system. State implementation fusion, through the use of enhanced prefix tree converter EPTT to describe the abstract symbol string composed of input and output messages involved in the session process of the protocol entity, and determine the feasibility of state fusion through test interaction with the protocol entity, which is helpful Ensure that the constructed protocol state machine is highly consistent with the actual protocol system, ensure the automation of state machine inference, and improve the accuracy of inference results; and automatically generate auxiliary judgment input samples for the candidate states to be fused, and then through the automation of the protocol entity program The operation masters the output response of the protocol entity to the input sample, avoids the inefficiency of manual judgment, and improves the accuracy and overall efficiency of the reverse inference of the state machine.

虽然本发明已以较佳实施例揭露如上，然其并非用以限定本发明。本发明所属技术领域中具有通常知识者，在不脱离本发明的精神和范围内，当可作各种的更动与润饰。因此，本发明的保护范围当视权利要求书所界定者为准。Although the present invention has been disclosed above with preferred embodiments, it is not intended to limit the present invention. Those skilled in the art of the present invention can make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention should be defined by the claims.

Claims

1. A method for automatically inferring a protocol state machine based on state fusion, characterized in that, comprising the following steps:

(1) Message format extraction and message classification: obtain the specific format information of the input and output messages related to the protocol entity program, and classify the input and output messages according to the message format, and group the message samples with the same structure into is a category, and the category information of the classification is represented by an abstract symbol;

(2) Session abstraction and initial state machine construction: Based on the classification categories represented by abstract symbols, the network communication behavior is abstracted with sessions as units, and the input and output message sequences in the session process are described as abstract input and output symbol strings. Then, according to the session sample set, construct an initial state machine consistent with the set of input and output symbol strings;

(3) State fusion based on the output message: The candidate states are fused according to the similarity, and a test symbol string is generated, and then through an automated test, the protocol entity is compared with the fused state machine after receiving the test symbol string. output response to verify the feasibility of state fusion;

(4) Repeat the above step (3) until there are no states that meet the fusion conditions in the state machine;

The workflow of the aforementioned session abstraction and initial state machine construction phase is as follows: the inference of the state machine is built on the basis of the session sample set, and the input and output messages in the session are represented by abstract symbols corresponding to their categories, so that the input of the complete session The output message sequence is converted into an abstract input and output symbol string; on this basis, according to the session sample set, the initial state machine is constructed in the form of an enhanced prefix tree converter EPTT, and the initial state machine contains all input and output as session samples symbol string;

The workflow of the state fusion stage based on the output message is as follows: on the basis of the initial state machine, two similar states are fused each time according to the degree of similarity, and the selection of similar states is based on the BlueFringe algorithm. The two states with the highest degrees are the candidate states to be fused; whether the fusion of the candidate states is feasible or not will be judged based on the generated test string to determine whether the two candidate states can be fused, wherein: the test string is based on the arrival of the original state machine The string prefixes and string suffixes of the two candidate states are constructed. Through cross splicing, all the string prefixes reaching a certain candidate state are sequentially spliced with all the string suffixes of the other candidate state, and the generated strings form a test String collection, if it is judged that all the output strings are consistent with the fused protocol state machine, it is considered that the state fusion is feasible, otherwise it is judged that the state fusion is not feasible; the test results are added to the session sample set as a new sample and the protocol state machine is extended , and continue to try to fuse other states.

2. The method according to claim 1, wherein, in the aforementioned method, when selecting a candidate state to be fused in the protocol state machine, based on the BlueFringe algorithm, the calculation of the similarity is based on the public input character string of the protocol state Suffix, the public input string suffix reflects the state transition and output response of the protocol entity when it receives the same input message when it is in two different protocol states, where: the input string suffix means that the protocol entity starts from a certain protocol state , a series of input messages are received, and these input messages are represented as input string suffixes in the state machine; the common input string suffix means that two different states receive the same series of input messages; the similarity The calculation will consider the length of the common input string suffix of the protocol state, and whether the protocol entity produces the same output result when receiving the same input, if two protocol states have the longest length of their common input string suffix, and for the same Inputs have the same output, such two protocol states will preferentially attempt to fuse.

3. The method according to claim 1, characterized in that, in the aforementioned method, the process of utilizing the test character string set to carry out state fusion and feasibility determination comprises the following steps: first according to the test character string and known message protocol Format, generate the input message sequence as a test case; send the input message sequence to the protocol entity program, and obtain the output message sequence as a response; abstract the output message sequence and express it as an output string sequence; The protocol state machine after the fusion of candidate states judges whether the output characters in the output string sequence exist in the output symbol set corresponding to the state of the protocol state machine: if all the output strings are consistent with the fused protocol state machine, the state is considered Fusion is feasible, otherwise it is judged that state fusion is not feasible.