[go: up one dir, main page]

CN113360726A - Parallel regular expression matcher - Google Patents

Parallel regular expression matcher Download PDF

Info

Publication number
CN113360726A
CN113360726A CN202110632853.7A CN202110632853A CN113360726A CN 113360726 A CN113360726 A CN 113360726A CN 202110632853 A CN202110632853 A CN 202110632853A CN 113360726 A CN113360726 A CN 113360726A
Authority
CN
China
Prior art keywords
module
regular
matching
regular expression
pure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110632853.7A
Other languages
Chinese (zh)
Other versions
CN113360726B (en
Inventor
苟鹏飞
陆泳
刘扬帆
徐越
杨浩
施葹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingxin Semiconductor Technology Shanghai Co ltd
Original Assignee
Qingxin Semiconductor Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingxin Semiconductor Technology Shanghai Co ltd filed Critical Qingxin Semiconductor Technology Shanghai Co ltd
Priority to CN202110632853.7A priority Critical patent/CN113360726B/en
Publication of CN113360726A publication Critical patent/CN113360726A/en
Application granted granted Critical
Publication of CN113360726B publication Critical patent/CN113360726B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9014Indexing; Data structures therefor; Storage structures hash tables
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9017Indexing; Data structures therefor; Storage structures using directory or table look-up

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

本发明提供了一种并行正则表达式匹配器,包括:软件预编译单元,被配置为将正则表达式规则集中的多个正则表达式规则进行预编译,转换为符合硬件处理行为的格式的正则表达式规则,以供硬件匹配电路进行匹配;硬件匹配电路,被配置为将待匹配报文和符合硬件处理行为的格式的正则表达式规则进行匹配,得出匹配结果。

Figure 202110632853

The present invention provides a parallel regular expression matcher, comprising: a software precompiling unit configured to precompile multiple regular expression rules in a regular expression rule set, and convert them into regular expressions in a format conforming to hardware processing behaviors The expression rule is used for matching by the hardware matching circuit; the hardware matching circuit is configured to match the to-be-matched packet with the regular expression rule in the format conforming to the hardware processing behavior to obtain a matching result.

Figure 202110632853

Description

Parallel regular expression matcher
Technical Field
The invention relates to the technical field of regular expressions, in particular to a parallel regular expression matcher.
Background
Regular expressions are literal patterns consisting of common characters (letters/numbers/symbols) and special characters (also called "meta characters") that describe one or more strings of characters to be matched when searching for text. It is widely used in various fields of data processing. In high speed, large bandwidth network and database applications, extremely high requirements are placed on the processing speed of regular expressions. This requirement mainly includes two aspects: (1) processing bandwidth of data to be matched; (2) the number of regular expressions matched in parallel.
The current products are mainly implemented in software. Known regular expression matching software (libraries) contain: POSIX RE API, PCRE, Google RE2, Onikuma, Hyperscan, etc. At the heart of the implementation is a finite state automaton, which can be classified as Deterministic Finite Automata (DFA), non-deterministic finite automata (NFA) and combinations thereof. For example, fig. 1 shows a finite automaton constructed for regular expression a (b | c). Each circle represents a state and the double-line circle represents the end (match) state. When characters of the text are input into the automaton one by one, if the current character is the same as the character on the arrow, the next state can be advanced by the arrow. If the end state can be reached, this indicates that a match has occurred (Matched). If the end state is not reached when all characters of the text are scanned, no match is indicated.
The regular expression matching of software is characterized by flexible use and has the defects of low speed and limited parallel processing capability. The software (libraries) described above use methods such as state graph merging of multiple regular expressions, multithreading, phased matching, utilization of Single Instruction Multiple Data (SIMD) instructions, etc. to ameliorate the above disadvantages, but still fail to meet the increasing data processing speed requirements. For example, at present, Hyperscan, which is the fastest and widely adopted by mainstream data center manufacturers, can achieve the single-thread matching speed of a Snort regular expression set of 7 Gbit/sec. However, multithreading does not provide a linear speed increase, and differences in the text to be scanned can sometimes significantly affect speed. Other commonly used regular expression matching software (libraries) tend to have matching speeds of only tens or hundreds of mbits/sec. Furthermore, software algorithms are difficult to implement if the regular expression processing is to be embedded in a network processing device or a digital signal processing device (such devices often have no CPU or a low performance microcontroller).
The existing digital circuit implementation of regular expressions, which are generally simple DFA (deterministic finite state automata) or NFA (non-deterministic finite state automata) circuits, can achieve better processing speed (e.g., several Gbit/sec to several tens of Gbit/s) for a single regular expression, but when the number of regular expressions increases, the speed decreases rapidly in inverse proportion. When the number of regular expressions is 100, the speed is not as high as that of software, and the practical application value is low.
The problems with the current products/processes are: software regular expression implementations are flexible to use, but are slow and cannot keep up with the data entry speed of today's high speed networks (25Gbps, 100Gbps) or high speed interfaces (PCIe Gen3x 16). The existing digital circuit implementation cannot efficiently process a plurality of regular expressions simultaneously.
Disclosure of Invention
The invention aims to provide a parallel regular expression matcher so as to solve the problem that the conventional regular expression is slow in speed or cannot be processed in parallel.
In order to solve the above technical problem, the present invention provides a parallel regular expression matcher, including:
the software pre-compiling unit is configured to pre-compile a plurality of regular expression rules in the regular expression rule set, convert the regular expression rules into the regular expression rules in a format conforming to the hardware processing behavior, and enable the hardware matching circuit to match the regular expression rules;
and the hardware matching circuit is configured to match the message to be matched with the regular expression rule in the format conforming to the hardware processing behavior to obtain a matching result.
Optionally, in the parallel regular expression matcher, the software pre-compiling unit includes a splitting module, a state merging module, and a configuration table module, where:
the splitting module is used for splitting a plurality of regular expression rules into pure character strings and regular units;
the state merging module is used for merging the states of the pure character strings and the regular units, extracting the same pure character strings and regular units and distinguishing different pure character strings and regular units;
and the configuration table module encapsulates the pure character strings and the regular units into a regular expression pattern library according to a format conforming to the hardware processing behavior, so as to be used for carrying out parameter configuration and state table initialization on the hardware matching circuit.
Optionally, in the parallel regular expression matcher, the regular expression pattern library includes a pure character string group, a regular unit group, and a metadata group, where:
the pure character string group is used for storing pure character strings and configuring the relation between the pure character strings according to whether the pure character strings are the same or not;
the regular unit group is used for storing the regular units and configuring the relationship among the regular units according to whether the regular units are the same or not;
and the metadata group is used for storing metadata and configuring the relation between the pure character string and the regular unit according to the metadata.
Optionally, in the parallel regular expression matcher, the hardware matching circuit includes a pattern input module, a data input module, a regular matching core module, and a matching result output module, where:
the mode input module performs parameter configuration on the regular matching core module according to the pure character strings in the pure character string group and the regular units in the regular unit group, and performs state table initialization on the regular matching core module according to the metadata in the metadata group;
the data input module is used for providing a message to be matched to the regular matching core module;
the regular matching core module is used for matching the message to be matched with the regular expression rule to generate a matching result;
the matching result output module is used for outputting a matching result;
the mode input module, the data input module and the matching result output module are in bus protocol bridging or format conversion with the regular matching core module.
Optionally, in the parallel regular expression matcher, the regular matching core module includes a character string preprocessing module, an operation scheduling module, and an execution engine module, where:
the character string preprocessing module is used for scanning the message to be matched according to the pure character strings in the pure character string group to generate a character string queue;
the operation scheduling module is used for executing a corresponding program according to the character string queue and generating a request queue;
and the execution engine module scans the messages to be matched according to the request queue and judges whether the messages to be matched are matched.
Optionally, in the parallel regular expression matcher, the character string preprocessing module includes a text scanning module and a text confirmation module, where:
the text scanning module performs pre-matching on the pure character strings in the message to be matched according to the pure character strings in the pure character string group by using a Shift-OR algorithm circuit to obtain a coarse screening result;
and the text confirmation module confirms the pure character strings in the coarse screening result again according to the pure character strings in the pure character string group by using a hardware Hash algorithm to obtain a fine screening result.
Optionally, in the parallel regular expression matcher, the Shift-OR algorithm circuit simultaneously performs pipeline scanning on 16 characters;
after scanning by a Shift-OR algorithm circuit, putting the byte sequence number to be confirmed of the matched data packet into a first-in first-out queue of an index data structure of a text confirmation module for the text confirmation module to use;
the text confirmation module reads in a first-in first-out queue of an index data structure, forms and queries an index table of a fine screening module to obtain a byte sequence number of a matched data packet, then calculates a hash value, and uses the hash value to search a character string information table to obtain a character string length and a character string address;
a string queue is generated from the string length and the string address.
Optionally, in the parallel regular expression matcher, the operation scheduler is responsible for forming a corresponding program according to the character string request information, and selecting a corresponding execution engine module, where each program includes an instruction, an operation code, and an operand, and sends each instruction to the request queue for the subsequent execution engine module to call.
Optionally, in the parallel regular expression matcher, the execution engine module includes a character comparison engine, an NFA engine, a report engine, and a message buffer;
the request queue comprises a string comparison request, a regular NFA request and a result report request;
the character comparison engine accurately compares the fine screening results according to the character string comparison request, and the final comparison comprises the step of comparing the ultra-long pure character strings one by one;
the NFA engine performs corresponding NFA state machine operation on a message to be matched according to a regular NFA request, wherein the NFA state machine operation comprises selecting corresponding NFA in a field corresponding to the message to be matched according to input request information, performing NFA state machine operation, and finally generating a matching result for a subsequent report engine to use;
the report engine outputs a final matching result according to the result report request, and comprises the steps of calculating a matching position according to the output of the character comparison engine and the NFA engine, extracting a matched serial number, obtaining a matching result and outputting the matching result;
the message buffer is used for storing and providing message data to be matched required by various execution engines.
Optionally, in the parallel regular expression matcher, the quantity ratio and the operating frequency of the character comparison engine and the NFA engine are balanced according to software performance evaluation and physical characteristics of a hardware circuit;
selecting a typically applied regular expression rule set and a message to be matched for performance estimation, determining the scale of a character string preprocessing module under a certain matching probability, and determining the number and the proper proportion of a character comparison engine and an NFA engine;
and further determining the parameter configuration of each module according to the physical characteristics of the hardware implementation mode.
In the parallel regular expression matcher provided by the invention, a plurality of regular expression rules in the regular expression rule set are precompiled by a software precompiling unit and are converted into the regular expression rules in accordance with the format of the hardware processing behavior for being matched by a hardware matching circuit, so that the hardware matching circuit can match a message to be matched with the regular expression rules in accordance with the format of the hardware processing behavior to obtain a matching result, a novel digital circuit matched by parallel regular expressions is realized, and the parallel matching of a plurality of regular expressions can be supported.
Drawings
FIG. 1 is a schematic diagram of a finite automaton constructed by a conventional regular expression a (b | c);
FIG. 2 is a schematic diagram of a parallel regular expression matcher in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a regular expression matching core in a hardware circuit of a parallel regular expression matcher in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a text scanning module of a regular expression matching core in an embodiment of the present invention;
FIG. 5 is a block diagram of a text validation module for a regular expression matching kernel in an embodiment of the invention;
FIG. 6 is a schematic diagram of an operation scheduler for a regular expression matching core in one embodiment of the invention;
FIG. 7 is a schematic diagram of a string comparison engine of a regular expression matching core in an embodiment of the invention;
FIG. 8 is a diagram of an NFA engine for regular expression matching core in an embodiment of the invention.
Detailed Description
In the present invention, the embodiments are only intended to illustrate the aspects of the present invention, and should not be construed as limiting.
It is further noted herein that in embodiments of the present invention, only a portion of the components or assemblies may be shown for clarity and simplicity, but those of ordinary skill in the art will appreciate that, given the teachings of the present invention, required components or assemblies may be added as needed in a particular scenario. Furthermore, features from different embodiments of the invention may be combined with each other, unless otherwise indicated. For example, a feature of the second embodiment may be substituted for a corresponding or functionally equivalent or similar feature of the first embodiment, and the resulting embodiments are likewise within the scope of the disclosure or recitation of the present application.
It is also noted herein that, within the scope of the present invention, the terms "same", "equal", and the like do not mean that the two values are absolutely equal, but allow some reasonable error, that is, the terms also encompass "substantially the same", "substantially equal". By analogy, in the present invention, the terms "perpendicular", "parallel" and the like in the directions of the tables also cover the meanings of "substantially perpendicular", "substantially parallel".
The numbering of the steps of the methods of the present invention does not limit the order of execution of the steps of the methods. Unless specifically stated, the method steps may be performed in a different order.
The parallel regular expression matcher provided by the invention is further explained in detail in the following by combining the drawings and specific embodiments. Advantages and features of the present invention will become apparent from the following description and from the claims. It is to be noted that the drawings are in a very simplified form and are not to precise scale, which is merely for the purpose of facilitating and distinctly claiming the embodiments of the present invention.
Furthermore, features from different embodiments of the invention may be combined with each other, unless otherwise indicated. For example, a feature of the second embodiment may be substituted for a corresponding or functionally equivalent or similar feature of the first embodiment, and the resulting embodiments are likewise within the scope of the disclosure or recitation of the present application.
The core idea of the invention is to provide a parallel regular expression matcher to solve the problem that the existing regular expression is slow or cannot be processed in parallel.
To achieve the above idea, the present invention provides a parallel regular expression matcher, including: the software pre-compiling unit is configured to pre-compile a plurality of regular expression rules in the regular expression rule set, convert the regular expression rules into the regular expression rules in a format conforming to the hardware processing behavior, and enable the hardware matching circuit to match the regular expression rules; and the hardware matching circuit is configured to match the message to be matched with the regular expression rule in the format conforming to the hardware processing behavior to obtain a matching result.
The novel parallel regular expression matching digital circuit provided by the invention can support parallel matching of a plurality of regular expressions (for example, 1000 regular expressions), and simultaneously improves the matching throughput bandwidth (more than 50 Gbit/s).
The invention provides a novel regular expression digital circuit implementation method and a novel regular expression digital circuit implementation device, which can efficiently process a large number of parallel regular expressions according to the principle as shown in figure 2:
and splitting the regular expression rule into a pure String (live String) and a regular unit (Regex), and respectively carrying out matching scanning. For example, the regular expression is hello (e | f), which can be separated into a pure string hello and a regular cell (e | f). If the substring hello of a regular expression fails to match, then the entire regular expression must not match. Further, the substring can be further distinguished into a prefix (prefix) and a suffix (suffix).
And in the software compiling process, performing state merging on a plurality of regular expressions. Multiple regular expressions in practical application have certain similarity: containing the same substrings or the same regular finite automata structure. The partial differentiation and extraction, which can be "merged" for this part, is the key to improving the parallel quantity processing capability for regular expressions.
A multi-stage scanning method. In order to achieve a target processing speed with a smaller hardware circuit scale, three-stage scanning is divided: the first stage preprocesses the pure character string, and can judge that the text (message) to be scanned and the sub character string are not matched with each other OR are matched with each other definitely (the messages which are not matched with each other definitely are directly skipped over subsequently), and the hardware Shift-OR algorithm with low use cost and high efficiency is called as a coarse screen; the second stage is to Confirm (Confirm) the text of "possible matching", it adopts the hardware hash (hash) algorithm, can further Confirm the matching situation of the pure string, judge "affirmatively mismatch" and "most likely match" message, call "fine screen"; the third level is an architecture similar to a processor. The working principle is as follows: the 'operation and request co-processor' performs operation scheduling on the preprocessing result, and selects a corresponding Program (Program) to execute. A Program (Program) execution unit includes three types of engines: a string matching engine, a canonical NFA engine, and a results reporting engine. Wherein the number of string matching engines and regular NFA engines can be configured. Depending on the preprocessing results, the operation scheduler is responsible for generating the corresponding request sequences, possibly invoking one, two or three of the execution units.
The embodiment of the invention provides a parallel regular expression matcher, which comprises: the software pre-compiling unit is configured to pre-compile a plurality of regular expression rules in the regular expression rule set, convert the regular expression rules into the regular expression rules in a format conforming to the hardware processing behavior, and enable the hardware matching circuit to match the regular expression rules; and the hardware matching circuit is configured to match the message to be matched with the regular expression rule in the format conforming to the hardware processing behavior to obtain a matching result.
In an embodiment of the present invention, in the parallel regular expression matcher, the software pre-compiling unit includes a splitting module, a state merging module, and a configuration table module, where: the splitting module is used for splitting a plurality of regular expression rules into pure character strings and regular units; the state merging module is used for merging the states of the pure character strings and the regular units, extracting the same pure character strings and regular units and distinguishing different pure character strings and regular units; and the configuration table module encapsulates the pure character strings and the regular units into a regular expression pattern library according to a format conforming to the hardware processing behavior, so as to be used for carrying out parameter configuration and state table initialization on the hardware matching circuit. Specifically, a software method is used for carrying out state merging (precompilation) on a plurality of regular expressions, and in the precompilation process, the regular expressions are divided into a pure character string group and a regular unit group. And then packaging the regular expression matching cores into a regular expression pattern library (pattern database) according to a format conforming to the hardware processing behavior, and performing parameter configuration and state table initialization on the regular expression matching cores in the hardware circuit.
In an embodiment of the present invention, in the parallel regular expression matcher, the regular expression pattern library includes a pure string group, a regular unit group, and a metadata group, where: the pure character string group is used for storing pure character strings and configuring the relation between the pure character strings according to whether the pure character strings are the same or not; the regular unit group is used for storing the regular units and configuring the relationship among the regular units according to whether the regular units are the same or not; and the metadata group is used for storing metadata and configuring the relation between the pure character string and the regular unit according to the metadata.
In an embodiment of the present invention, in the parallel regular expression matcher, as shown in fig. 3, the hardware matching circuit includes a pattern input module, a data input module, a regular matching core module, and a matching result output module, wherein: the mode input module performs parameter configuration on the regular matching core module according to the pure character strings in the pure character string group and the regular units in the regular unit group, and performs state table initialization on the regular matching core module according to the metadata in the metadata group; the data input module is used for providing a message to be matched to the regular matching core module; the regular matching core module is used for matching the message to be matched with the regular expression rule to generate a matching result; the matching result output module is used for outputting a matching result; the mode input module, the data input module and the matching result output module are in bus protocol bridging or format conversion with the regular matching core module. Namely, the hardware circuit consists of a mode input module, a data input module, a matching result output module and a regular matching core. The mode input module, the data module and the matching result output module carry out bus protocol bridging and format conversion.
In an embodiment of the present invention, in the parallel regular expression matcher, as shown in fig. 3, the regular matching core module includes a character string preprocessing module, an operation scheduling module, and an execution engine module, where: the character string preprocessing module is used for scanning the message to be matched according to the pure character strings in the pure character string group to generate a character string queue; the operation scheduling module is used for executing a corresponding program according to the character string queue and generating a request queue; and the execution engine module scans the messages to be matched according to the request queue and judges whether the messages to be matched are matched.
In an embodiment of the present invention, in the parallel regular expression matcher, the character string preprocessing module includes a text scanning module and a text confirmation module, wherein: the text scanning module performs pre-matching on the pure character strings in the message to be matched according to the pure character strings in the pure character string group by using a Shift-OR algorithm circuit to obtain a coarse screening result; and the text confirmation module confirms the pure character strings in the coarse screening result again according to the pure character strings in the pure character string group by using a hardware Hash algorithm to obtain a fine screening result.
In an embodiment of the present invention, in the parallel regular expression matcher, the execution engine module includes a character comparison engine, an NFA engine, a report engine, and a message buffer; the request queue comprises a string comparison request, a regular NFA request and a result report request; the character comparison engine accurately compares the fine screening results according to the character string comparison request; the NFA engine performs corresponding NFA state machine operation on a message to be matched according to the regular NFA request; the report engine outputs a final matching result according to the result report request; the message buffer is used for storing and providing message data to be matched required by various execution engines.
In an embodiment of the invention, in the parallel regular expression matcher, the Shift-OR algorithm circuit simultaneously performs pipeline scanning on 16 characters; after the data packets are scanned by a Shift-OR algorithm circuit, the byte sequence numbers of the matched data packets are put into a first-in first-out queue of an index data structure of a text confirmation module for the text confirmation module to use; the text confirmation module reads in a first-in first-out queue of an index data structure, forms and queries an index table of a fine screening module to obtain a byte sequence number of a matched data packet, then calculates a hash value, and uses the hash value to search a character string information table to obtain a character string length and a character string address; generating a character string queue according to the character string length and the character string address; the operation scheduler is responsible for forming corresponding programs according to the character string request information and selecting corresponding execution engine modules, each program comprises an instruction, an operation code and an operand, and each instruction is sent into a request queue for being called by the subsequent execution engine modules.
Specifically, as shown in fig. 4, in the character string preprocessing step, the text scanning uses a Shift-OR circuit in parallel. In this example, it scans 16 characters simultaneously. This scanning process is pipelined so that 16 characters (bytes) can be processed per clock cycle. After scanning, the matched byte sequence number of the message is put into a byte sequence number FIFO to be confirmed for a subsequent text confirmation module to use. There may still be "false matches" for the match at this point, requiring further confirmation. As shown in fig. 5, the text confirmation module reads in the byte sequence number generated by the previous scanning, queries the confirmation packet index table, calculates the hash value, and finally searches the character string information table by using the hash value to obtain the information such as the length and address of the character string. This information will eventually generate a string request containing the screened matched address, length and grouping for use by subsequent modules. As shown in fig. 6, the operation scheduler is responsible for forming a Program (Program) according to the character string request information and selecting a corresponding execution unit. Each program consists of a series of instructions, each instruction containing an opcode and an operand. Each instruction is placed into a request queue for subsequent execution engine calls.
Examples of procedures are:
Figure BDA0003104332890000101
Figure BDA0003104332890000111
where min _ offset, fail _ jump, etc. are opcodes followed by operands, and Program State Machine places them in an Instruction Buffer and organizes them into parallel request queues.
In an embodiment of the present invention, in the parallel regular expression matcher, a character string comparison engine performs final one-by-one comparison of ultra-long pure character strings to obtain an accurate conclusion of whether matching is performed (since both coarse screening and fine screening are misjudgments that pursue efficiency and allow a certain probability, each byte needs to be finally confirmed word by word); the NFA engine selects a corresponding NFA in a corresponding field of a message to be matched according to the input request information, performs NFA state machine operation, and finally generates a matching result for a subsequent report engine to use; and the report engine corrects the matching position according to the output of the character string comparison engine and the NFA engine, extracts the serial number of the matched rule, forms a matching result and outputs the matching result.
Specifically, as shown in fig. 7, in the program execution unit, the string comparison engine is responsible for final one-by-one comparison of matching of very long strings (greater than 8 bytes). As shown in fig. 8, the regular NFA engine is responsible for implementing the NFA state machine. The engine selects corresponding NFA in the corresponding field of the message according to the input request information, and performs NFA state machine operation. The state machine will eventually generate a matching result for use by subsequent reporting engines. The result reporting engine calculates a matching position (offset) from the outputs of the first two matching engines, extracts a matched rule number (ID), forms a matching result, and outputs the matching result.
In one embodiment of the invention, in the parallel regular expression matcher, the scale of a character string preprocessing module, the quantity and the proportion of a character string comparison engine and an NFA engine, and the running frequency are balanced according to software performance evaluation and the physical characteristics of a hardware circuit; selecting a typically applied regular expression rule set and a message to be matched for performance pre-estimation, determining the scale of a character string preprocessing module under a certain matching probability, and determining the number and the proper proportion of a character comparison engine and an NFA engine; and further determining the parameter configuration of each module according to the physical characteristics of the hardware implementation mode.
Specifically, the scale of the character string preprocessing module, the number ratio of the character string comparison engine and the regular NFA engine, and the operating frequency need to be balanced according to the software performance evaluation and the physical characteristics of the hardware circuit. Selecting a regular expression set and a text (message) to be scanned, which are typically applied, to perform performance estimation, determining the scale of a character string preprocessing module under a certain matching probability, and determining the number and the proper proportion of a character comparison engine and an NFA engine. Then, according to the physical characteristics of the hardware implementation, for example, the limitation and the synthesizable frequency of the circuit logic resources are considered for the FPGA, and the characteristics of the tape-out process library and the integration of the input and output interfaces are considered for the ASIC chip, the parameter configuration (bus bit width, operating frequency, memory cell size, etc.) of each module can be further determined. For different purposes, the technical scheme of the invention can be presented as different parameter configurations.
In the regular expression matcher and the matching method provided by the invention, a plurality of regular expression rules in a regular expression rule set are precompiled by a software precompiling unit and are converted into the regular expression rules in accordance with the format of the hardware processing behavior, so that a hardware matching circuit can match a message to be matched with the regular expression rules in accordance with the format of the hardware processing behavior to obtain a matching result, a novel digital circuit matched with the parallel regular expressions is realized, and the parallel matching of a plurality of regular expressions can be supported.
The canonical matching kernel consists of three parts: a) a character string preprocessing module; b) an operation scheduling module; c) and executing the engine module. The three parts carry out stream processing on the input message and output a matching result according to a preset rule.
In the processing process, an input message firstly passes through a character string preprocessing module to judge whether the input message is matched with a character string in a rule. If not, the message is not matched with the rule, so that the message does not need to be sent to a subsequent module for processing; if so, subsequent processing is required. The module consists of two sub-modules, wherein a text Scan (Literal Scan) is responsible for the first step of coarse screening and determines the possible matching positions of the character strings; the text confirmation (Literal Confirm) module is responsible for further screening using computed hash values and comparisons for possible matching locations.
After character string preprocessing confirmation, a series of position information which can be matched and coding information of associated rules are obtained. They are put into a string queue. Then, the program scheduler will select the corresponding program to execute according to the information in the character string queue. During program execution, corresponding requests are generated according to rules, and the requests comprise character string comparison, regular NFA, result report and the like. All requests are put into the request queue and the execution flow is entered once the subsequent execution units are ready.
The execution engine generally performs three steps of string precise word-by-word comparison, regular NFA and result reporting, where the string word-by-word comparison is responsible for confirming a matching result of an ultra-long string (more than 8 bytes), the NFA is responsible for performing corresponding NFA state machine operation, and the reporting engine outputs a final matching result. During operation, a dedicated message cache may be responsible for providing the message data stream required by the engine.
The key points of the invention are as follows:
a regular expression matcher implemented in a hardware circuit, wherein:
a multi-stage hardware scanning circuit for parallel matching regular expressions: in the first stage, text quick scanning of a Shift-OR algorithm is performed; secondly, obtaining a possible matching result and then using a text matching confirmation process of a Hash algorithm; third, operating the scheduler and the associated execution engine: including a string comparison engine, a canonical NFA engine, and a result reporting engine.
The software pre-compiling method applicable to the multistage hardware scanning circuit comprises the following steps: firstly, extracting pure character string substrings and regular units when a regular expression pattern library is precompiled. And secondly, performing common part extraction and merging on a plurality of rules when the regular expression pattern library is precompiled, and encoding the rules into a format suitable for a hardware memory.
In summary, the above embodiments have described different configurations of the rule expression matcher and the matching method in detail, and it is understood that the present invention includes, but is not limited to, the configurations listed in the above embodiments, and any modifications based on the configurations provided in the above embodiments are within the scope of the present invention. One skilled in the art can take the contents of the above embodiments to take a counter-measure.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The above description is only for the purpose of describing the preferred embodiments of the present invention, and is not intended to limit the scope of the present invention, and any variations and modifications made by those skilled in the art based on the above disclosure are within the scope of the appended claims.

Claims (10)

1.一种并行正则表达式匹配器,其特征在于,包括:1. a parallel regular expression matcher, is characterized in that, comprises: 软件预编译单元,被配置为将正则表达式规则集中的多个正则表达式规则进行预编译,转换为符合硬件处理行为的格式的正则表达式规则,以供硬件匹配电路进行匹配;The software pre-compilation unit is configured to pre-compile multiple regular expression rules in the regular expression rule set, and convert them into regular expression rules in a format conforming to the hardware processing behavior, so as to be matched by the hardware matching circuit; 硬件匹配电路,被配置为将待匹配报文和符合硬件处理行为的格式的正则表达式规则进行匹配,得出匹配结果。The hardware matching circuit is configured to match the to-be-matched packet with a regular expression rule in a format conforming to the hardware processing behavior to obtain a matching result. 2.如权利要求1所述的并行正则表达式匹配器,其特征在于,所述软件预编译单元包括拆分模块、状态归并模块和配置表模块,其中:2. parallel regular expression matcher as claimed in claim 1, is characterized in that, described software precompiling unit comprises splitting module, state merging module and configuration table module, wherein: 所述拆分模块用于将多个正则表达式规则拆分为纯字符串和正则单元;The splitting module is used to split multiple regular expression rules into pure strings and regular units; 所述状态归并模块用于将纯字符串和正则单元进行状态归并,提取出相同的纯字符串和正则单元,区分相异的纯字符串和正则单元;The state merging module is used for state merging of pure strings and regular units, extracting the same pure strings and regular units, and distinguishing different pure strings and regular units; 所述配置表模块将纯字符串和正则单元按符合硬件处理行为的格式封装成正则表达式模式库,以用于对硬件匹配电路进行参数配置和状态表初始化。The configuration table module encapsulates pure strings and regular units into a regular expression pattern library in a format conforming to hardware processing behavior, so as to perform parameter configuration and state table initialization for the hardware matching circuit. 3.如权利要求2所述的并行正则表达式匹配器,其特征在于,所述正则表达式模式库包括纯字符串组、正则单元组和元数据组,其中:3. The parallel regular expression matcher of claim 2, wherein the regular expression pattern library comprises a pure string group, a regular unit group and a metadata group, wherein: 所述纯字符串组用于存储纯字符串,并根据各个纯字符串是否相同而配置纯字符串之间的关系;The pure string group is used to store the pure strings, and the relationship between the pure strings is configured according to whether each pure string is the same; 所述正则单元组用于存储正则单元,并根据各个正则单元是否相同而配置正则单元之间的关系;The regular unit group is used to store regular units, and configure the relationship between regular units according to whether each regular unit is the same; 所述元数据组用于存储元数据,并根据元数据配置纯字符串与正则单元之间的关系。The metadata group is used to store metadata, and configure the relationship between the plain string and the regular unit according to the metadata. 4.如权利要求3所述的并行正则表达式匹配器,其特征在于,所述硬件匹配电路包括模式输入模块、数据输入模块、正则匹配核心模块及匹配结果输出模块,其中:4. The parallel regular expression matcher of claim 3, wherein the hardware matching circuit comprises a pattern input module, a data input module, a regular matching core module and a matching result output module, wherein: 所述模式输入模块根据纯字符串组中的纯字符串和正则单元组中的正则单元对正则匹配核心模块进行参数配置,以及根据元数据组中的元数据对正则匹配核心模块进行状态表初始化;The pattern input module performs parameter configuration on the regular matching core module according to the pure character string in the pure character string group and the regular unit in the regular unit group, and initializes the state table for the regular matching core module according to the metadata in the metadata group ; 所述数据输入模块用于将待匹配报文提供至正则匹配核心模块;The data input module is used to provide the message to be matched to the regular matching core module; 所述正则匹配核心模块用于将待匹配报文与正则表达式规则进行匹配,生成匹配结果;The regular matching core module is used to match the message to be matched with the regular expression rule to generate a matching result; 所述匹配结果输出模块用于输出匹配结果;The matching result output module is used for outputting matching results; 其中,模式输入模块、数据输入模块和匹配结果输出模块与正则匹配核心模块进行总线协议桥接或格式转换。Among them, the mode input module, the data input module and the matching result output module and the regular matching core module perform bus protocol bridging or format conversion. 5.如权利要求4所述的并行正则表达式匹配器,其特征在于,所述正则匹配核心模块包括字符串预处理模块、操作调度模块和执行引擎模块,其中:5. The parallel regular expression matcher of claim 4, wherein the regular matching core module comprises a string preprocessing module, an operation scheduling module and an execution engine module, wherein: 所述字符串预处理模块用于根据纯字符串组中的纯字符串对待匹配报文进行扫描,生成字符串队列;The character string preprocessing module is used to scan the message to be matched according to the pure character string in the pure character string group to generate a character string queue; 所述操作调度模块用于根据字符串队列执行相应的程序,生成请求队列;The operation scheduling module is used to execute a corresponding program according to the string queue to generate a request queue; 所述执行引擎模块根据请求队列对待匹配报文进行扫描,判断待匹配报文是否匹配。The execution engine module scans the to-be-matched packets according to the request queue, and determines whether the to-be-matched packets match. 6.如权利要求5所述的并行正则表达式匹配器,其特征在于,所述字符串预处理模块包括文本扫描模块和文本确认模块,其中:6. The parallel regular expression matcher of claim 5, wherein the character string preprocessing module comprises a text scanning module and a text verification module, wherein: 所述文本扫描模块利用Shift-OR算法电路,根据纯字符串组中的纯字符串对待匹配报文中的纯字符串进行预匹配,得到粗筛结果;The text scanning module utilizes the Shift-OR algorithm circuit to pre-match the pure character string in the message to be matched according to the pure character string in the pure character string group to obtain a rough screening result; 所述文本确认模块利用硬件哈希算法,根据纯字符串组中的纯字符串对粗筛结果中的纯字符串进行再次确认,得到细筛结果。The text confirmation module uses the hardware hash algorithm to confirm the pure character string in the coarse screening result again according to the pure character string in the pure character string group to obtain the fine screening result. 7.如权利要求6所述的并行正则表达式匹配器,其特征在于,所述Shift-OR算法电路同时对16个字符进行流水线扫描;7. parallel regular expression matcher as claimed in claim 6, is characterized in that, described Shift-OR algorithm circuit carries out pipeline scanning to 16 characters simultaneously; 经过Shift-OR算法电路扫描后,匹配的数据包的待确认字节序号放入文本确认模块的索引数据结构的先入先出队列,供文本确认模块使用;After being scanned by the Shift-OR algorithm circuit, the byte sequence numbers to be confirmed of the matched data packets are put into the FIFO queue of the index data structure of the text confirmation module for use by the text confirmation module; 文本确认模块将索引数据结构的先入先出队列读入,形成并查询细筛模块索引表,以获得匹配的数据包的字节序号后计算哈希值,使用哈希值查找字符串信息表,获得字符串长度和字符串地址;The text confirmation module reads the first-in, first-out queue of the index data structure, forms and queries the index table of the fine sieve module to obtain the byte sequence number of the matching data packet, calculates the hash value, and uses the hash value to find the string information table. Get string length and string address; 由字符串长度和字符串地址生成字符串队列。Generate a string queue from string length and string address. 8.如权利要求7所述的并行正则表达式匹配器,其特征在于,操作调度器负责根据字符串请求信息形成相应的程序,并选择相应的执行引擎模块,每个程序包括指令、操作码和操作数,将每个指令送入请求队列中,供后续执行引擎模块调用。8. parallel regular expression matcher as claimed in claim 7, is characterized in that, operation scheduler is responsible for forming corresponding program according to string request information, and selects corresponding execution engine module, and each program comprises instruction, operation code And operands, each instruction is sent to the request queue for subsequent execution engine module calls. 9.如权利要求8所述的并行正则表达式匹配器,其特征在于,所述执行引擎模块包括字符比较引擎、NFA引擎、报告引擎和报文缓存器;9. parallel regular expression matcher as claimed in claim 8, is characterized in that, described execution engine module comprises character comparison engine, NFA engine, report engine and message buffer; 所述请求队列包括字符串比较请求、正则NFA请求和结果报告请求;The request queue includes a string comparison request, a regular NFA request and a result report request; 所述字符比较引擎根据字符串比较请求对细筛结果进行精确比较,包括进行超长纯字符串匹配的最终逐一比较;The character comparison engine accurately compares the fine sieve results according to the string comparison request, including the final one-by-one comparison of super-long pure string matching; 所述NFA引擎根据正则NFA请求对待匹配报文进行相应的NFA状态机操作,包括根据输入的请求信息,在待匹配报文相应的字段,选择相应的NFA,进行NFA状态机操作,最终生成匹配结果,供后续的报告引擎使用;The NFA engine performs a corresponding NFA state machine operation on the message to be matched according to the regular NFA request, including selecting a corresponding NFA in the corresponding field of the message to be matched according to the input request information, performing the NFA state machine operation, and finally generating a match The results are used by subsequent reporting engines; 所述报告引擎根据结果报告请求输出最终的匹配结果,包括根据字符比较引擎和NFA引擎的输出,计算匹配位置,并提取被匹配的序号,得到匹配结果并输出;The report engine requests the output of the final matching result according to the result report, including according to the output of the character comparison engine and the NFA engine, calculates the matching position, and extracts the matched serial number, obtains the matching result and outputs; 所述报文缓存器用于存储和提供各种执行引擎所需的待匹配报文数据。The message buffer is used to store and provide message data to be matched required by various execution engines. 10.如权利要求9所述的并行正则表达式匹配器,其特征在于,10. The parallel regular expression matcher of claim 9, wherein 字符比较引擎和NFA引擎的数量配比、运行频率依据软件性能评测和硬件电路的物理特性进行权衡;The number ratio and operating frequency of the character comparison engine and NFA engine are weighed according to the software performance evaluation and the physical characteristics of the hardware circuit; 选取典型应用的正则表达式规则集和待匹配报文进行性能预估,确定在一定的匹配概率下字符串预处理模块的规模,确定字符比较引擎和NFA引擎的数量及合适比例;Select the regular expression rule set of typical applications and the packets to be matched for performance estimation, determine the scale of the string preprocessing module under a certain matching probability, and determine the number and appropriate ratio of character comparison engines and NFA engines; 根据硬件实现方式的物理特性,进一步确定各个模块的参数配置。The parameter configuration of each module is further determined according to the physical characteristics of the hardware implementation.
CN202110632853.7A 2021-06-07 2021-06-07 Parallel regular expression matchers Active CN113360726B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110632853.7A CN113360726B (en) 2021-06-07 2021-06-07 Parallel regular expression matchers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110632853.7A CN113360726B (en) 2021-06-07 2021-06-07 Parallel regular expression matchers

Publications (2)

Publication Number Publication Date
CN113360726A true CN113360726A (en) 2021-09-07
CN113360726B CN113360726B (en) 2025-03-11

Family

ID=77532855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110632853.7A Active CN113360726B (en) 2021-06-07 2021-06-07 Parallel regular expression matchers

Country Status (1)

Country Link
CN (1) CN113360726B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118861377A (en) * 2024-07-30 2024-10-29 中国人民解放军国防科技大学 Regular expression matching method and system for synchronous FPGA-CPU architecture

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2380263A1 (en) * 2002-04-03 2003-10-03 El-Abidine Khaldoune Zine A development framework for efficient uniform control of heterogeneous communication networks
CN101201836A (en) * 2007-09-04 2008-06-18 浙江大学 An Acceleration Method for Regular Expression Matching Based on Deterministic Finite Automata with Memory
CN101853301A (en) * 2010-05-25 2010-10-06 华为技术有限公司 Method and system for regular expression matching
CN106776456A (en) * 2017-01-18 2017-05-31 中国人民解放军国防科学技术大学 High-speed regular expression matching hybrid system and method based on FPGA+NPU
CN111177491A (en) * 2019-12-31 2020-05-19 奇安信科技集团股份有限公司 Regular expression matching method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2380263A1 (en) * 2002-04-03 2003-10-03 El-Abidine Khaldoune Zine A development framework for efficient uniform control of heterogeneous communication networks
CN101201836A (en) * 2007-09-04 2008-06-18 浙江大学 An Acceleration Method for Regular Expression Matching Based on Deterministic Finite Automata with Memory
CN101853301A (en) * 2010-05-25 2010-10-06 华为技术有限公司 Method and system for regular expression matching
CN106776456A (en) * 2017-01-18 2017-05-31 中国人民解放军国防科学技术大学 High-speed regular expression matching hybrid system and method based on FPGA+NPU
CN111177491A (en) * 2019-12-31 2020-05-19 奇安信科技集团股份有限公司 Regular expression matching method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张树壮;罗浩;方滨兴;云晓春;: "一种面向网络安全检测的高性能正则表达式匹配算法", 计算机学报, no. 10, 15 October 2010 (2010-10-15) *
高阳阳;徐烈伟;俞剑;许薇;: "一种新型动态可重构的正则表达式匹配引擎设计", 复旦学报(自然科学版), no. 06, 15 December 2019 (2019-12-15) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118861377A (en) * 2024-07-30 2024-10-29 中国人民解放军国防科技大学 Regular expression matching method and system for synchronous FPGA-CPU architecture

Also Published As

Publication number Publication date
CN113360726B (en) 2025-03-11

Similar Documents

Publication Publication Date Title
US12166676B2 (en) Apparatus and method of generating lookups and making decisions for packet modifying and forwarding in a software-defined network engine
US11568674B2 (en) Fast signature scan
US9990583B2 (en) Match engine for detection of multi-pattern rules
US8484147B2 (en) Pattern matching
US10176187B2 (en) Method and apparatus for generating a plurality of indexed data fields
US9983876B2 (en) Non-deterministic finite state machine module for use in a regular expression matching system
CN101095310A (en) Programmable packet parsing processor
EP1872557A2 (en) Apparatus and method for pattern detection
US20190052553A1 (en) Architectures and methods for deep packet inspection using alphabet and bitmap-based compression
US20140317134A1 (en) Multi-stage parallel multi-character string matching device
US8370274B2 (en) Apparatuses and methods for deterministic pattern matching
Najam et al. Speculative parallel pattern matching using stride-k DFA for deep packet inspection
CN113360726B (en) Parallel regular expression matchers
EP3264716B1 (en) State transition compression mechanism to efficiently compress dfa based regular expression signatures
Ngoc et al. Memory-efficient signature matching for ClamAV on FPGA
Mahdinia et al. Attack signature matching using graphics processors in high-performance intrusion detection systems
US11025650B2 (en) Multi-pattern policy detection system and method
US8065322B2 (en) Binary search circuit and method
Pan Efficient network packet signature matching on GPUs
Soewito et al. Hybrid pattern matching for trusted intrusion detection
US10862903B2 (en) State grouping methodologies to compress transitions in a deterministic automata
Cao et al. Frequent statistics of link-layer bit stream data based on AC-IM algorithm
Nguyen et al. BBFex: A Flexible and Efficient FPGA-based Pattern Matching Engine for Large Database

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant