CN100354863C

CN100354863C - Method and system for large scale keyboard matching

Info

Publication number: CN100354863C
Application number: CNB2005100070895A
Authority: CN
Inventors: 刘萍; 谭建龙; 程学旗
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2005-02-03
Filing date: 2005-02-03
Publication date: 2007-12-12
Anticipated expiration: 2025-02-03
Also published as: CN1648901A

Abstract

The present invention provides methods and systems for large-scale keyword matching. According to the provided method and system, firstly, the given keyword set is normalized, and an optimal grouping and intra-group best matching method is solved on the normalized keyword set (or directly on the original keyword set). The process can use two mechanisms: one is to use the dynamic programming method to calculate an optimal grouping, and divide the given keyword set into several groups according to this result; then, for each group, obtain a The best matching method; one is to establish a directed graph with weights on the edges through training, and solve the shortest path of this graph to obtain the optimal grouping and best matching method within the group; then use the training results for all groups The scanning automaton is constructed sequentially to form a scanning automaton sequence, so that the input texts to be scanned pass through sequentially, and the final scanning result is obtained.

Description

Method and system for large-scale keyword matching

技术领域technical field

本发明涉及文本处理技术领域，特别是一种大规模关键词匹配方法和系统。The invention relates to the technical field of text processing, in particular to a large-scale keyword matching method and system.

背景技术Background technique

多关键词匹配的技术已经比较成熟，而且广泛的用于文本处理、内容过滤的各个方面。传统的多关键词匹配算法是将待扫描的文本看成是一维的字符串，充分利用已知关键词串的特征，在扫描过程中尽量向前跳跃，以提高匹配的性能。多关键词匹配算法根据对关键词预处理方法的不同，可以分为三种形式：前缀模式(KMP、AC、Shift-AND、Shift-Or等算法)、后缀模式(Boyer-Moore、Wu-Manber等算法)、子串模式(BDM、BOM、SBDM、SBOM等算法)。多关键词匹配算法的性能主要受以下三个方面的影响：关键词数量、关键词的最小长度、字符集。此外，匹配速度也和待扫描文本中出现关键词的次数有关系。The technology of multi-keyword matching is relatively mature, and it is widely used in various aspects of text processing and content filtering. The traditional multi-keyword matching algorithm regards the text to be scanned as a one-dimensional string, makes full use of the characteristics of the known keyword string, and jumps forward as much as possible during the scanning process to improve the matching performance. Multi-keyword matching algorithms can be divided into three forms according to different preprocessing methods for keywords: prefix mode (KMP, AC, Shift-AND, Shift-Or and other algorithms), suffix mode (Boyer-Moore, Wu-Manber and other algorithms), substring mode (BDM, BOM, SBDM, SBOM and other algorithms). The performance of the multi-keyword matching algorithm is mainly affected by the following three aspects: the number of keywords, the minimum length of keywords, and the character set. In addition, the matching speed is also related to the number of keywords appearing in the text to be scanned.

为了不断的提高关键词匹配的性能，出现了很多新的方法，但都是在对关键词本身的预处理上进行改进，即：尽可能多的利用关键词串的特征，寻求新的数据结构来存储特征和由特征计算出来的跳跃量，改进跳跃方式等，这样的改进方法对匹配速度的提高很有限，通常可以提高20％-40％左右。In order to continuously improve the performance of keyword matching, many new methods have emerged, but they are all improved on the preprocessing of the keyword itself, that is, to use the characteristics of the keyword string as much as possible to find a new data structure To store the features and the jump amount calculated by the features, improve the jump method, etc., such an improvement method has a limited improvement in the matching speed, which can usually be increased by about 20%-40%.

随着计算机应用和网络应用的普及，数据处理量日益增大。尤其是在网络应用环境中，存在大量的实时数据处理的需求，例如：垃圾邮件的实时过滤，网络内容安全等。在这些应用中，系统因为用户使用习惯和数据处理量的不断增大，关键词数量也会不断的增大，规模常常达到上万级。这时，传统的匹配技术的速度会明显的急剧下降，已经不能很好的满足应用需求，尤其是实时数据处理的需求。With the popularization of computer applications and network applications, the amount of data processing is increasing day by day. Especially in the network application environment, there is a large amount of real-time data processing requirements, such as: real-time filtering of spam, network content security and so on. In these applications, due to the continuous increase of user usage habits and data processing volume, the number of keywords will also continue to increase, and the scale often reaches tens of thousands. At this time, the speed of the traditional matching technology will obviously drop sharply, and it can no longer meet the application requirements, especially the real-time data processing requirements.

发明内容Contents of the invention

为了满足大规模的快速关键词匹配的需求，本发明提供了一种针对大规模的关键词匹配方法，包括步骤：将关键词按照长度进行个数统计及排序，对关键词集合进行规范化；使用动态规划方法或者最短路径方法，对关键词集合进行最优分组并寻找最佳匹配方法；依次对每一组关键词，使用所述最佳匹配方法构建一个扫描自动机，形成一个扫描自动机的序列；使用所述形成的扫描自动机序列，对输入的文本进行扫描匹配，将结果存储在指定的内存结构或者外部文件中。In order to meet the needs of large-scale fast keyword matching, the present invention provides a method for large-scale keyword matching, comprising the steps of: counting and sorting the keywords according to their length, and standardizing the keyword set; Dynamic programming method or shortest path method, carry out optimal grouping to keyword set and find best matching method; For each group of keywords in turn, use described best matching method to construct a scanning automaton, form the scanning automaton Sequence: use the formed scanning automaton sequence to scan and match the input text, and store the result in the specified memory structure or external file.

为了实现以上目的，本发明还提供了一种大规模关键词匹配的系统(图1)，包括：用于规范化给定关键词的装置；用于寻找最优分组和最佳匹配方法的装置，提供两种机制：一是动态规划机制(图2、3)，二是最短路径机制(图4、5)，此装置将结果以配置文件的形式存储的装置；用于读取最佳结果，并对每个分组创建扫描自动机的装置；用于最终扫描，并将结果存储在指定内存结构或者文件中的装置。In order to achieve the above object, the present invention also provides a large-scale keyword matching system (Fig. 1), comprising: a device for normalizing a given keyword; a device for finding optimal grouping and best matching method, Two mechanisms are provided: one is the dynamic programming mechanism (Figure 2, 3), and the other is the shortest path mechanism (Figure 4, 5). This device stores the results in the form of configuration files; it is used to read the best results, And create a device for scanning automata for each group; a device for final scanning and storing the results in a specified memory structure or file.

本发明尤其涉及基于内容的文本过滤和网络内容安全。More particularly, the invention relates to content-based text filtering and web content security.

在大规模关键词匹配方法中，最核心的是怎样求解一个最优的分组，最短路径方法是其中的一种方法，另外一种是动态规划的方法。In the large-scale keyword matching method, the core is how to solve an optimal grouping, the shortest path method is one of the methods, and the other is the dynamic programming method.

本发明解决了针对过滤中关键词大规模(通常关键词数量在5000以上)关键词快速匹配的问题，特别适用于实时网络数据的处理。实验证明，使用本发明给出的系统，平均可以使关键词匹配的速度提高1倍。The invention solves the problem of fast matching of keywords in a large scale (usually more than 5000 keywords) in filtering, and is especially suitable for processing real-time network data. Experiments have proved that the speed of keyword matching can be doubled on average by using the system provided by the present invention.

附图说明Description of drawings

图1是本发明的大规模关键词匹配的系统示意图。FIG. 1 is a schematic diagram of the large-scale keyword matching system of the present invention.

图2是本发明的动态规划机制求解最优分组的示意图。Fig. 2 is a schematic diagram of solving optimal grouping by the dynamic programming mechanism of the present invention.

图3是本发明的动态规划机制求解最优分组方法的实现流程图。Fig. 3 is a flow chart of the realization of the method for solving the optimal grouping by the dynamic programming mechanism of the present invention.

图4是本发明的最短路径机制求解最优分组的示意图。Fig. 4 is a schematic diagram of solving the optimal grouping by the shortest path mechanism of the present invention.

图5是本发明的最短路径机制求解最优分组方法的实现流程图。Fig. 5 is a flow chart of the implementation of the method for solving the optimal grouping based on the shortest path mechanism of the present invention.

具体实施方式Detailed ways

如图1所示，本发明的系统包括：As shown in Figure 1, the system of the present invention includes:

装置(1)：规范化关键词装置，作用是：对给定的大量关键词，按照长度进行个数的统计，然后按照长度排序；Device (1): a device for standardizing keywords, which is used to count a large number of given keywords according to their length, and then sort them according to their length;

装置(2)：求解最优分组和最佳匹配方法的装置，作用是：可以使用两种机制求解最优分组：一是采用动态规划机制得到最优分组，然后通过训练的方法得到每组的最佳匹配方法；另一种是使用最短路径机制直接得到分组和每一组的最佳匹配方法；此装置最终以配置文件的方式将分组和最佳匹配方法的结果存储在文件中；Device (2): The device for solving the optimal grouping and the best matching method. Its function is: two mechanisms can be used to solve the optimal grouping: one is to use the dynamic programming mechanism to obtain the optimal grouping, and then obtain the optimal grouping of each group through training. The best matching method; the other is to use the shortest path mechanism to directly obtain the grouping and the best matching method of each group; this device finally stores the results of the grouping and the best matching method in the file in the form of a configuration file;

装置(3)：建立扫描自动机的装置，作用是：读取配置文件，采用训练后的结果，依次为每一个分组的关键词建立扫描自动机；Device (3): a device for setting up a scanning automaton, the effect is: read the configuration file, adopt the result after training, and set up a scanning automaton for each grouped keyword in turn;

装置(4)扫描装置，作用是：使用装置(3)中建立的一组扫描自动机，对输入的文本进行扫描匹配，将结果存储在指定的内存结构或者外部文件中。The device (4) scanning device has the function of: using a group of scanning automata established in the device (3) to scan and match the input text, and store the result in a specified memory structure or an external file.

各装置的详细操作将在下面分别详细描述。The detailed operation of each device will be described in detail respectively below.

1.规范化装置1. Normalization device

规范化装置是将给定的一组关键词，首先按照长度排序，可以从小到大或者从大到小，然后统计相同长度的关键词个数。The normalization device is to sort a given set of keywords according to length first, which can be from small to large or from large to small, and then count the number of keywords with the same length.

定义一组关键词K＝{K₁，K₂，K₃，...，K_n}，对应的长度L＝{l₁，l₂，l₃，...，l_n}。规范化的过程首先将K排序成为：K’＝{K₁’，K₂’，K₃’，...，K_n’}，使得对应长度L’＝{l₁’，l₂’，l₃’，...，l_n’}满足：l₁’＜＝l₂’＜＝l₃’＜＝...＜＝l_n’或l₁’＞＝l₂’＞＝l₃’＞＝...＞＝l_n’。然后对L’进行统计，计算相同长度的个数，得到统计值序列LNn:₁，n₂，n₃，...，n_m.其中，m是关键词的最大长度；n_i，1＜＝i＜＝m表示长度为i的关键词的个数。A set of keywords K={K ₁ , K ₂ , K ₃ , . . . , K _n } is defined, and the corresponding length L={l ₁ , l ₂ , l ₃ , . . . , l _n }. The normalization process first sorts K into: K'={K ₁ ', K ₂ ', K ₃ ', ..., K _n '}, so that the corresponding length L'={l ₁ ', l ₂ ', l ₃ ', ..., l _n '} satisfy: l ₁ '<=l ₂ '<=l ₃ '<=...<=l _n 'or l ₁ '>=l ₂ '>=l ₃ ' ＞＝...＞＝ l _n '. Then L' is counted, the number of the same length is calculated, and the statistical value sequence LNn: ₁ , n ₂ , n ₃ ,..., n _m is obtained. Among them, m is the maximum length of the keyword; n _i , 1<=i<=m represents the number of keywords whose length is i.

2.求解最优分组和最佳匹配方法的装置2. A device for solving optimal grouping and optimal matching methods

此装置的目的是对规范化后的关键词集合求解一个最优的分组，并且在每个分组上使用最适合的一种匹配方法，从而使整个集合的匹配速度达到最快。为了达到这个目的，此装置可以通过两种机制实现：一是动态规划机制；一是最短路径机制。下面分别进行描述。The purpose of this device is to solve an optimal grouping for the normalized keyword set, and use the most suitable matching method on each grouping, so as to make the matching speed of the whole set reach the fastest. In order to achieve this goal, the device can be realized through two mechanisms: one is the dynamic programming mechanism; the other is the shortest path mechanism. Described below respectively.

2.1动态规划机制2.1 Dynamic programming mechanism

使用动态规划机制分成四个步骤：定义评价函数步骤、分组步骤、训练步骤和存储配置信息步骤。在定义评价函数步骤中，给定一个函数，它是与关键词个数和长度相关的，在分组步骤中使用；在分组步骤中，使用动态规划的方法，利用规范化模块中计算的统计信息，求解一个对给定关键词集合的最优分组方案；训练步骤中，对每一个分组中的关键词，寻找最适合的匹配方法；在存储配置信息步骤中，将分组位置信息和分组内的最佳匹配方法信息记录在磁盘文件中，供最终的扫描自动机读取使用。The dynamic programming mechanism is divided into four steps: defining the evaluation function step, grouping step, training step and storing configuration information step. In the step of defining the evaluation function, a function is given, which is related to the number and length of keywords, and is used in the grouping step; in the grouping step, the dynamic programming method is used to utilize the statistical information calculated in the normalization module, Solve an optimal grouping scheme for a given keyword set; in the training step, find the most suitable matching method for the keywords in each group; in the step of storing configuration information, combine the grouping position information and the most The best matching method information is recorded in a disk file for the final scanning robot to read and use.

(1)、定义评价函数步骤(图3中step1)(1), define the evaluation function steps (step1 in Figure 3)

根据对传统的关键词匹配算法的分析，我们认为关键词匹配时间在字符集一定的情况下，和关键词的数量成正比，和关键词的最小长度成反比，即：关键词数量越大，匹配的时间就越长；关键词的最小长度越大，匹配的时间就越短。更进一步，如果用F(K)表示文本通过关键词集合K的时间，用G(|K|)表示与关键词集合K的个数的影响关系，用L_min(K)表示与关键词最小长度的影响关系，我们可以将它们的关系表述成：According to the analysis of the traditional keyword matching algorithm, we believe that the keyword matching time is proportional to the number of keywords and inversely proportional to the minimum length of keywords under the condition of a certain character set, that is, the larger the number of keywords, The longer the matching time; the greater the minimum length of the keyword, the shorter the matching time. Furthermore, if F(K) is used to indicate the time when the text passes through the keyword set K, G(|K|) is used to indicate the influence relationship with the number of keyword set K, and L _min (K) is used to indicate the minimum The influence relationship of length, we can express their relationship as:

$F (K) = G (| K |) \times L_{\min} (K) = \frac{\sqrt{| K |}}{\min (K)}$ (公式1) $f (K) = G (| K |) \times L_{\min} (K) = \frac{\sqrt{| K |}}{\min (K)}$ (Formula 1)

即：匹配的时间与关键词个数的开根成正比，与最小长度(min(K))成反比。That is: the matching time is proportional to the root of the number of keywords, and inversely proportional to the minimum length (min(K)).

(2)、分组步骤(图3中step2)(2), grouping step (step2 among Fig. 3)

该分组方法步骤如下：The grouping method steps are as follows:

下面我们描述怎样使用动态规划的方法求解一个最优分组。Below we describe how to use dynamic programming to solve an optimal grouping.

动态规划的基本思想是将待求解问题分解成若干个子问题，先求解子问题，然后从这些子问题的解得到原问题的解。它通常可以按照以下几个步骤进行：The basic idea of dynamic programming is to decompose the problem to be solved into several sub-problems, first solve the sub-problems, and then obtain the solution of the original problem from the solutions of these sub-problems. It can usually be done in the following steps:

a、找出最优解的性质，并刻画其结果特征；a. Find out the nature of the optimal solution and characterize its result characteristics;

b、递归的定义最优值；b. Recursively define the optimal value;

c、以自底向上的方式计算出最优值；c. Calculate the optimal value in a bottom-up manner;

d、根据计算最优值时得到的信息，构造一个最优解。d. According to the information obtained when calculating the optimal value, construct an optimal solution.

我们的问题是：找到一种对集合K的分组，使F(K)值最小。最优解的性质见公式1的表述。为了求最优解，实质上，我们是要找一个使F(K)最小的分组。Our problem is: find a grouping of the set K that minimizes the value of F(K). See the expression of formula 1 for the nature of the optimal solution. In order to find the optimal solution, in essence, we are looking for a group that minimizes F(K).

记对一个有序集合K的分组为F[1:n]，如果第一个分组的位置在k处将集合分开，则有F[1:n]＝F[1:k]+F[k+1:n]，对F[k+1:n]可以依次类推。对于分组求解的递归式为：Note that the grouping of an ordered set K is F[1:n], if the position of the first grouping separates the set at k, then there is F[1:n]=F[1:k]+F[k +1:n], and so on for F[k+1:n]. The recursive formula for grouping solution is:

$F f [[k k : : n no]] = = min min \{\begin{matrix} \frac{G G ((p p))}{K K} + + F f [[k k + + p p : : n no]],, 11 \leq \leq p p \leq \leq n no - - k k \\ \frac{G G ((n no - - k k + + 11))}{K K} \end{matrix},, 11 \leq \leq k k \leq \leq n no$

计算过程：从F[n:n]开始计算，F[n-1:n]，F[n-2:n]，...，一直到F[1:n]为止。计算中，使用表来存储中间的计算结果F[k:n]，利于后面计算的查找，这也是动态规划算法的核心思想。同时，要使用数组position存储F[k:n]取值处的位置信息。Calculation process: start calculation from F[n:n], F[n-1:n], F[n-2:n], ..., until F[1:n]. In the calculation, the table is used to store the intermediate calculation result F[k:n], which is beneficial to the search of the subsequent calculation, which is also the core idea of the dynamic programming algorithm. At the same time, use the array position to store the position information at the value of F[k:n].

回溯过程：当F[1:n]的值得到时，计算过程结束，然后由position的值开始进行回溯得到最终解。首先取position[1]的值，它表示对于F[1:n]的分组位置，然后取position[position[1]]，它表示下一个位置，直到position[n]为止，这样得到的位置序列就是一个最优的分组位置。Backtracking process: When the value of F[1:n] is obtained, the calculation process ends, and then backtracking starts from the value of position to get the final solution. First take the value of position[1], which represents the grouping position for F[1:n], and then take position[position[1]], which represents the next position until position[n], so the obtained position sequence It is an optimal grouping position.

使用动态规划方法对长度为n的序列进行分组，它的时间复杂度为O(n²)，空间复杂度为O(n)。Use the dynamic programming method to group the sequence of length n, its time complexity is O(n ² ), and its space complexity is O(n).

(3)、训练步骤(图3中step3)(3), training steps (step3 in Figure 3)

分组完成之后，对于每个分组的关键词使用哪种合适的匹配方法，本发明采用通过训练的方法得到。在前面技术背景中我们对成熟的关键词匹配技术进行了说明，我们可以选择其中几种来进行训练。在本发明的系统中，我们选用了BOM、Wu-Manber、AC三种方法。训练文本的使用根据不同字符集大小生成的随机数据文件。After the grouping is completed, which suitable matching method to use for the keywords of each group is obtained by means of training in the present invention. In the previous technical background, we have explained the mature keyword matching technology, and we can choose several of them for training. In the system of the present invention, we have selected three methods of BOM, Wu-Manber and AC. The training text uses random data files generated according to different charset sizes.

对分组后的每一组：依次使用训练选用匹配方法A₁，A₂，..，A_p中每一种，对训练文本进行扫描匹配，记录扫描完成所需要的时间Ti，1＜＝i＜＝p。最终计算{T₁，T₂，..，T_p}中的最小值T_j，则对此分组，使用第j种匹配方法。依次训练所有的分组，并记录每个分组的最佳匹配方法。For each group after grouping: use each of the matching methods A ₁ , A ₂ , .., A _p for training in turn, scan and match the training text, and record the time Ti required for the completion of the scan, 1<=i <=p. Finally calculate the minimum value T _j among {T ₁ , T ₂ , .., T _p }, then group this and use the jth matching method. Train all groups in turn, and record the best matching method for each group.

(4)、存储配置信息步骤(图3中step4)(4), the step of storing configuration information (step4 in Figure 3)

训练完成后，将分组结果和训练结果按照指定的格式写入系统的配置文件中，以便后面的建立扫描自动机装置读取使用。本发明的系统中，采用的是整数序列的方式写入配置文件，第一行表示分组结果，第二行表示匹配方法的训练结果。整数中间使用tab键分隔开。After the training is completed, the grouping results and training results are written into the configuration file of the system according to the specified format, so that the subsequent scanning automaton device can be read and used. In the system of the present invention, the configuration file is written in the form of an integer sequence, the first line represents the grouping result, and the second line represents the training result of the matching method. Integers are separated by the tab key.

例如：For example:

3 14 403 14 40

1 3 21 3 2

第一行表示最终分成3组，长度＜＝3的为一组，长度为4-14的为一组，长度为15-40的为一组；The first line indicates that they are finally divided into 3 groups, the length <= 3 is a group, the length is 4-14 is a group, and the length is 15-40 is a group;

第二行表示第一组关键词匹配使用方法1，第二组使用方法3，第三组使用方法2。具体方法1、2、3代表哪种方法，系统内部自己约定。The second line indicates that the first group of keyword matches uses method 1, the second group uses method 3, and the third group uses method 2. The specific methods 1, 2, and 3 represent which methods are determined by the system itself.

使用动态规划机制的示意图见附图2。图中，没有给出上述步骤1的示例。最顶端给出以长度表示的一组关键词，用虚线表示对它们使用动态规划求出的分组情况，这对应了上述的分组步骤；下面，对于每一个分组中的关键词，使用一个训练文本，在3个候选的扫描匹配自动机中，训练得到一个最佳的自动机，这对应了上述的训练步骤；最后，将分组的信息、每一个分组使用的最佳扫描自动机信息存储在外部存储设备中，这对应了上述的存储配置信息步骤。相应的实现流程见附图3。See Figure 2 for a schematic diagram of using the dynamic programming mechanism. In the figure, the example of the above step 1 is not given. At the top is a group of keywords represented by length, and a dotted line is used to indicate the grouping of them using dynamic programming, which corresponds to the above grouping steps; below, for each keyword in the grouping, a training text is used , among the three candidate scan-matching automata, train to get the best automaton, which corresponds to the above-mentioned training steps; finally, store the grouping information and the best scanning automaton information used by each group in the external In the storage device, this corresponds to the above-mentioned step of storing configuration information. The corresponding implementation process is shown in Figure 3.

2.2最短路径机制2.2 Shortest path mechanism

最短路径机制来源于求解一个图的最短路径技术。针对大规模关键词匹配的问题，我们做如下的定义：The shortest path mechanism is derived from the shortest path technique for solving a graph. For the problem of large-scale keyword matching, we make the following definitions:

定义1：将规范化后的关键词集合中，相同长度的关键词形成一个组，作为图的一个点，记为：N_i，i表示此点的关键词长度。Definition 1: In the normalized keyword set, keywords of the same length form a group, which is regarded as a point in the graph, and is recorded as: N _i , where i represents the length of the keyword at this point.

定义2：从点Ni到Nj之间的有向边表示：将长度为i到长度为j-1的所有关键词组成一组，记为：Bij。Definition 2: Directed edge representation from point Ni to Nj: All keywords with length i to j-1 are grouped together, denoted as: Bij.

定义3：Bij上的权重表示：由长度为[i，j]的关键词组成的扫描自动机对训练文本扫描一遍的时间，记为：T_ij。Definition 3: The weight on Bij means: the time for a scanning automaton composed of keywords with length [i, j] to scan the training text once, denoted as: T _ij .

定义4：对于给定的一组关键词和一个训练文本，如果关键词的最小长度为p，最大长度为q，p＜＝q，则它对应的一个有向图为由点集合{N_p，N_p+1，...，N_q，N_q+1}和有向边集合{B_ij}，p＜＝i＜＝q，p＜j＜＝q+1，组成的一个有向图。Definition 4: For a given set of keywords and a training text, if the minimum length of keywords is p and the maximum length is q, p<=q, then a directed graph corresponding to it is a set of points {N _p , N _p+1 ,..., N _q , N _q+1 } and the directed edge set {B _ij }, p<=i<=q, p<j<=q+1, composed of a directed picture.

根据如上的定义，我们对给定的关键词集合建立了有向图，然后可以根据给定的训练文本求出每条边的权重，最后求出图的最短路径，即最优的分组。有向图上求解一个最短路径机制分为以下几步：According to the above definition, we have established a directed graph for a given keyword set, and then we can calculate the weight of each edge according to the given training text, and finally find the shortest path of the graph, that is, the optimal grouping. Solving a shortest path mechanism on a directed graph is divided into the following steps:

(1)、将规范化的关键词集合表示成如上所定义的有向图；(图5中的step1)(1), the keyword set of normalization is expressed as the directed graph defined above; (step1 among Fig. 5)

(2)、计算每条边上的权重。在计算的过程中，对每一边使用设定的多种匹配方法计算，最终取最小值(扫描时间最短)作为这条边的权重，并记录此边使用的匹配方法；(图5中的step2)(2) Calculate the weight of each edge. In the calculation process, use the set multiple matching methods to calculate each side, and finally take the minimum value (the shortest scanning time) as the weight of this side, and record the matching method used by this side; (step2 in Figure 5 )

(3)、求出有向图的最短路径，得到最优分组，同时可以得到每一组对应的最佳匹配方法；(图5中的step3)(3), find the shortest path of the directed graph, obtain optimal grouping, and can obtain the best matching method corresponding to each group simultaneously; (step3 among Fig. 5)

(4)、将最终的结果存储在系统的配置文件中。配置文件格式同(4) Store the final result in the configuration file of the system. The configuration file format is the same as

4.2.1中存储配置信息步骤中所述。(图5中的step4)As described in the step of storing configuration information in 4.2.1. (step4 in Figure 5)

附图4所示的是一个对给定关键词集合求解最优分组的示意图。图中假设关键词长度为2-7，则根据定义1，有向图有标号为2-8的7个节点，根据定义2每个节点有指向标号大于本身的有向边，根据定义3可以计算出每条边上的权重(图中没有具体标出)，然后可以对此图求解它的最短路径。如果最短路径为2-6-8，则分组情况为：长度为2-5的为一组，长度为6-7的为一组。Figure 4 is a schematic diagram of solving the optimal grouping for a given keyword set. Assuming that the keyword length in the figure is 2-7, according to definition 1, the directed graph has 7 nodes labeled 2-8, according to definition 2, each node has a directed edge with a label greater than itself, according to definition 3, it can be Calculate the weight on each edge (not specifically marked in the figure), and then solve the shortest path for this graph. If the shortest path is 2-6-8, the grouping situation is: the length of 2-5 is a group, and the length of 6-7 is a group.

使用最短路径机制，设关键词长度的个数为n，即有向图中节点的个数n，则单纯计算有向图的最短路径的时间复杂度是O(n²)。如果考虑在计算边的权重过程中带有多个匹配方法的训练过程，设选取的匹配方法有m种，则最短路径机制的时间复杂度为O(n²*m)。作为系统的初始化阶段，这是可以接受的。Using the shortest path mechanism, assuming that the number of key words is n, that is, the number of nodes in the directed graph is n, then the time complexity of simply calculating the shortest path of the directed graph is O(n ² ). If considering the training process with multiple matching methods in the process of calculating the weight of edges, assuming that there are m kinds of matching methods selected, the time complexity of the shortest path mechanism is O(n ² *m). This is acceptable as an initialization phase of the system.

3.建立扫描自动机装置3. Establish a scanning robot device

读取配置信息建立扫描自动机的装置：根据系统的配置信息，读取分组的位置(长度分隔位置)，然后将相应长度的关键词组成一组，并根据配置信息中记录的本组内最佳的扫描匹配自动机来构造自动机，最终，对原始的关键词集合构造成一个由多个自动机组成的自动机序列。A device for reading configuration information and establishing a scanning automaton: According to the configuration information of the system, read the position of the group (length-separated position), and then form a group of keywords of the corresponding length, and according to the most The optimal scan-matching automaton is used to construct the automaton, and finally, an automaton sequence composed of multiple automata is constructed for the original keyword set.

建立扫描自动机装置读取系统的配置，首先根据分组的情况，对原始的关键词集合进行分组，对每一个组，使用相同的数据结构，存储组内的关键词、关键词的长度、关键词原索引号等信息；然后再根据训练的情况，分别建立不同匹配方法的扫描自动机。系统存储每个扫描自动机的入口地址，以便后面的扫描装置直接使用。To establish the configuration of the scanning automaton device reading system, first, group the original keyword set according to the grouping situation, and use the same data structure for each group to store the keywords in the group, the length of the keyword, the key Information such as the original index number of the word; and then according to the training situation, respectively establish scanning automata with different matching methods. The system stores the entry address of each scanning robot for direct use by subsequent scanning devices.

关于怎样建立扫描自动机，不属于本发明的范围，此处不详细叙述。How to set up the scanning automaton does not belong to the scope of the present invention, and will not be described in detail here.

4.扫描装置4. Scanning device

扫描装置读取外界输入的文本数据，这个数据可能是本地磁盘上存储的文件，也可以是网络上传输过来的各种数据。文本数据依次通过系统的各个扫描自动机，当有匹配成功的关键词出现时，系统将记录它们的索引号、出现位置等信息，可以在内存的相应机结构中统计这些信息，供外部的其它应用系统使用，也可以直接将它们存储在磁盘文件中。The scanning device reads text data input from the outside world, which may be files stored on the local disk, or various data transmitted over the network. The text data passes through each scanning automaton of the system in turn. When a keyword that matches successfully appears, the system will record their index number, occurrence location and other information, which can be counted in the corresponding machine structure of the memory for other external The application system can also store them directly in disk files.

5.积极效果5. Positive effect

采用以上的处理，我们可以提高大规模关键词匹配的速度。使用动态规划的方法对关键词进行分组，可以保证在系统设定的匹配速度评价函数下，这种分组是理论最优的；对于每个分组，因为组内关键词个数不同，最小长度也不相同，所以系统采用训练的方法找到最佳的匹配方法。使用最短路径机制，将寻找最优分组和最佳匹配方法结合起来，可以在实际系统运行中，保证分组实际运行速度最快。计算最优分组和寻找最佳匹配的过程时间复杂度稍差，但是因为它们都在系统的初始化部分做，所以不影响最终的扫描匹配速度。With the above processing, we can increase the speed of large-scale keyword matching. Using the dynamic programming method to group keywords can ensure that this grouping is theoretically optimal under the matching speed evaluation function set by the system; for each group, because the number of keywords in the group is different, the minimum length is also are not the same, so the system uses the training method to find the best matching method. Using the shortest path mechanism, combining the search for optimal grouping and the best matching method, can ensure the fastest actual grouping speed in actual system operation. The time complexity of calculating the optimal grouping and finding the best matching process is slightly lower, but because they are all done in the initialization part of the system, it does not affect the final scan matching speed.

分别使用两种不同的分组机制，我们可以保证在给定关键词集合和给定的训练文本下，得到一种扫描速度最快的分组。通过建立一个扫描自动机的序列，我们解决了大规模关键词匹配速度下降严重的问题。试验证明：同样条件下，使用本发明的方法和系统，与传统的成熟的多关键词匹配方法(AC、Wu-Manber、SBOM)相比，匹配速度是最快的单个匹配方法的2倍，是最慢的单个匹配方法的4倍。By using two different grouping mechanisms, we can guarantee to get a group with the fastest scanning speed under a given set of keywords and a given training text. By building a sequence of scanning automatons, we solve the problem of severe slowdown of large-scale keyword matching. The test proves: under the same conditions, using the method and system of the present invention, compared with traditional mature multi-keyword matching methods (AC, Wu-Manber, SBOM), the matching speed is 2 times that of the fastest single matching method, 4 times slower than the slowest single matching method.

Claims

1. A method for large scale keyword matching, the method comprising the steps of:

counting and sequencing the number of the keywords according to the length, and standardizing a keyword set;

optimally grouping the keyword set and searching for an optimal matching method by using a dynamic programming method or a shortest path method;

sequentially constructing a scanning automaton for each group of keywords by using the optimal matching method to form a sequence of the scanning automaton;

and scanning and matching the input text by using the formed scanning automaton sequence, and storing the result in a specified memory structure or an external file.

2. The method for large scale keyword matching according to claim 1, wherein the dynamic programming method for optimally grouping the keyword sets and finding the best matching method comprises: defining an evaluation function, grouping, training and storing configuration information; wherein,

the step of defining the evaluation function specifically comprises: according to the analysis of the traditional keyword matching algorithm, the keyword matching time is in direct proportion to the number of keywords and in inverse proportion to the minimum length of the keywords under the condition that a character set is certain, if F (K) is used for representing the time of a text passing through a keyword set K, G (| K |) is used for representing the influence relation with the number of the keyword set K, and L is used for representing the influence relation with the number of the keyword set K_min(K) Representing the influence relationship with the minimum length of the keywords, their relationship is expressed as:

F (K) = G (| K |) \times L_{\min} (K) = \frac{\sqrt{| K |}}{\min (K)}

equation 1

Namely: the matching time is in direct proportion to the root of the number of the keywords and in inverse proportion to the minimum length min (K);

the grouping step specifically includes the steps of: solving an optimal grouping by using a dynamic programming method, solving sub-problems, and then obtaining a solution of an original problem from the solutions of the sub-problems, which can be generally performed according to the following steps: a. finding out the property of the optimal solution and depicting the result characteristics; b. a recursive defined optimum; c. calculating an optimal value in a bottom-up manner; d. constructing an optimal solution according to information obtained when the optimal value is calculated; finding a grouping of K sets that minimizes the value of F (K), finding a grouping that minimizes F (K) for an optimal solution, noting F [1: n ] for an ordered set K, if the position of the first grouping separates the sets at K, then F [1: n ] ═ F [1: K ] + F [ K +1: n ], and so on for F [ K +1: n ], the recursion for the grouping solution is:

F [k : n] = \min \begin{matrix} \{\begin{matrix} \frac{G (p)}{K} + F [k + p : n], 1 \leq p \leq n - k \\ \frac{G (n - k + 1)}{K} \end{matrix}, 1 \leq \end{matrix} k \leq n

and (3) calculating: starting calculation from Fn to n, Fn-1 to n, Fn-2 to n, until F1 to n, in the calculation, using table to store intermediate calculation result Fk to n, beneficial to searching later calculation, at the same time using array position to store position information of F k to n value;

a backtracking process: when the value of F [1: n ] is up, the calculation process is finished, and then backtracking is carried out from the value of position to obtain a final solution;

the training step specifically comprises: after grouping is completed, a matching method for the use of each grouped keyword is obtained by adopting a training method, three methods of BOM, Wu-Manber and AC are selected, a random data file generated according to different character set sizes is used for training a text, and for each grouped keyword: selecting and matching method A by using training in sequence₁，A₂，...，A_pIn each of the above methods, scanning and matching the training text, recording the time Ti required for scanning, 1 & lti & gt & lti & ltp & gt, and finally calculating { T [ ]₁，T₂，..，T_pMinimum value T in_jThen, for the group, using the jth matching method, training all the groups in turn, and recording the optimal matching method of each group;

the step of storing configuration information specifically comprises: after training, writing the grouping result and the training result into a configuration file of a system according to a specified format so as to facilitate the subsequent establishment of a scanning automatic machine for reading and use; and writing the configuration file in an integer sequence mode, wherein the first row represents a grouping result, the second row represents a training result of the matching method, and the integers are separated by tab keys.

3. A system for large scale keyword matching, comprising:

the normalized keyword device is used for counting the number of given keywords according to the length and then sorting the keywords according to the length;

apparatus for solving optimal grouping and optimal matching methods for solving optimal grouping using two mechanisms: firstly, obtaining an optimal grouping by adopting a dynamic programming mechanism, and then obtaining an optimal matching method of each group by a training method; the other method is to directly obtain the best matching method of the grouping and each group by using a shortest path mechanism; the device finally stores the grouping and the result of the best matching method in a file in the form of a configuration file;

a device for establishing the scanning automata, which is used for reading the configuration file and establishing the scanning automata for each grouped keyword in sequence by adopting the trained result;

and the scanning device is used for scanning and matching the input text by using a group of scanning automata established in the device for establishing the scanning automata and storing the result in a specified memory structure or an external file.

4. The large-scale keyword matching system of claim 3, wherein the normalized keyword means is configured to sort a given set of keywords by length, and count the number of keywords with the same length, and define a set of keywords K ═ { K ═ K [, where the keywords are sorted from small to large or from large to small, and then count the number of keywords with the same length₁，K₂，K₃，...，K_nL, corresponding length L ═ L₁，l₂，l₃，...，l_nThe normalization process first orders K into: k ═ K₁’，K₂’，K₃’，...，K_n', such that the corresponding length L' ═ L₁’，l₂’，l₃’，...，l_n' } satisfies: l₁’＜＝l₂’＜＝l₃’＜＝...＜＝l_n' or l₁’＞＝l₂’＞＝l₃’＞＝...＞＝l_n'then counting L', and calculating the number of the same length to obtain a statistical value sequence LN: n is₁，n₂，n₃，...，n_mWherein m is the maximum length of the keyword; n is_iAnd l < i < m represents the number of keywords having a length i.

5. The large-scale keyword matching system of claim 3, wherein the means for establishing a scanning automaton: reading the grouped positions according to the configuration information of the system, then combining the keywords with corresponding lengths into a group, and constructing an automatic machine according to the optimal scanning matching automatic machine in the group recorded in the configuration information, and finally constructing an automatic machine sequence consisting of a plurality of automatic machines for the original keyword set.

6. The large-scale keyword matching system according to claim 3, wherein the scanning device reads externally inputted text data, the data is files stored on a local disk or various data transmitted on a network, the text data sequentially passes through the scanning automata of the system, when the successfully matched keywords appear, the system records the index numbers and the appearance position information of the keywords, and then counts the information in the corresponding machine structure of the memory for use by an external application system or directly stores the information in the disk file.