CN107220333B

CN107220333B - A Character Search Method Based on Sunday Algorithm

Info

Publication number: CN107220333B
Application number: CN201710375615.6A
Authority: CN
Inventors: 刘小垒; 张小松; 牛伟纳; 胡斌; 王中晴
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2017-05-24
Filing date: 2017-05-24
Publication date: 2020-01-31
Anticipated expiration: 2037-05-24
Also published as: CN107220333A

Abstract

The invention belongs to the field of computer software application, and discloses character searching methods based on a Sunday algorithm, which match a pattern string and a text pattern string by judging whether a combination of the last bit of the text window and the lower bit characters outside the text window appears in the pattern string, wherein if the matching is successful, a program is ended, if the matching is unsuccessful, the text window is moved, and the judgment is continued by using the method until the text window reaches the tail end of the text string or the matching is successful, the program is ended.

Description

A Character Search Method Based on Sunday Algorithm

技术领域technical field

本发明涉及字符搜索，特别是一种基于Sunday算法的字符搜索方法，用于计算机网络领域中对字符进行搜索。The invention relates to character search, in particular to a character search method based on Sunday algorithm, which is used for character search in the field of computer networks.

背景技术Background technique

字符串是计算机科学中常见的基本概念，搜索问题也是计算机科学中的基本问题。随着互联网的日渐庞大，信息也是越来越多，如何在海量的信息中快速查找自己所要的信息是网络搜索研究的热点所在；在这其中，字符串匹配算法起着非常重要的作用，一个高效的字符串匹配算法，可以极大的提高搜索的效率和质量。字符串匹配在网络领域有着广泛的应用。比如，拼写检查、语言翻译、数据压缩、搜索引擎、网络入侵检测等。Strings are a common basic concept in computer science, and the search problem is also a basic problem in computer science. With the growing size of the Internet, there is more and more information. How to quickly find the information you want in the massive amount of information is the hotspot of network search research; in this, the string matching algorithm plays a very important role, a Efficient string matching algorithm can greatly improve the efficiency and quality of search. String matching has a wide range of applications in the network field. For example, spell checking, language translation, data compression, search engines, network intrusion detection, etc.

现有的常见的字符串匹配算法有Brute force、KMP、Boyer Moore、Sunday、robin_karp和bitap等。其中，Sunday算法是平均效率较高的一种匹配方案。Existing common string matching algorithms include Brute force, KMP, Boyer Moore, Sunday, robin_karp, and bitap. Among them, the Sunday algorithm is a matching scheme with higher average efficiency.

Sunday算法是Daniel M.Sunday于1990年提出的一种字符串模式匹配算法。其核心思想是：在匹配过程中，模式串并不被要求一定要按从左向右进行比较还是从右向左进行比较，它在发现不匹配时，算法能跳过尽可能多的字符以进行下一步的匹配，从而提高了匹配效率。Sunday algorithm is a string pattern matching algorithm proposed by Daniel M.Sunday in 1990. The core idea is: in the matching process, the pattern string is not required to be compared from left to right or from right to left. When it finds a mismatch, the algorithm can skip as many characters as possible to avoid The next step of matching is performed, thereby improving the matching efficiency.

Sunday算法思想跟BM算法很相似，在匹配失败时关注的是文本串中参加匹配的最末位字符的下一位字符。如果该字符没有在匹配串中出现则直接跳过，即移动步长＝匹配串长度+1；否则，同BM算法一样其移动步长＝匹配串中最右端的该字符到末尾的距离+1。The idea of the Sunday algorithm is very similar to the BM algorithm. When the match fails, the focus is on the next character of the last character in the text string that participates in the match. If the character does not appear in the matching string, skip it directly, that is, the moving step = the length of the matching string + 1; otherwise, the same as the BM algorithm, the moving step = the distance from the rightmost character in the matching string to the end + 1 .

Sunday算法在模式串很长的时候(很多实际应用中，模式串都会很长)，文本窗口外下一个字符在模式串中出现的概率会很高，这样会增加很多无效匹配，由此Sunday算法的效率会显著降低。When the pattern string of the Sunday algorithm is very long (in many practical applications, the pattern string will be very long), the probability of the next character outside the text window appearing in the pattern string will be very high, which will increase a lot of invalid matches, so the Sunday algorithm efficiency will be significantly reduced.

发明内容SUMMARY OF THE INVENTION

基于以上技术问题，本发明提供了一种基于Sunday算法的字符搜索方法,从而解决了在字符搜索中产生很多无效匹配的技术问题。Based on the above technical problems, the present invention provides a character search method based on the Sunday algorithm, thereby solving the technical problem of generating many invalid matches in the character search.

本发明的技术方案如下：The technical scheme of the present invention is as follows:

一种基于Sunday算法的字符搜索方法，包括以下步骤：A character search method based on Sunday algorithm, including the following steps:

步骤1：利用模式串中相邻两字符的信息构建辅助数组，所述模式串为待匹配的字符串；Step 1: use the information of two adjacent characters in the pattern string to construct an auxiliary array, and the pattern string is the character string to be matched;

步骤2：文本窗口与文本字符串左对齐，所述文本字符串为待搜索文本，所述文本窗口在所述文本字符串上滑动；利用所述辅助数组判断所述文本窗口内最后两个字符是否在模式串中出现；若未出现，跳转到步骤4；若出现，则将模式串中出现的字符与所述最后两个字符对齐，判断模式串与文本字符串对应位置上的字符是否匹配，若匹配，则匹配成功程序结束；若不匹配，则跳转到步骤3；Step 2: The text window is left aligned with the text string, the text string is the text to be searched, and the text window slides on the text string; the auxiliary array is used to determine the last two characters in the text window Whether it appears in the pattern string; if not, jump to step 4; if it does, align the characters appearing in the pattern string with the last two characters, and judge whether the characters in the corresponding positions of the pattern string and the text string are If it matches, the program ends if the match is successful; if it does not match, jump to step 3;

步骤3：采用Sunday算法进行下一次的匹配和跳转，若采用Sunday算法匹配成功，则程序结束；若采用Sunday算法匹配失败，则判断Sunday算法跳转的长度是否超过文本窗口的长度，若超过文本窗口长度，则跳转到步骤4；若未超过文本窗口长度，则重复步骤3；Step 3: Use the Sunday algorithm to perform the next match and jump. If the Sunday algorithm matches successfully, the program ends; if the Sunday algorithm fails to match, judge whether the length of the Sunday algorithm jump exceeds the length of the text window. If the length of the text window is exceeded, go to step 4; if the length of the text window is not exceeded, repeat step 3;

步骤4：文本窗口向右移动J个字符长度，将文本窗口中的末位字符和文本窗外与所述末位字符相邻的字符作为组合字符，利用辅助数组判断所述组合字符是否在模式串中出现；若出现，跳转到步骤5；若未出现，判断文本窗口是否超出文本字符串，若超出，则匹配失败程序结束；若未超出，重复步骤4的内容；其中模式串的长度和文本窗口的长度均为J个字符长度；Step 4: Move the text window to the right by J characters, take the last character in the text window and the character adjacent to the last character outside the text window as the combined character, and use the auxiliary array to determine whether the combined character is in the pattern string. If it appears, jump to step 5; if it does not appear, judge whether the text window exceeds the text string, if it does, the match fails and the program ends; if it does not exceed, repeat the content of step 4; the length of the pattern string and The length of the text window is J characters;

步骤5：模式串中出现的字符与所述组合字符对齐，判断模式串与文本字符串对应位置上的字符是否匹配；若匹配则匹配成功程序结束；若不匹配，跳转到步骤3。Step 5: The characters appearing in the pattern string are aligned with the combined characters, and it is judged whether the pattern string matches the characters in the corresponding positions of the text string;

进一步的，步骤1中辅助数组构建方式如下：Further, the auxiliary array construction method in step 1 is as follows:

S201：利用模式串中字符的ASCII码值和设定的index值，计算模式串中相邻两字符的哈希值，采用如下公式：S201: Use the ASCII code value of the character in the pattern string and the set index value to calculate the hash value of two adjacent characters in the pattern string, using the following formula:

H[i]＝A[i]×index+A[i+1]×(index+1) (1)H[i]=A[i]×index+A[i+1]×(index+1) (1)

其中，i表示模式串中字符的位置序号；H[i]表示模式串中相邻两字符的哈希值；A[i]表示第i个字符的ASCII码值；Among them, i represents the position number of the character in the pattern string; H[i] represents the hash value of two adjacent characters in the pattern string; A[i] represents the ASCII code value of the i-th character;

S202：利用所有字符两两组合产生的最大哈希值作为辅助数组中元素的个数，且字符两两组合产生的哈希值的大小为所述元素的位置序号；S202: the maximum hash value generated by the combination of all characters is used as the number of elements in the auxiliary array, and the size of the hash value generated by the combination of characters is the position sequence number of the element;

S203：将S201中计算出的哈希值映射到辅助数组中，具体方式如下：S203: Map the hash value calculated in S201 to the auxiliary array, and the specific method is as follows:

哈希值H[i]对应位置上的元素为1，若模式串中两组及以上相邻两字符组合产生的哈希值H[i]相同，则将该哈希值H[i]对应位置上的元素设为大于1的数，辅助数组中其余位置上的元素均为0。The element at the corresponding position of the hash value H[i] is 1. If the hash value H[i] generated by the combination of two or more adjacent characters in the pattern string is the same, then the hash value H[i] corresponds to The element at the position is set to a number greater than 1, and the elements at the remaining positions in the auxiliary array are all 0.

进一步的，所述利用辅助数组进行判断方法为：计算文本字符串中需要判断的相邻两字符的哈希值，利用该哈希值查找辅助数组中对应位置上的元素，若元素为1，则代表文本字符串中相邻两字符与模式串匹配或出现在模式串中；若元素为0，则代表文本字符串中相邻两字符与模式串不匹配或未出现在模式串中；若元素为大于1的数，则代表该字符组合与其余字符组合的哈希值相同，则进行步骤3的内容。Further, the judging method using the auxiliary array is: calculating the hash value of two adjacent characters that need to be judged in the text string, and using the hash value to find the element on the corresponding position in the auxiliary array, if the element is 1, It means that two adjacent characters in the text string match the pattern string or appear in the pattern string; if the element is 0, it means that the two adjacent characters in the text string do not match the pattern string or do not appear in the pattern string; if If the element is a number greater than 1, it means that the hash value of the combination of characters is the same as that of the other combinations of characters, and the content of step 3 is performed.

综上所述，由于采用了上述技术方案，本发明的有益效果是：To sum up, due to the adoption of the above-mentioned technical solutions, the beneficial effects of the present invention are:

采用字符的ASCII值和index值进行哈希值的计算，简便易行；利用组合字符形式来进行匹配，可有效的降低无效匹配的概率，从而提高Sunday算法的效率。Using the ASCII value and index value of the character to calculate the hash value is simple and easy; using the combination of characters to match can effectively reduce the probability of invalid matching, thereby improving the efficiency of the Sunday algorithm.

模式串一旦确定，通过预处理得出辅助数组后再匹配，提高在实际应用中的使用效率；比如网络入侵中恶意代码的检测、论文检索中大段文本的搜索，病毒多特征扫描，由于这些应用的模式串一般都很长，所以在本发明的应用上能有较大效率的提升。Once the pattern string is determined, the auxiliary array is obtained through preprocessing and then matched to improve the efficiency of use in practical applications; such as the detection of malicious code in network intrusion, the search of large texts in paper retrieval, and multi-feature scanning of viruses. The applied pattern string is generally very long, so the application of the present invention can greatly improve the efficiency.

附图说明Description of drawings

图1是本发明的流程图；Fig. 1 is the flow chart of the present invention;

图2是本发明的实验结果图Fig. 2 is the experimental result figure of the present invention

具体实施方式Detailed ways

本说明书中公开的所有特征，除了互相排斥的特征和/或步骤以外，均可以以任何方式组合。All features disclosed in this specification, except mutually exclusive features and/or steps, may be combined in any way.

下面结合附图对本发明作详细说明。The present invention will be described in detail below with reference to the accompanying drawings.

辅助数组构建方式如下：The auxiliary array is constructed as follows:

H[i]＝A[i]×index+A[i+1]×(index+1) (2)H[i]=A[i]×index+A[i+1]×(index+1) (2)

步骤2：文本窗口与文本字符串左对齐，所述文本字符串为待搜索文本，所述文本窗口在所述文本字符串上滑动；计算所述文本窗口内最后两个字符的哈希值，在辅助数组中，若计算出的哈希值对应位置上的元素为0，则代表最后两个字符未出现在模式串中，跳转到步骤4；若计算出的哈希值对应位置上的元素为1，则代表则代表最后两个字符出现在模式串中，则将模式串中出现的字符与所述最后两个字符对齐，判断模式串与文本字符串对应位置上的字符是否匹配，若匹配，则匹配成功程序结束；若不匹配，则跳转到步骤3；若计算出的哈希值对应位置上的元素为大于1的值，则代表存在另外组合字符的哈希值与该哈希值相同，为减小错误率，跳转到步骤3。Step 2: the text window is left aligned with the text string, the text string is the text to be searched, and the text window slides on the text string; the hash value of the last two characters in the text window is calculated, In the auxiliary array, if the element in the corresponding position of the calculated hash value is 0, it means that the last two characters do not appear in the pattern string, and jump to step 4; if the corresponding position of the calculated hash value is 0 If the element is 1, it means that the last two characters appear in the pattern string, then align the characters appearing in the pattern string with the last two characters, and judge whether the pattern string matches the characters in the corresponding position of the text string, If it matches, the matching program ends; if it does not match, jump to step 3; if the element at the corresponding position of the calculated hash value is a value greater than 1, it means that there is another combined character hash value and this The hash values are the same. To reduce the error rate, go to step 3.

步骤4：文本窗口向右移动J个字符长度，将文本窗口中的末位字符和文本窗外与所述末位字符相邻的字符作为组合字符，计算所述组合字符的哈希值，在辅助数组中，若计算出的哈希值对应位置上的元素为0，则代表最后两个字符未出现在模式串中，判断文本窗口是否超出文本字符串，若超出，则匹配失败程序结束；若未超出，重复步骤4的内容；若计算出的哈希值对应位置上的元素为1，则代表则代表最后两个字符出现在模式串中，跳转到步骤5；若计算出的哈希值对应位置上的元素为大于1的值，则代表存在另外组合字符的哈希值与该哈希值相同，为减小错误率，跳转到步骤3；其中模式串的长度和文本窗口的长度均为J个字符长度；Step 4: Move the text window to the right by J characters, take the last character in the text window and the character adjacent to the last character outside the text window as a combined character, calculate the hash value of the combined character, and use the auxiliary In the array, if the element at the corresponding position of the calculated hash value is 0, it means that the last two characters do not appear in the pattern string, and it is judged whether the text window exceeds the text string. If it exceeds, the matching fails and the program ends; if If it does not exceed, repeat the content of step 4; if the element in the corresponding position of the calculated hash value is 1, it means that the last two characters appear in the pattern string, and jump to step 5; if the calculated hash value is 1 If the element at the corresponding position of the value is a value greater than 1, it means that the hash value of another combined character is the same as the hash value. In order to reduce the error rate, go to step 3; the length of the pattern string is the same as that of the text window. The length is all J characters long;

本发明的工作原理是：构建辅助数组，文本窗口在文本字符串上滑动，滑动长度为模式串的长度；将文本窗口的最后一个字符与文本窗外的下一个字符进行组合，通过辅助数组来判断该字符组合是否出现在模式串中，以此来对字符串进行匹配。The working principle of the invention is as follows: an auxiliary array is constructed, the text window slides on the text string, and the sliding length is the length of the pattern string; the last character of the text window and the next character outside the text window are combined, and the auxiliary array is used to judge Whether the character combination appears in the pattern string to match the string.

下面，结合具体实施例来对本发明做进一步详细说明。Hereinafter, the present invention will be further described in detail with reference to specific embodiments.

具体实施例specific embodiment

具体实施例1Specific Example 1

以下为本发明C语言的实现程序：The following is the implementation program of the C language of the present invention:

具体实施例2Specific embodiment 2

以下为本发明的实验结果图，具体为与Sunday算法的效率比较图。The following is a graph of experimental results of the present invention, specifically an efficiency comparison graph with the Sunday algorithm.

生成10KB的随机纯英文字母和数字作为文本字符串，选取特定特点的字符串作模式串，用原Sunday算法和本发明采用的算法进行搜索，比较它们的实际效率，重复调用1000次求总时间(单位:毫秒)，得出下列统计图。Generate 10KB random pure English letters and numbers as text strings, select character strings with specific characteristics as pattern strings, use the original Sunday algorithm and the algorithm adopted by the present invention to search, compare their actual efficiency, and repeat the call 1000 times to find the total time (unit: milliseconds), the following statistical graphs are obtained.

图中，A代表Sunday算法，B代表本发明采用的算法；In the figure, A represents the Sunday algorithm, and B represents the algorithm adopted by the present invention;

柱体1采用的模式串为：abcdef；The pattern string used by cylinder 1 is: abcdef;

柱体2采用模式串为：abcdefghijklmnopqrstuvwxyz；Cylinder 2 adopts the pattern string: abcdefghijklmnopqrstuvwxyz;

柱体3采用模式串为：Cylinder 3 adopts the pattern string as:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ；abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ;

柱体4采用模式串为：Cylinder 4 adopts the pattern string as:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567

如上所述即为本发明的实施例。本发明不局限于上述实施方式，任何人应该得知在本发明的启示下做出的结构变化，凡是与本发明具有相同或相近的技术方案，均落入本发明的保护范围之内。The above are the embodiments of the present invention. The present invention is not limited to the above-mentioned embodiments, and anyone should know that structural changes made under the inspiration of the present invention, and all technical solutions that are the same or similar to those of the present invention, fall within the protection scope of the present invention.

Claims

1, character search method based on Sunday algorithm, which is characterized in that the method comprises the following steps:

step 1: constructing an auxiliary array by using information of two adjacent characters in a pattern string, wherein the pattern string is a character string to be matched;

step 2: the method comprises the following steps that a text window is aligned with a text character string on the left, the text character string is a text to be searched, and the text window slides on the text character string; judging whether the last two characters in the text window appear in the pattern string by using the auxiliary array; if not, jumping to the step 4; if yes, aligning the characters appearing in the pattern string with the last two characters, judging whether the characters on the corresponding positions of the pattern string and the text character string are matched, and if yes, ending the matching success program; if not, jumping to the step 3;

step 3, adopting a Sunday algorithm to carry out matching and jumping for times, if the Sunday algorithm is successfully matched, ending the program, if the Sunday algorithm is unsuccessfully matched, judging whether the jumping length of the Sunday algorithm exceeds the length of a text window, if the jumping length of the Sunday algorithm exceeds the length of the text window, jumping to the step 4, and if the jumping length of the text window is not exceeded, repeating the step 3;

and 4, step 4: moving the text window to the right by J character length, taking the last character in the text window and the character outside the text window and adjacent to the last character as combined characters, and judging whether the combined characters appear in the pattern string by using an auxiliary array; if yes, jumping to the step 5; if not, judging whether the text window exceeds the text character string, if so, ending the matching failure program; if not, repeating the content of the step 4; the length of the mode string and the length of the text window are J characters;

and 5: aligning the characters appearing in the pattern string with the combined characters, and judging whether the characters on the corresponding positions of the pattern string and the text character string are matched or not; if the matching is successful, the matching procedure is ended; and if not, jumping to the step 3.

2. The character searching method based on Sunday algorithm, as claimed in claim 1, wherein the auxiliary array is constructed in the following manner in step 1:

s201: calculating the hash value of two adjacent characters in the pattern string by using the ASCII code value of the characters in the pattern string and the set index value, and adopting the following formula:

H[i]＝A[i]×index+A[i+1]×(index+1)

wherein i represents the position serial number of the character in the pattern string; h [ i ] represents the hash value of two adjacent characters in the pattern string; a [ i ] represents an ASCII code value of the ith character;

s202: the maximum hash value generated by combining all characters in pairs is used as the number of elements in the auxiliary array, and the size of the hash value generated by combining all the characters in pairs is the position serial number of the elements;

s203: mapping the hash value calculated in S201 to the auxiliary array in the following specific manner:

the elements at the positions corresponding to the hash values Hi are 1, if the hash values Hi generated by two or more groups of adjacent character combinations in the pattern string are the same, the elements at the positions corresponding to the hash values Hi are set to be numbers larger than 1, and the elements at the rest positions in the auxiliary array are 0.

3. The character searching method based on Sunday algorithm, as claimed in claim 2, wherein the method for determining by using the auxiliary array comprises calculating the hash value of two adjacent characters to be determined in the text string, searching the element at the corresponding position in the auxiliary array by using the hash value, if the element is 1, then representing that the two adjacent characters in the text string match the pattern string or appear in the pattern string, if the element is 0, then representing that the two adjacent characters in the text string do not match the pattern string or do not appear in the pattern string, and if the element is a number greater than 1, then representing that the hash value of the character combination is the same as that of the other character combinations, then proceeding with the content of step 3.