[go: up one dir, main page]

CN109408681B - Character string matching method, device and equipment and readable storage medium - Google Patents

Character string matching method, device and equipment and readable storage medium Download PDF

Info

Publication number
CN109408681B
CN109408681B CN201811183579.4A CN201811183579A CN109408681B CN 109408681 B CN109408681 B CN 109408681B CN 201811183579 A CN201811183579 A CN 201811183579A CN 109408681 B CN109408681 B CN 109408681B
Authority
CN
China
Prior art keywords
string
character string
target
matched
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811183579.4A
Other languages
Chinese (zh)
Other versions
CN109408681A (en
Inventor
吴晓鸰
詹彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201811183579.4A priority Critical patent/CN109408681B/en
Publication of CN109408681A publication Critical patent/CN109408681A/en
Application granted granted Critical
Publication of CN109408681B publication Critical patent/CN109408681B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开了一种字符串匹配方法,包括:获取目标字符串和待匹配字符串,将目标字符串和待匹配字符串分别切分为预设长度的子字符串,得到子字符串集合;为子字符串集合中的每种子字符串分配哈希值,得到初始哈希序列;按照预设的逻辑运算更新初始哈希序列中的每个哈希值,得到目标哈希序列;根据目标哈希序列计算目标字符串和待匹配字符串的相似度,当相似度大于预设的相似度阈值时,确定目标字符串和待匹配字符串相匹配。该方法基于哈希值判别目标字符串和待匹配字符串是否匹配,其计算复杂度低,消耗的计算机资源较少,可以提高字符串匹配效率,降低计算成本。本发明公开的一种字符串匹配装置、设备及可读存储介质,也同样具有上述技术效果。

Figure 201811183579

The invention discloses a string matching method, comprising: acquiring a target string and a to-be-matched character string, dividing the target character string and the to-be-matched character string into substrings of preset lengths, respectively, to obtain a substring set; Assign a hash value to each substring in the substring set to obtain an initial hash sequence; update each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence; The Greek sequence calculates the similarity between the target string and the string to be matched, and when the similarity is greater than a preset similarity threshold, it is determined that the target string and the string to be matched match. The method discriminates whether the target string matches the to-be-matched string based on the hash value, has low computational complexity, consumes less computer resources, can improve string matching efficiency and reduce computational cost. The string matching device, device and readable storage medium disclosed by the present invention also have the above technical effects.

Figure 201811183579

Description

Character string matching method, device and equipment and readable storage medium
Technical Field
The present invention relates to the field of information science and technology, and more particularly, to a method, an apparatus, a device and a readable storage medium for matching a character string.
Background
With the popularization of computers and the increasing development of information engineering, obtaining information from the internet becomes an important way for people to live and work in daily life, and the internet gradually becomes a resource storage space with an overlarge information amount nowadays. Search engines have been developed to obtain useful information from the internet.
The search engine is called as a black box for information processing and acquisition, crawls information resources in the internet through a certain rule, processes and extracts information, and provides an interface for facilitating user query, thereby playing a role in guiding a user to acquire information. String matching is a key technique in search engines. When a user searches for information by keywords, a search engine searches in a database, and if the information which is consistent with the content required by the user is found, the inquired information is returned to the user so that the user can check and select.
To say thatIt is clear that the existing string matching generally discriminates the similarity of two strings based on a distance matrix. In the calculation process, the calculation complexity when filling the distance matrix is n2(n is the length of the string) level. That is, as the length of the string increases, the size and computational complexity of the distance matrix will grow exponentially, thus consuming enormous computer resources; when the number of query requests is large, the computer resources consumed by the query requests are immeasurable, so that the character string matching efficiency is reduced, and the calculation cost is increased.
Therefore, how to improve the character string matching efficiency is a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a character string matching method, a device, equipment and a readable storage medium, so as to improve the character string matching efficiency.
In order to achieve the above purpose, the embodiment of the present invention provides the following technical solutions:
a string matching method, comprising:
acquiring a target character string and a character string to be matched, and segmenting the target character string and the character string to be matched into sub-character strings with preset lengths respectively to obtain a sub-character string set;
distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence;
updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;
and calculating the similarity between the target character string and the character string to be matched according to the target hash sequence, and determining that the target character string is matched with the character string to be matched when the similarity is greater than a preset similarity threshold value.
Wherein, the allocating a hash value to each substring in the substring set includes:
and distributing a hash value to each seed character string in the sub character string set according to a preset increment rule.
Wherein, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence, including:
and updating each hash value in the initial hash sequence to be zero, and performing exclusive or operation on each updated hash value and a preset identification value to obtain the target hash sequence.
Wherein, the calculating the similarity between the target character string and the character string to be matched according to the target hash sequence comprises:
calculating the similarity between the target character string and the character string to be matched by adopting a similarity calculation formula;
wherein, the similarity calculation formula is as follows: sim (S)1,S2)=1-count(H)/n,sim(S1,S2) Representing the similarity of the target string and the string to be matched, S1Representing said target string, S2Representing the character string to be matched; count (h) represents the number of different substrings of the target string and the string to be matched in the target hash sequence; n represents the length of the character string, and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.
Before determining that the target character string is matched with the character string to be matched, the method further comprises the following steps:
calculating the offset of the target character string and the character string to be matched;
judging whether the offset is smaller than a preset offset threshold value or not;
and if so, executing the step of determining that the target character string is matched with the character string to be matched.
Wherein, the judging whether the offset is smaller than a preset offset threshold value comprises:
judging whether the ratio of the offset to the length of the character string is smaller than a preset offset threshold value or not;
and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.
A character string matching apparatus comprising:
the system comprises an acquisition module, a matching module and a matching module, wherein the acquisition module is used for acquiring a target character string and a character string to be matched, and respectively segmenting the target character string and the character string to be matched into sub-character strings with preset lengths to obtain a sub-character string set;
the distribution module is used for distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence;
the updating module is used for updating each hash value in the initial hash sequence according to preset logical operation to obtain a target hash sequence;
and the matching module is used for calculating the similarity between the target character string and the character string to be matched according to the target hash sequence, and when the similarity is greater than a preset similarity threshold value, determining that the target character string is matched with the character string to be matched.
Wherein, still include:
the calculation module is used for calculating the offset of the target character string and the character string to be matched;
the judging module is used for judging whether the offset is smaller than a preset offset threshold value or not;
and the execution module is used for determining that the target character string is matched with the character string to be matched when the offset is smaller than a preset offset threshold.
A character string matching apparatus comprising:
a memory for storing a computer program;
a processor for implementing the steps of the string matching method of any one of the above items when executing the computer program.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the character string matching method of any one of the above.
According to the scheme, the character string matching method provided by the embodiment of the invention comprises the following steps: acquiring a target character string and a character string to be matched, and segmenting the target character string and the character string to be matched into sub-character strings with preset lengths respectively to obtain a sub-character string set; distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence; updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence; and calculating the similarity between the target character string and the character string to be matched according to the target hash sequence, and determining that the target character string is matched with the character string to be matched when the similarity is greater than a preset similarity threshold value.
The method comprises the steps that firstly, a target character string and a character string to be matched are segmented to obtain a sub-character string set; further distributing a hash value for each seed character string in the sub character string set, and updating each hash value; and finally, calculating the similarity of the target character string and the character string to be matched based on the updated hash value, and determining whether the target character string and the character string to be matched are matched based on the similarity. The method judges whether the target character string is matched with the character string to be matched or not based on the hash value, has low calculation complexity and less consumed computer resources, and can improve the character string matching efficiency and reduce the calculation cost.
Accordingly, the character string matching device, the equipment and the readable storage medium provided by the embodiment of the invention also have the technical effects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a string matching method disclosed in the embodiments of the present invention;
FIG. 2 is a flow chart of another string matching method disclosed in the embodiments of the present invention;
FIG. 3 is a schematic diagram of a string matching apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a string matching apparatus according to an embodiment of the present invention;
fig. 5 is a graph comparing the performance of the present invention with that of the prior art.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a character string matching method, a device, equipment and a readable storage medium, which aim to improve the character string matching efficiency.
Referring to fig. 1, a method for matching a character string provided in an embodiment of the present invention includes:
s101, obtaining a target character string and a character string to be matched, and segmenting the target character string and the character string to be matched into sub character strings with preset lengths respectively to obtain a sub character string set;
assuming that the target character string is content and the character string to be matched is content, after the target character string and the character string to be matched are respectively segmented (the length of the sub-character string is 2), the sub-character string corresponding to the target character string comprises: co, on, nt, te, en, nt; the sub-character strings corresponding to the character strings to be matched comprise: co, on, nt, te, en, and nd, and Q represents the set of substrings, { co, on, nt, te, en, nt, co, on, nt, te, en, and nd }.
The length of the substring can be any number greater than or equal to 2, and the larger the substring is, the higher the accuracy is, but for the convenience of calculation, the larger the substring is not generally; of course, the length of the substring can be flexibly adjusted according to practical application, for example, a short substring is not suitable for a long segmentation distance.
S102, distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence;
wherein, the allocating a hash value to each substring in the substring set includes: and distributing a hash value to each seed character string in the sub character string set according to a preset increment rule.
Note that, when assigning a hash value, the type of the substring is used as a reference. Namely: the same hash value is assigned to the same substring in the set of substrings. The distribution can be carried out in the following way: traversing the sub-character string set, distributing a hash value 1 to the traversed first sub-character string co, distributing a hash value 2 to the second sub-character string on, distributing a hash value 3 to the third sub-character string nt, and repeating the steps until all sub-character strings in the sub-character string set are traversed, wherein the method can be regarded as an incremental rule. Wherein when traversing to substring co again, hash value 1 is still assigned. Of course, the increment rule can be similar to the arithmetic sequence, and the initial value (i.e. the initial term) and the size of the interval value (i.e. the tolerance) can be preset in advance and flexibly changed.
After the hash value is assigned to each substring in the substring set, the obtained initial hash sequence is {1, 2, 3, 4, 5, 6, 7}, and the corresponding relationship between each hash value and the substring is shown in table 1.
TABLE 1
Figure BDA0001825615500000051
Figure BDA0001825615500000061
Based on this, the length L of the hash sequence is represented by the largest hash value in the initial hash sequenceHI.e. LH=7。
S103, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;
wherein, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence, including: and updating each hash value in the initial hash sequence to be zero, and performing exclusive or operation on each updated hash value and a preset identification value to obtain the target hash sequence.
Specifically, each initial hash value is initialized to 0, and the maximum value in the initial hash values is used as the length of the initial hash sequence; and traversing the sub-character string set, carrying out XOR operation on the initial hash value corresponding to each sub-character string in the sub-character string set and a preset identification value 1, and repeatedly executing the XOR operation when a certain sub-character string appears for multiple times.
For example: for a substring "te", an initial hash value corresponding to the first appearing "te" is 4, and an exclusive-or operation is performed, that is, 4 ≦ 1, and the obtained hash value is 1, and when "te" appears for the second time, an exclusive-or operation is performed on the hash value 1 obtained for the previous time and a preset identification value 1, that is, 1 ≦ 0, and then a target hash value corresponding to the last obtained "te" is 0. By analogy, please refer to table 2 for the corresponding relationship among the strings, the initial hash value and the target hash value obtained by the xor operation and the preset identification value 1.
TABLE 2
Initializing hash sequences 0 0 0 0 0 0 0
Target hash sequence 0 0 0 0 0 1 1
S104, calculating the similarity between the target character string and the character string to be matched according to the target hash sequence;
s105, judging whether the similarity is larger than a preset similarity threshold value or not; if yes, executing S106; if not, executing S107;
s106, determining that the target character string is matched with the character string to be matched;
and S107, determining that the target character string is not matched with the character string to be matched.
Specifically, sim (S) may be used1,S2) The similarity between the target character string and the character string to be matched is calculated by the formula 1-count (H)/n, wherein sim (S)1,S2) Representing the similarity of the target string and the string to be matched, S1Representing a target string, S2Representing a character string to be matched; count (H) represents the number of different sub-character strings of the target character string and the character string to be matched in the target hash sequence; n represents the length of the character string, and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched. It should be noted that the number of different sub-strings of the target string and the string to be matched may be an estimated value.
When the obtained similarity is larger than a preset similarity threshold, determining that the target character string is matched with the character string to be matched; and when the obtained similarity is not greater than a preset similarity threshold, determining that the target character string is not matched with the character string to be matched.
The method includes the steps that firstly, a target character string and a character string to be matched are segmented to obtain a sub-character string set; further distributing a hash value for each seed character string in the sub character string set, and updating each hash value; and finally, calculating the similarity of the target character string and the character string to be matched based on the updated hash value, and determining whether the target character string and the character string to be matched are matched based on the similarity. The method judges whether the target character string is matched with the character string to be matched or not based on the hash value, has low calculation complexity and less consumed computer resources, and can improve the character string matching efficiency and reduce the calculation cost.
The embodiment of the invention discloses another character string matching method, and compared with the previous embodiment, the technical scheme is further explained and optimized in the embodiment.
Referring to fig. 2, another character string matching method provided in the embodiment of the present invention includes:
s201, obtaining a target character string and a character string to be matched, and segmenting the target character string and the character string to be matched into sub character strings with preset lengths respectively to obtain a sub character string set;
s202, distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence;
s203, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;
s204, calculating the similarity between the target character string and the character string to be matched according to the target hash sequence;
s205, judging whether the similarity is larger than a preset similarity threshold value; if yes, go to S206; if not, executing S209;
s206, calculating the offset of the target character string and the character string to be matched, and executing S207;
s207, judging whether the offset is smaller than a preset offset threshold value; if yes, go to step S208; if not, executing S209;
s208, determining that the target character string is matched with the character string to be matched;
s209, determining that the target character string is not matched with the character string to be matched.
In the embodiment, in order to further improve the accuracy of character string matching, an offset is introduced for corresponding calculation. The offset may verify the order of the characters and may further distinguish whether two strings match. When the similarity between the target character string and the character string to be matched is larger than a preset similarity threshold value, calculating the offset of the target character string and the character string to be matched; when the offset is smaller than a preset offset threshold, determining that the target character string is matched with the character string to be matched; otherwise the two do not match.
Wherein, the judging whether the offset is smaller than a preset offset threshold value comprises: judging whether the ratio of the offset to the length of the character string is smaller than a preset offset threshold value or not; and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.
Specifically, in the offset calculation principle, when one or more consecutive characters are equally offset, an offset is considered to exist, and the offset is considered to be within 2. The calculation process comprises the following steps:
initializing a 2X n/2X matrix for storing the character offset for each substring in the set Q, the first row indicating that 1 offset has occurred for the bit character, the second row indicating that 2 offsets have occurred for the bit character (1 indicating right offset, -1 indicating left offset, 0 indicating no offset).
The offset in the X matrix is calculated according to the following equation:
Figure BDA0001825615500000081
Figure BDA0001825615500000082
where bv is an offset and i and j are arbitrary indices.
Examples are as follows: when calculating the offset of two character strings of levenshtein and levenshteoin, the substring corresponding to levenshtein includes: le, ev, ve, en, ns, sh, ht, te, ei, in; co, on, nt, te, en, nt; the substring corresponding to levenshteoin includes: le, ev, vn, ns, sh, ht, te, eo, oi, in; it can be seen that ns, sh, ht, and te all have been shifted, but since ns, sh, ht, and te are four consecutive substrings, it can be considered that one shift has occurred, i.e. the shift amount is 1. Wherein, the offset matrix X of the two character strings of levenshtein and levenshteoin is as follows:
Figure BDA0001825615500000083
after the offset of the two character strings is obtained, whether the two character strings are matched can be determined by judging whether the offset is smaller than a preset offset threshold value; specifically, whether two character strings are matched or not can be determined by judging whether the ratio of the offset to the length of the character string is smaller than a preset offset threshold or not. Namely: and comparing the offset with the length of the character string, and judging whether the ratio of the offset to the length of the character string is smaller than an offset threshold value or not. Wherein the length of the character string is half of the sum of the lengths of the two character strings.
Among them, the offset matrix can be obtained as follows. Determining substring results Q1 and Q2 corresponding to the two character strings respectively; and calculating the first row of the offset matrix according to the first row, namely judging whether the characters of the corresponding position substrings are offset leftwards, rightwards or not. For each row of the offset matrix, Q1 is assignediAnd Q2i+1By comparison, the same offset is used to offset X in the matrix ,ij0, if not Xij1 is ═ 1; then Q1 is addediAnd Q2i-1Comparing X in the same offset matrix ij0, if not XijAnd-1, so that an offset matrix can be obtained.
Wherein, Q1iRepresenting a certain substring in Q1, Q2iA certain substring in Q2, for example: if Q1 ═ co, on, nt, te, en, nt, then the substrings therein can be sequentially represented as: { Q11、Q12、Q13、Q14、Q15、Q16Q2 can be expressed correspondingly by analogy, so that the description is not repeated herein.
ij denotes the rows and columns of the offset matrix, and when the X matrix is as follows, the first row elements (0001111000) may be sequentially denoted as X11、X12、X13、X14、X15、X16、X17、X18、X19、X20
Figure BDA0001825615500000091
As can be seen, the present embodiment provides another character string matching method, in which a target character string and a character string to be matched are firstly segmented to obtain a sub-character string set; further distributing a hash value for each seed character string in the sub character string set, and updating each hash value; and finally, calculating the similarity of the target character string and the character string to be matched based on the updated hash value, and determining whether the target character string and the character string to be matched are matched based on the similarity. The method judges the similarity of the target character string and the character string to be matched based on the hash value, judges whether the target character string and the character string to be matched are matched or not by referring to the offset, has low calculation complexity and less consumed computer resources, can improve the matching efficiency and accuracy of the character strings, and reduces the calculation cost.
Based on any of the above embodiments, it should be noted that the calculating the similarity between the target character string and the character string to be matched according to the target hash sequence includes: calculating the similarity between the target character string and the character string to be matched by adopting a similarity calculation formula;
wherein, the similarity calculation formula is as follows: sim (S)1,S2)=1-count(H)/n,sim(S1,S2) Representing similarity of the target character string and the character string to be matchedDegree, S1Representing said target string, S2Representing the character string to be matched; count (h) represents the number of different substrings of the target string and the string to be matched in the target hash sequence; n represents the length of the character string, and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.
Specifically, the target hash sequence may reflect the number of identical sub-strings and the number of different sub-strings of the target string and the string to be matched. But whether the two strings are similar cannot be distinguished by the difference alone; for example, if the number of different characters compared by a pair of character strings is 5, it can be clearly determined that the two character strings are not similar for the character string with the length of 8; however, if the two strings are DNA sequences consisting of millions of characters, the difference is negligible, i.e., the two strings are determined to be similar. It is determined whether there is a match between different character strings by the similarity degree with respect to the length of the character string.
Specifically, the target character string S can be calculated as follows1And the character string S to be matched2And judging whether the two are matched, wherein the similarity calculation formula is as follows:
Figure BDA0001825615500000101
where n denotes a character string length, and n ═ length (S)1)+length(S2) 2)/2; count (h) is the number of bits in the target hash sequence that are extracted for accumulation (i.e., the number of bits that have a value of 1).
When the judgment is carried out based on the similarity, a similarity threshold value can be preset in advance, namely: judging whether the similarity is greater than a preset similarity threshold value or not; and when the similarity is greater than a preset similarity threshold, determining that the two character strings are matched. The similarity threshold value can be set based on the requirements for similarity under different scenes.
In the following, a character string matching device provided by an embodiment of the present invention is introduced, and a character string matching device described below and a character string matching method described above may be referred to each other.
Referring to fig. 3, an embodiment of the present invention provides a character string matching apparatus, including:
the acquiring module 301 is configured to acquire a target character string and a character string to be matched, and divide the target character string and the character string to be matched into sub-character strings of preset lengths, respectively, to obtain a sub-character string set;
an allocating module 302, configured to allocate a hash value to each seed string in the substring set to obtain an initial hash sequence;
an updating module 303, configured to update each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;
a matching module 304, configured to calculate a similarity between the target character string and the character string to be matched according to the target hash sequence, and determine that the target character string and the character string to be matched are matched when the similarity is greater than a preset similarity threshold.
Wherein, still include:
the calculation module is used for calculating the offset of the target character string and the character string to be matched;
the judging module is used for judging whether the offset is smaller than a preset offset threshold value or not;
and the execution module is used for determining that the target character string is matched with the character string to be matched when the offset is smaller than a preset offset threshold.
Wherein, the judging module is specifically configured to:
judging whether the ratio of the offset to the length of the character string is smaller than a preset offset threshold value or not; and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.
Wherein the allocation module is specifically configured to:
and distributing a hash value to each seed character string in the sub character string set according to a preset increment rule.
Wherein the update module is specifically configured to:
and updating each hash value in the initial hash sequence to be zero, and performing exclusive or operation on each updated hash value and a preset identification value to obtain the target hash sequence.
Wherein the matching module is specifically configured to:
calculating the similarity between the target character string and the character string to be matched by adopting a similarity calculation formula; wherein, the similarity calculation formula is as follows: sim (S)1,S2)=1-count(H)/n,sim(S1,S2) Representing the similarity of the target string and the string to be matched, S1Representing said target string, S2Representing the character string to be matched; count (h) represents the number of different substrings of the target string and the string to be matched in the target hash sequence; n represents the length of the character string, and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.
It can be seen that this embodiment provides a character string matching device, including: the device comprises an acquisition module, a distribution module, an updating module and a matching module. Firstly, an acquisition module acquires a target character string and a character string to be matched, and the target character string and the character string to be matched are respectively segmented into sub character strings with preset lengths to obtain a sub character string set; then, the distribution module distributes a hash value to each seed character string in the sub character string set to obtain an initial hash sequence; the updating module updates each hash value in the initial hash sequence according to preset logical operation to obtain a target hash sequence; and finally, the matching module calculates the similarity between the target character string and the character string to be matched according to the target hash sequence, and when the similarity is greater than a preset similarity threshold value, the matching between the target character string and the character string to be matched is determined. Thus, all modules are in work and cooperation and take their own roles, so that the calculation complexity and the consumption of computer resources are reduced, and the character string matching efficiency and accuracy are improved.
In the following, a character string matching device provided by an embodiment of the present invention is introduced, and a character string matching device described below and a character string matching method and device described above may refer to each other.
Referring to fig. 4, an embodiment of the present invention provides a character string matching apparatus, including:
a memory 401 for storing a computer program;
a processor 402 for implementing the steps of the string matching method according to any of the above embodiments when executing the computer program.
In the following, a readable storage medium provided by an embodiment of the present invention is introduced, and a readable storage medium described below and a character string matching method, apparatus, and device described above may be referred to each other.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the string matching method as described in any of the above embodiments.
To further show the beneficial effects of the present invention (hash-sequence-based matching method) compared to the prior art (distance-matrix-based matching method), the following comparative experiment is now performed. See table 3 for strings to be compared.
TABLE 3
Character string to be compared Distance threshold Offset threshold
that 0.4 0.4
Student 0.4 0.4
The results of the two matching methods are shown in table 4.
TABLE 4
Figure BDA0001825615500000121
Figure BDA0001825615500000131
The experimental result shows that the invention is basically consistent with the output result of the distance matrix, and the invention is superior to the prior art in the aspect of the recall ratio.
The computational complexity of distance matrix based matching methods grows exponentially as the length of the string increases, (since the distance matrix needs to fill the entire matrix). Assuming a string length of n, the computational complexity of the distance matrix based matching method is close to n ^ 2. The invention is obviously superior to a distance matrix when matching long character strings, the implementation process of the invention comprises 6 steps of splitting content, distributing hash values, calculating hash sequences, calculating deviation values, traversing character sequences and the like, and the algorithm complexity is close to 6 n.
See fig. 5 for comparison of the performance of the present invention with that of the prior art. As can be seen from fig. 5, when the length of the character string (i.e., the matching content length in fig. 5) is less than 100, there is substantially no difference between the present invention and the prior art; when the length of the character string is greater than 100, the present invention exhibits better performance than the prior art. Therefore, when a large amount of long contents are matched, the method can save computer resources and improve matching efficiency.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1.一种字符串匹配方法,其特征在于,包括:1. a string matching method, is characterized in that, comprises: 获取目标字符串和待匹配字符串,将所述目标字符串和所述待匹配字符串分别切分为预设长度的子字符串,得到子字符串集合;若一个字符串为:T1、T2、T3……TN,预设长度为i,2≤i≤N,那么子字符串为:T1、T2、T3…Ti;T2、T3…Ti、Ti+1;T3…Ti、Ti+1、Ti+2;……;TN-(i-1)……TN-1、TNObtain the target string and the string to be matched, and divide the target string and the string to be matched into substrings of preset lengths to obtain a set of substrings; if a string is: T 1 , T 2 , T 3 ......T N , the preset length is i, 2≤i≤N, then the substrings are: T 1 , T 2 , T 3 ...... T i ; T 2 , T 3 ...... T i , T i+1 ; T 3 . . . T i , T i+1 , T i+2 ;  … ; T N-(i-1) …T N-1 , T N ; 为所述子字符串集合中的每种子字符串分配哈希值,得到初始哈希序列;Assign a hash value to each substring in the set of substrings to obtain an initial hash sequence; 按照预设的逻辑运算更新所述初始哈希序列中的每个哈希值,得到目标哈希序列;Update each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence; 根据所述目标哈希序列计算所述目标字符串和所述待匹配字符串的相似度,当所述相似度大于预设的相似度阈值时,确定所述目标字符串和所述待匹配字符串相匹配;Calculate the similarity between the target string and the to-be-matched character string according to the target hash sequence, and when the similarity is greater than a preset similarity threshold, determine the target string and the to-be-matched character string match; 其中,所述为所述子字符串集合中的每种子字符串分配哈希值,包括:Wherein, assigning a hash value to each substring in the substring set includes: 按照预设的递增规则为所述子字符串集合中的每种子字符串分配哈希值;Allocate a hash value to each substring in the set of substrings according to a preset incrementing rule; 其中,所述按照预设的逻辑运算更新所述初始哈希序列中的每个哈希值,得到目标哈希序列,包括:Wherein, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence, including: 将所述初始哈希序列中的每个哈希值更新为零,并将更新后的每个哈希值与预设标识值进行异或运算,得到所述目标哈希序列;其中,当某个子字符串出现多次时,则重复执行异或运算,也就是:将前次异或运算得到的哈希值与预设标识值再次进行异或运算。Update each hash value in the initial hash sequence to zero, and perform XOR operation on each updated hash value with a preset identification value to obtain the target hash sequence; When the substring appears multiple times, the XOR operation is repeatedly performed, that is, the XOR operation is performed again on the hash value obtained by the previous XOR operation and the preset identification value. 2.根据权利要求1所述的字符串匹配方法,其特征在于,所述根据所述目标哈希序列计算所述目标字符串和所述待匹配字符串的相似度,包括:2. The string matching method according to claim 1, wherein calculating the similarity between the target character string and the character string to be matched according to the target hash sequence comprises: 采用相似度计算公式计算所述目标字符串和所述待匹配字符串的相似度;Calculate the similarity between the target character string and the character string to be matched using a similarity calculation formula; 其中,所述相似度计算公式为:sim(S1,S2)=1-count(H)/n,sim(S1,S2)表示所述目标字符串和所述待匹配字符串的相似度,S1表示所述目标字符串,S2表示所述待匹配字符串;count(H)表示所述目标哈希序列中的所述目标字符串和所述待匹配字符串具有的不同子字符串的个数;n表示字符串长度,所述字符串长度为所述目标字符串和所述待匹配字符串长度之和的二分之一。Wherein, the similarity calculation formula is: sim(S 1 , S 2 )=1-count(H)/n, sim(S 1 , S 2 ) represents the difference between the target string and the to-be-matched string similarity, S 1 represents the target string, S 2 represents the to-be-matched string; count(H) represents the difference between the target string in the target hash sequence and the to-be-matched string The number of substrings; n represents the length of the string, which is half of the sum of the length of the target string and the string to be matched. 3.根据权利要求1或2所述的字符串匹配方法,其特征在于,所述确定所述目标字符串和所述待匹配字符串相匹配之前,还包括:3. The character string matching method according to claim 1 or 2, wherein before the determining that the target character string is matched with the character string to be matched, further comprises: 计算所述目标字符串和所述待匹配字符串的偏移量;Calculate the offset between the target string and the to-be-matched string; 判断所述偏移量是否小于预设的偏移量阈值;Determine whether the offset is less than a preset offset threshold; 若是,则执行所述确定所述目标字符串和所述待匹配字符串相匹配的步骤。If so, the step of determining that the target character string matches the to-be-matched character string is performed. 4.根据权利要求3所述的字符串匹配方法,其特征在于,所述判断所述偏移量是否小于预设的偏移量阈值,包括:4. The string matching method according to claim 3, wherein the judging whether the offset is less than a preset offset threshold comprises: 判断所述偏移量与字符串长度的比值是否小于预设的偏移量阈值;Determine whether the ratio of the offset to the string length is less than a preset offset threshold; 其中,所述字符串长度为所述目标字符串和所述待匹配字符串长度之和的二分之一。Wherein, the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched. 5.一种字符串匹配装置,其特征在于,包括:5. A string matching device, characterized in that, comprising: 获取模块,用于获取目标字符串和待匹配字符串,将所述目标字符串和所述待匹配字符串分别切分为预设长度的子字符串,得到子字符串集合;若一个字符串为:T1、T2、T3……TN,预设长度为i,2≤i≤N,那么子字符串为:T1、T2、T3…Ti;T2、T3…Ti、Ti+1;T3…Ti、Ti+1、Ti+2;……;TN-(i-1)……TN-1、TNThe acquisition module is used to acquire the target string and the string to be matched, and divide the target string and the string to be matched into substrings of preset lengths to obtain a set of substrings; if a string is: T 1 , T 2 , T 3 ......T N , the preset length is i, 2≤i≤N, then the substrings are: T 1 , T 2 , T 3 ...... T i ; T 2 , T 3 ...T i , T i+1 ; T 3 ...T i , T i+1 , T i+2 ; ...; T N-(i-1) ... T N-1 , T N ; 分配模块,用于为所述子字符串集合中的每种子字符串分配哈希值,得到初始哈希序列;an assigning module, configured to assign a hash value to each substring in the set of substrings to obtain an initial hash sequence; 更新模块,用于按照预设的逻辑运算更新所述初始哈希序列中的每个哈希值,得到目标哈希序列;an update module, configured to update each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence; 匹配模块,用于根据所述目标哈希序列计算所述目标字符串和所述待匹配字符串的相似度,当所述相似度大于预设的相似度阈值时,确定所述目标字符串和所述待匹配字符串相匹配;The matching module is used to calculate the similarity between the target string and the to-be-matched string according to the target hash sequence, and when the similarity is greater than a preset similarity threshold, determine the target string and the string to be matched. the to-be-matched strings match; 其中,所述分配模块具体用于:Wherein, the allocation module is specifically used for: 按照预设的递增规则为所述子字符串集合中的每种子字符串分配哈希值;Allocate a hash value to each substring in the set of substrings according to a preset incrementing rule; 其中,所述更新模块具体用于:Wherein, the update module is specifically used for: 将所述初始哈希序列中的每个哈希值更新为零,并将更新后的每个哈希值与预设标识值进行异或运算,得到所述目标哈希序列;其中,当某个子字符串出现多次时,则重复执行异或运算,也就是:将前次异或运算得到的哈希值与预设标识值再次进行异或运算。Update each hash value in the initial hash sequence to zero, and perform XOR operation on each updated hash value with a preset identification value to obtain the target hash sequence; When the substring appears multiple times, the XOR operation is repeatedly performed, that is, the XOR operation is performed again on the hash value obtained by the previous XOR operation and the preset identification value. 6.根据权利要求5所述的字符串匹配装置,其特征在于,还包括:6. The character string matching device according to claim 5, further comprising: 计算模块,用于计算所述目标字符串和所述待匹配字符串的偏移量;a calculation module for calculating the offset between the target character string and the character string to be matched; 判断模块,用于判断所述偏移量是否小于预设的偏移量阈值;a judgment module for judging whether the offset is less than a preset offset threshold; 执行模块,用于当所述偏移量小于预设的偏移量阈值时,确定所述目标字符串和所述待匹配字符串相匹配。An execution module, configured to determine that the target string matches the to-be-matched string when the offset is less than a preset offset threshold. 7.一种字符串匹配设备,其特征在于,包括:7. A string matching device, characterized in that, comprising: 存储器,用于存储计算机程序;memory for storing computer programs; 处理器,用于执行所述计算机程序时实现如权利要求1-4任意一项所述的字符串匹配方法的步骤。The processor is configured to implement the steps of the string matching method according to any one of claims 1-4 when executing the computer program. 8.一种可读存储介质,其特征在于,所述可读存储介质上存储有计算机程序,所述计算机程序被处理器执行时实现如权利要求1-4任意一项所述的字符串匹配方法的步骤。8. A readable storage medium, characterized in that, a computer program is stored on the readable storage medium, and when the computer program is executed by a processor, the string matching according to any one of claims 1-4 is realized steps of the method.
CN201811183579.4A 2018-10-11 2018-10-11 Character string matching method, device and equipment and readable storage medium Active CN109408681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811183579.4A CN109408681B (en) 2018-10-11 2018-10-11 Character string matching method, device and equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811183579.4A CN109408681B (en) 2018-10-11 2018-10-11 Character string matching method, device and equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN109408681A CN109408681A (en) 2019-03-01
CN109408681B true CN109408681B (en) 2021-11-26

Family

ID=65466921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811183579.4A Active CN109408681B (en) 2018-10-11 2018-10-11 Character string matching method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN109408681B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933644B (en) * 2019-03-22 2021-03-09 中国农业银行股份有限公司 Character string matching method and device
CN110287149A (en) * 2019-05-10 2019-09-27 同济大学 A Match Encoding Method Using Hash Search
CN110309374A (en) * 2019-05-22 2019-10-08 深圳市金泰克半导体有限公司 A kind of analytic method, system, terminal device and computer readable storage medium
CN111191087B (en) * 2019-12-31 2023-11-07 歌尔股份有限公司 Character matching method, terminal device and computer readable storage medium
CN111312297B (en) * 2020-02-14 2021-07-06 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device, storage medium and electronic equipment
CN111797285A (en) * 2020-06-30 2020-10-20 深圳壹账通智能科技有限公司 Character string fuzzy matching method, device, equipment and readable storage medium
CN112528101A (en) * 2020-12-22 2021-03-19 杭州趣链科技有限公司 Character string matching method, device, equipment and storage medium
CN113626651A (en) * 2021-08-04 2021-11-09 上海金仕达成括信息科技有限公司 Data matching method and device
CN114996347B (en) * 2022-06-24 2024-08-27 中国电信股份有限公司 User portrait management method, device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484730A (en) * 2015-08-31 2017-03-08 北京国双科技有限公司 Character string matching method and device
CN107396112A (en) * 2017-08-01 2017-11-24 深信服科技股份有限公司 A kind of coding method and device, computer installation, readable storage medium storing program for executing
US10019378B1 (en) * 2014-10-09 2018-07-10 Google Llc Addressing recent strings with ring buffer

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10019378B1 (en) * 2014-10-09 2018-07-10 Google Llc Addressing recent strings with ring buffer
CN106484730A (en) * 2015-08-31 2017-03-08 北京国双科技有限公司 Character string matching method and device
CN107396112A (en) * 2017-08-01 2017-11-24 深信服科技股份有限公司 A kind of coding method and device, computer installation, readable storage medium storing program for executing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"一种基于分段匹配的字符串匹配算法";刘许刚,黄海,马宏;《计算机应用与软件》;20120330;第I138-984页 *
"分布并行字符串相似性连接方法研究与应用";阮文洁;《中国优秀硕士学位论文全文数据库 信息科技辑》;20170715;第129-131页 *

Also Published As

Publication number Publication date
CN109408681A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN109408681B (en) Character string matching method, device and equipment and readable storage medium
JP3067980B2 (en) String matching method and apparatus
CN104462668B (en) Computer-implemented method for designing an industrial product modeled with a binary tree
Wu et al. A review for weighted minhash algorithms
CN108961141B (en) Vector map double zero watermarking method, system, storage medium and server
CN111801665B (en) Hierarchical Locality Sensitive Hash (LSH) partition index for big data applications
US10191998B1 (en) Methods of data reduction for parallel breadth-first search over graphs of connected data elements
WO2022105497A1 (en) Text screening method and apparatus, device, and storage medium
US20180143979A1 (en) Method for segmenting and indexing features from multidimensional data
CN113419792A (en) Event processing method and device, terminal equipment and storage medium
Behera et al. KmerEstimate: a streaming algorithm for estimating k-mer counts with optimal space usage
Tirthapura et al. Rectangle-efficient aggregation in spatial data streams
CN111339064A (en) Data tilt correction method, device and computer readable storage medium
CN119248860A (en) A cardinality estimation method for dynamic data streams
CN112380445B (en) Data query method, device, equipment and storage medium
CN115017269B (en) Data processing system for determining similar texts
TWI812524B (en) Method and system for execution of a conditional statement by an arithmetic and/or bitwise unit
CN113448957B (en) A data query method and device
Li et al. Computing Longest Increasing Subsequence Over Sequential Data Streams
US20100070511A1 (en) Reducing use of randomness in consistent uniform hashing
CN104166959B (en) A kind of accelerated method of image noise reduction and device
CN114898104A (en) Hash method and device for image features and processing equipment
CN114461606A (en) Data storage method and device, computer equipment and storage medium
CN105528414A (en) Crawling method and system for collecting deep web data complete set
CN111506756A (en) Similar picture searching method and system, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant