Disclosure of Invention
The invention aims to provide a character string matching method, a device, equipment and a readable storage medium, so as to improve the character string matching efficiency.
In order to achieve the above purpose, the embodiment of the present invention provides the following technical solutions:
a string matching method, comprising:
acquiring a target character string and a character string to be matched, and segmenting the target character string and the character string to be matched into sub-character strings with preset lengths respectively to obtain a sub-character string set;
distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence;
updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;
and calculating the similarity between the target character string and the character string to be matched according to the target hash sequence, and determining that the target character string is matched with the character string to be matched when the similarity is greater than a preset similarity threshold value.
Wherein, the allocating a hash value to each substring in the substring set includes:
and distributing a hash value to each seed character string in the sub character string set according to a preset increment rule.
Wherein, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence, including:
and updating each hash value in the initial hash sequence to be zero, and performing exclusive or operation on each updated hash value and a preset identification value to obtain the target hash sequence.
Wherein, the calculating the similarity between the target character string and the character string to be matched according to the target hash sequence comprises:
calculating the similarity between the target character string and the character string to be matched by adopting a similarity calculation formula;
wherein, the similarity calculation formula is as follows: sim (S)1,S2)=1-count(H)/n,sim(S1,S2) Representing the similarity of the target string and the string to be matched, S1Representing said target string, S2Representing the character string to be matched; count (h) represents the number of different substrings of the target string and the string to be matched in the target hash sequence; n represents the length of the character string, and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.
Before determining that the target character string is matched with the character string to be matched, the method further comprises the following steps:
calculating the offset of the target character string and the character string to be matched;
judging whether the offset is smaller than a preset offset threshold value or not;
and if so, executing the step of determining that the target character string is matched with the character string to be matched.
Wherein, the judging whether the offset is smaller than a preset offset threshold value comprises:
judging whether the ratio of the offset to the length of the character string is smaller than a preset offset threshold value or not;
and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.
A character string matching apparatus comprising:
the system comprises an acquisition module, a matching module and a matching module, wherein the acquisition module is used for acquiring a target character string and a character string to be matched, and respectively segmenting the target character string and the character string to be matched into sub-character strings with preset lengths to obtain a sub-character string set;
the distribution module is used for distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence;
the updating module is used for updating each hash value in the initial hash sequence according to preset logical operation to obtain a target hash sequence;
and the matching module is used for calculating the similarity between the target character string and the character string to be matched according to the target hash sequence, and when the similarity is greater than a preset similarity threshold value, determining that the target character string is matched with the character string to be matched.
Wherein, still include:
the calculation module is used for calculating the offset of the target character string and the character string to be matched;
the judging module is used for judging whether the offset is smaller than a preset offset threshold value or not;
and the execution module is used for determining that the target character string is matched with the character string to be matched when the offset is smaller than a preset offset threshold.
A character string matching apparatus comprising:
a memory for storing a computer program;
a processor for implementing the steps of the string matching method of any one of the above items when executing the computer program.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the character string matching method of any one of the above.
According to the scheme, the character string matching method provided by the embodiment of the invention comprises the following steps: acquiring a target character string and a character string to be matched, and segmenting the target character string and the character string to be matched into sub-character strings with preset lengths respectively to obtain a sub-character string set; distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence; updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence; and calculating the similarity between the target character string and the character string to be matched according to the target hash sequence, and determining that the target character string is matched with the character string to be matched when the similarity is greater than a preset similarity threshold value.
The method comprises the steps that firstly, a target character string and a character string to be matched are segmented to obtain a sub-character string set; further distributing a hash value for each seed character string in the sub character string set, and updating each hash value; and finally, calculating the similarity of the target character string and the character string to be matched based on the updated hash value, and determining whether the target character string and the character string to be matched are matched based on the similarity. The method judges whether the target character string is matched with the character string to be matched or not based on the hash value, has low calculation complexity and less consumed computer resources, and can improve the character string matching efficiency and reduce the calculation cost.
Accordingly, the character string matching device, the equipment and the readable storage medium provided by the embodiment of the invention also have the technical effects.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a character string matching method, a device, equipment and a readable storage medium, which aim to improve the character string matching efficiency.
Referring to fig. 1, a method for matching a character string provided in an embodiment of the present invention includes:
s101, obtaining a target character string and a character string to be matched, and segmenting the target character string and the character string to be matched into sub character strings with preset lengths respectively to obtain a sub character string set;
assuming that the target character string is content and the character string to be matched is content, after the target character string and the character string to be matched are respectively segmented (the length of the sub-character string is 2), the sub-character string corresponding to the target character string comprises: co, on, nt, te, en, nt; the sub-character strings corresponding to the character strings to be matched comprise: co, on, nt, te, en, and nd, and Q represents the set of substrings, { co, on, nt, te, en, nt, co, on, nt, te, en, and nd }.
The length of the substring can be any number greater than or equal to 2, and the larger the substring is, the higher the accuracy is, but for the convenience of calculation, the larger the substring is not generally; of course, the length of the substring can be flexibly adjusted according to practical application, for example, a short substring is not suitable for a long segmentation distance.
S102, distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence;
wherein, the allocating a hash value to each substring in the substring set includes: and distributing a hash value to each seed character string in the sub character string set according to a preset increment rule.
Note that, when assigning a hash value, the type of the substring is used as a reference. Namely: the same hash value is assigned to the same substring in the set of substrings. The distribution can be carried out in the following way: traversing the sub-character string set, distributing a hash value 1 to the traversed first sub-character string co, distributing a hash value 2 to the second sub-character string on, distributing a hash value 3 to the third sub-character string nt, and repeating the steps until all sub-character strings in the sub-character string set are traversed, wherein the method can be regarded as an incremental rule. Wherein when traversing to substring co again, hash value 1 is still assigned. Of course, the increment rule can be similar to the arithmetic sequence, and the initial value (i.e. the initial term) and the size of the interval value (i.e. the tolerance) can be preset in advance and flexibly changed.
After the hash value is assigned to each substring in the substring set, the obtained initial hash sequence is {1, 2, 3, 4, 5, 6, 7}, and the corresponding relationship between each hash value and the substring is shown in table 1.
TABLE 1
Based on this, the length L of the hash sequence is represented by the largest hash value in the initial hash sequenceHI.e. LH=7。
S103, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;
wherein, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence, including: and updating each hash value in the initial hash sequence to be zero, and performing exclusive or operation on each updated hash value and a preset identification value to obtain the target hash sequence.
Specifically, each initial hash value is initialized to 0, and the maximum value in the initial hash values is used as the length of the initial hash sequence; and traversing the sub-character string set, carrying out XOR operation on the initial hash value corresponding to each sub-character string in the sub-character string set and a preset identification value 1, and repeatedly executing the XOR operation when a certain sub-character string appears for multiple times.
For example: for a substring "te", an initial hash value corresponding to the first appearing "te" is 4, and an exclusive-or operation is performed, that is, 4 ≦ 1, and the obtained hash value is 1, and when "te" appears for the second time, an exclusive-or operation is performed on the hash value 1 obtained for the previous time and a preset identification value 1, that is, 1 ≦ 0, and then a target hash value corresponding to the last obtained "te" is 0. By analogy, please refer to table 2 for the corresponding relationship among the strings, the initial hash value and the target hash value obtained by the xor operation and the preset identification value 1.
TABLE 2
Initializing hash sequences
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
Target hash sequence
|
0
|
0
|
0
|
0
|
0
|
1
|
1 |
S104, calculating the similarity between the target character string and the character string to be matched according to the target hash sequence;
s105, judging whether the similarity is larger than a preset similarity threshold value or not; if yes, executing S106; if not, executing S107;
s106, determining that the target character string is matched with the character string to be matched;
and S107, determining that the target character string is not matched with the character string to be matched.
Specifically, sim (S) may be used1,S2) The similarity between the target character string and the character string to be matched is calculated by the formula 1-count (H)/n, wherein sim (S)1,S2) Representing the similarity of the target string and the string to be matched, S1Representing a target string, S2Representing a character string to be matched; count (H) represents the number of different sub-character strings of the target character string and the character string to be matched in the target hash sequence; n represents the length of the character string, and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched. It should be noted that the number of different sub-strings of the target string and the string to be matched may be an estimated value.
When the obtained similarity is larger than a preset similarity threshold, determining that the target character string is matched with the character string to be matched; and when the obtained similarity is not greater than a preset similarity threshold, determining that the target character string is not matched with the character string to be matched.
The method includes the steps that firstly, a target character string and a character string to be matched are segmented to obtain a sub-character string set; further distributing a hash value for each seed character string in the sub character string set, and updating each hash value; and finally, calculating the similarity of the target character string and the character string to be matched based on the updated hash value, and determining whether the target character string and the character string to be matched are matched based on the similarity. The method judges whether the target character string is matched with the character string to be matched or not based on the hash value, has low calculation complexity and less consumed computer resources, and can improve the character string matching efficiency and reduce the calculation cost.
The embodiment of the invention discloses another character string matching method, and compared with the previous embodiment, the technical scheme is further explained and optimized in the embodiment.
Referring to fig. 2, another character string matching method provided in the embodiment of the present invention includes:
s201, obtaining a target character string and a character string to be matched, and segmenting the target character string and the character string to be matched into sub character strings with preset lengths respectively to obtain a sub character string set;
s202, distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence;
s203, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;
s204, calculating the similarity between the target character string and the character string to be matched according to the target hash sequence;
s205, judging whether the similarity is larger than a preset similarity threshold value; if yes, go to S206; if not, executing S209;
s206, calculating the offset of the target character string and the character string to be matched, and executing S207;
s207, judging whether the offset is smaller than a preset offset threshold value; if yes, go to step S208; if not, executing S209;
s208, determining that the target character string is matched with the character string to be matched;
s209, determining that the target character string is not matched with the character string to be matched.
In the embodiment, in order to further improve the accuracy of character string matching, an offset is introduced for corresponding calculation. The offset may verify the order of the characters and may further distinguish whether two strings match. When the similarity between the target character string and the character string to be matched is larger than a preset similarity threshold value, calculating the offset of the target character string and the character string to be matched; when the offset is smaller than a preset offset threshold, determining that the target character string is matched with the character string to be matched; otherwise the two do not match.
Wherein, the judging whether the offset is smaller than a preset offset threshold value comprises: judging whether the ratio of the offset to the length of the character string is smaller than a preset offset threshold value or not; and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.
Specifically, in the offset calculation principle, when one or more consecutive characters are equally offset, an offset is considered to exist, and the offset is considered to be within 2. The calculation process comprises the following steps:
initializing a 2X n/2X matrix for storing the character offset for each substring in the set Q, the first row indicating that 1 offset has occurred for the bit character, the second row indicating that 2 offsets have occurred for the bit character (1 indicating right offset, -1 indicating left offset, 0 indicating no offset).
The offset in the X matrix is calculated according to the following equation:
where bv is an offset and i and j are arbitrary indices.
Examples are as follows: when calculating the offset of two character strings of levenshtein and levenshteoin, the substring corresponding to levenshtein includes: le, ev, ve, en, ns, sh, ht, te, ei, in; co, on, nt, te, en, nt; the substring corresponding to levenshteoin includes: le, ev, vn, ns, sh, ht, te, eo, oi, in; it can be seen that ns, sh, ht, and te all have been shifted, but since ns, sh, ht, and te are four consecutive substrings, it can be considered that one shift has occurred, i.e. the shift amount is 1. Wherein, the offset matrix X of the two character strings of levenshtein and levenshteoin is as follows:
after the offset of the two character strings is obtained, whether the two character strings are matched can be determined by judging whether the offset is smaller than a preset offset threshold value; specifically, whether two character strings are matched or not can be determined by judging whether the ratio of the offset to the length of the character string is smaller than a preset offset threshold or not. Namely: and comparing the offset with the length of the character string, and judging whether the ratio of the offset to the length of the character string is smaller than an offset threshold value or not. Wherein the length of the character string is half of the sum of the lengths of the two character strings.
Among them, the offset matrix can be obtained as follows. Determining substring results Q1 and Q2 corresponding to the two character strings respectively; and calculating the first row of the offset matrix according to the first row, namely judging whether the characters of the corresponding position substrings are offset leftwards, rightwards or not. For each row of the offset matrix, Q1 is assignediAnd Q2i+1By comparison, the same offset is used to offset X in the matrix ,ij0, if not Xij1 is ═ 1; then Q1 is addediAnd Q2i-1Comparing X in the same offset matrix ij0, if not XijAnd-1, so that an offset matrix can be obtained.
Wherein, Q1iRepresenting a certain substring in Q1, Q2iA certain substring in Q2, for example: if Q1 ═ co, on, nt, te, en, nt, then the substrings therein can be sequentially represented as: { Q11、Q12、Q13、Q14、Q15、Q16Q2 can be expressed correspondingly by analogy, so that the description is not repeated herein.
ij denotes the rows and columns of the offset matrix, and when the X matrix is as follows, the first row elements (0001111000) may be sequentially denoted as X11、X12、X13、X14、X15、X16、X17、X18、X19、X20。
As can be seen, the present embodiment provides another character string matching method, in which a target character string and a character string to be matched are firstly segmented to obtain a sub-character string set; further distributing a hash value for each seed character string in the sub character string set, and updating each hash value; and finally, calculating the similarity of the target character string and the character string to be matched based on the updated hash value, and determining whether the target character string and the character string to be matched are matched based on the similarity. The method judges the similarity of the target character string and the character string to be matched based on the hash value, judges whether the target character string and the character string to be matched are matched or not by referring to the offset, has low calculation complexity and less consumed computer resources, can improve the matching efficiency and accuracy of the character strings, and reduces the calculation cost.
Based on any of the above embodiments, it should be noted that the calculating the similarity between the target character string and the character string to be matched according to the target hash sequence includes: calculating the similarity between the target character string and the character string to be matched by adopting a similarity calculation formula;
wherein, the similarity calculation formula is as follows: sim (S)1,S2)=1-count(H)/n,sim(S1,S2) Representing similarity of the target character string and the character string to be matchedDegree, S1Representing said target string, S2Representing the character string to be matched; count (h) represents the number of different substrings of the target string and the string to be matched in the target hash sequence; n represents the length of the character string, and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.
Specifically, the target hash sequence may reflect the number of identical sub-strings and the number of different sub-strings of the target string and the string to be matched. But whether the two strings are similar cannot be distinguished by the difference alone; for example, if the number of different characters compared by a pair of character strings is 5, it can be clearly determined that the two character strings are not similar for the character string with the length of 8; however, if the two strings are DNA sequences consisting of millions of characters, the difference is negligible, i.e., the two strings are determined to be similar. It is determined whether there is a match between different character strings by the similarity degree with respect to the length of the character string.
Specifically, the target character string S can be calculated as follows1And the character string S to be matched2And judging whether the two are matched, wherein the similarity calculation formula is as follows:
where n denotes a character string length, and n ═ length (S)1)+length(S2) 2)/2; count (h) is the number of bits in the target hash sequence that are extracted for accumulation (i.e., the number of bits that have a value of 1).
When the judgment is carried out based on the similarity, a similarity threshold value can be preset in advance, namely: judging whether the similarity is greater than a preset similarity threshold value or not; and when the similarity is greater than a preset similarity threshold, determining that the two character strings are matched. The similarity threshold value can be set based on the requirements for similarity under different scenes.
In the following, a character string matching device provided by an embodiment of the present invention is introduced, and a character string matching device described below and a character string matching method described above may be referred to each other.
Referring to fig. 3, an embodiment of the present invention provides a character string matching apparatus, including:
the acquiring module 301 is configured to acquire a target character string and a character string to be matched, and divide the target character string and the character string to be matched into sub-character strings of preset lengths, respectively, to obtain a sub-character string set;
an allocating module 302, configured to allocate a hash value to each seed string in the substring set to obtain an initial hash sequence;
an updating module 303, configured to update each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;
a matching module 304, configured to calculate a similarity between the target character string and the character string to be matched according to the target hash sequence, and determine that the target character string and the character string to be matched are matched when the similarity is greater than a preset similarity threshold.
Wherein, still include:
the calculation module is used for calculating the offset of the target character string and the character string to be matched;
the judging module is used for judging whether the offset is smaller than a preset offset threshold value or not;
and the execution module is used for determining that the target character string is matched with the character string to be matched when the offset is smaller than a preset offset threshold.
Wherein, the judging module is specifically configured to:
judging whether the ratio of the offset to the length of the character string is smaller than a preset offset threshold value or not; and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.
Wherein the allocation module is specifically configured to:
and distributing a hash value to each seed character string in the sub character string set according to a preset increment rule.
Wherein the update module is specifically configured to:
and updating each hash value in the initial hash sequence to be zero, and performing exclusive or operation on each updated hash value and a preset identification value to obtain the target hash sequence.
Wherein the matching module is specifically configured to:
calculating the similarity between the target character string and the character string to be matched by adopting a similarity calculation formula; wherein, the similarity calculation formula is as follows: sim (S)1,S2)=1-count(H)/n,sim(S1,S2) Representing the similarity of the target string and the string to be matched, S1Representing said target string, S2Representing the character string to be matched; count (h) represents the number of different substrings of the target string and the string to be matched in the target hash sequence; n represents the length of the character string, and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.
It can be seen that this embodiment provides a character string matching device, including: the device comprises an acquisition module, a distribution module, an updating module and a matching module. Firstly, an acquisition module acquires a target character string and a character string to be matched, and the target character string and the character string to be matched are respectively segmented into sub character strings with preset lengths to obtain a sub character string set; then, the distribution module distributes a hash value to each seed character string in the sub character string set to obtain an initial hash sequence; the updating module updates each hash value in the initial hash sequence according to preset logical operation to obtain a target hash sequence; and finally, the matching module calculates the similarity between the target character string and the character string to be matched according to the target hash sequence, and when the similarity is greater than a preset similarity threshold value, the matching between the target character string and the character string to be matched is determined. Thus, all modules are in work and cooperation and take their own roles, so that the calculation complexity and the consumption of computer resources are reduced, and the character string matching efficiency and accuracy are improved.
In the following, a character string matching device provided by an embodiment of the present invention is introduced, and a character string matching device described below and a character string matching method and device described above may refer to each other.
Referring to fig. 4, an embodiment of the present invention provides a character string matching apparatus, including:
a memory 401 for storing a computer program;
a processor 402 for implementing the steps of the string matching method according to any of the above embodiments when executing the computer program.
In the following, a readable storage medium provided by an embodiment of the present invention is introduced, and a readable storage medium described below and a character string matching method, apparatus, and device described above may be referred to each other.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the string matching method as described in any of the above embodiments.
To further show the beneficial effects of the present invention (hash-sequence-based matching method) compared to the prior art (distance-matrix-based matching method), the following comparative experiment is now performed. See table 3 for strings to be compared.
TABLE 3
Character string to be compared
|
Distance threshold
|
Offset threshold
|
that
|
0.4
|
0.4
|
Student
|
0.4
|
0.4 |
The results of the two matching methods are shown in table 4.
TABLE 4
The experimental result shows that the invention is basically consistent with the output result of the distance matrix, and the invention is superior to the prior art in the aspect of the recall ratio.
The computational complexity of distance matrix based matching methods grows exponentially as the length of the string increases, (since the distance matrix needs to fill the entire matrix). Assuming a string length of n, the computational complexity of the distance matrix based matching method is close to n ^ 2. The invention is obviously superior to a distance matrix when matching long character strings, the implementation process of the invention comprises 6 steps of splitting content, distributing hash values, calculating hash sequences, calculating deviation values, traversing character sequences and the like, and the algorithm complexity is close to 6 n.
See fig. 5 for comparison of the performance of the present invention with that of the prior art. As can be seen from fig. 5, when the length of the character string (i.e., the matching content length in fig. 5) is less than 100, there is substantially no difference between the present invention and the prior art; when the length of the character string is greater than 100, the present invention exhibits better performance than the prior art. Therefore, when a large amount of long contents are matched, the method can save computer resources and improve matching efficiency.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.