CN109408681B

CN109408681B - Character string matching method, device and equipment and readable storage medium

Info

Publication number: CN109408681B
Application number: CN201811183579.4A
Authority: CN
Inventors: 吴晓鸰; 詹彬
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2018-10-11
Filing date: 2018-10-11
Publication date: 2021-11-26
Anticipated expiration: 2038-10-11
Also published as: CN109408681A

Abstract

The invention discloses a string matching method, comprising: acquiring a target string and a to-be-matched character string, dividing the target character string and the to-be-matched character string into substrings of preset lengths, respectively, to obtain a substring set; Assign a hash value to each substring in the substring set to obtain an initial hash sequence; update each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence; The Greek sequence calculates the similarity between the target string and the string to be matched, and when the similarity is greater than a preset similarity threshold, it is determined that the target string and the string to be matched match. The method discriminates whether the target string matches the to-be-matched string based on the hash value, has low computational complexity, consumes less computer resources, can improve string matching efficiency and reduce computational cost. The string matching device, device and readable storage medium disclosed by the present invention also have the above technical effects.

Description

Character string matching method, device and equipment and readable storage medium

Technical Field

The present invention relates to the field of information science and technology, and more particularly, to a method, an apparatus, a device and a readable storage medium for matching a character string.

Background

With the popularization of computers and the increasing development of information engineering, obtaining information from the internet becomes an important way for people to live and work in daily life, and the internet gradually becomes a resource storage space with an overlarge information amount nowadays. Search engines have been developed to obtain useful information from the internet.

The search engine is called as a black box for information processing and acquisition, crawls information resources in the internet through a certain rule, processes and extracts information, and provides an interface for facilitating user query, thereby playing a role in guiding a user to acquire information. String matching is a key technique in search engines. When a user searches for information by keywords, a search engine searches in a database, and if the information which is consistent with the content required by the user is found, the inquired information is returned to the user so that the user can check and select.

To say thatIt is clear that the existing string matching generally discriminates the similarity of two strings based on a distance matrix. In the calculation process, the calculation complexity when filling the distance matrix is n²(n is the length of the string) level. That is, as the length of the string increases, the size and computational complexity of the distance matrix will grow exponentially, thus consuming enormous computer resources; when the number of query requests is large, the computer resources consumed by the query requests are immeasurable, so that the character string matching efficiency is reduced, and the calculation cost is increased.

Therefore, how to improve the character string matching efficiency is a problem to be solved by those skilled in the art.

Disclosure of Invention

The invention aims to provide a character string matching method, a device, equipment and a readable storage medium, so as to improve the character string matching efficiency.

In order to achieve the above purpose, the embodiment of the present invention provides the following technical solutions:

a string matching method, comprising:

acquiring a target character string and a character string to be matched, and segmenting the target character string and the character string to be matched into sub-character strings with preset lengths respectively to obtain a sub-character string set;

distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence;

updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;

and calculating the similarity between the target character string and the character string to be matched according to the target hash sequence, and determining that the target character string is matched with the character string to be matched when the similarity is greater than a preset similarity threshold value.

Wherein, the allocating a hash value to each substring in the substring set includes:

and distributing a hash value to each seed character string in the sub character string set according to a preset increment rule.

Wherein, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence, including:

and updating each hash value in the initial hash sequence to be zero, and performing exclusive or operation on each updated hash value and a preset identification value to obtain the target hash sequence.

Wherein, the calculating the similarity between the target character string and the character string to be matched according to the target hash sequence comprises:

calculating the similarity between the target character string and the character string to be matched by adopting a similarity calculation formula;

wherein, the similarity calculation formula is as follows: sim (S)₁,S₂)＝1-count(H)/n，sim(S₁,S₂) Representing the similarity of the target string and the string to be matched, S₁Representing said target string, S₂Representing the character string to be matched; count (h) represents the number of different substrings of the target string and the string to be matched in the target hash sequence; n represents the length of the character string, and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.

Before determining that the target character string is matched with the character string to be matched, the method further comprises the following steps:

calculating the offset of the target character string and the character string to be matched;

judging whether the offset is smaller than a preset offset threshold value or not;

and if so, executing the step of determining that the target character string is matched with the character string to be matched.

Wherein, the judging whether the offset is smaller than a preset offset threshold value comprises:

judging whether the ratio of the offset to the length of the character string is smaller than a preset offset threshold value or not;

and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.

A character string matching apparatus comprising:

the system comprises an acquisition module, a matching module and a matching module, wherein the acquisition module is used for acquiring a target character string and a character string to be matched, and respectively segmenting the target character string and the character string to be matched into sub-character strings with preset lengths to obtain a sub-character string set;

the distribution module is used for distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence;

the updating module is used for updating each hash value in the initial hash sequence according to preset logical operation to obtain a target hash sequence;

and the matching module is used for calculating the similarity between the target character string and the character string to be matched according to the target hash sequence, and when the similarity is greater than a preset similarity threshold value, determining that the target character string is matched with the character string to be matched.

Wherein, still include:

the calculation module is used for calculating the offset of the target character string and the character string to be matched;

the judging module is used for judging whether the offset is smaller than a preset offset threshold value or not;

and the execution module is used for determining that the target character string is matched with the character string to be matched when the offset is smaller than a preset offset threshold.

A character string matching apparatus comprising:

a memory for storing a computer program;

a processor for implementing the steps of the string matching method of any one of the above items when executing the computer program.

A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the character string matching method of any one of the above.

According to the scheme, the character string matching method provided by the embodiment of the invention comprises the following steps: acquiring a target character string and a character string to be matched, and segmenting the target character string and the character string to be matched into sub-character strings with preset lengths respectively to obtain a sub-character string set; distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence; updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence; and calculating the similarity between the target character string and the character string to be matched according to the target hash sequence, and determining that the target character string is matched with the character string to be matched when the similarity is greater than a preset similarity threshold value.

The method comprises the steps that firstly, a target character string and a character string to be matched are segmented to obtain a sub-character string set; further distributing a hash value for each seed character string in the sub character string set, and updating each hash value; and finally, calculating the similarity of the target character string and the character string to be matched based on the updated hash value, and determining whether the target character string and the character string to be matched are matched based on the similarity. The method judges whether the target character string is matched with the character string to be matched or not based on the hash value, has low calculation complexity and less consumed computer resources, and can improve the character string matching efficiency and reduce the calculation cost.

Accordingly, the character string matching device, the equipment and the readable storage medium provided by the embodiment of the invention also have the technical effects.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flow chart of a string matching method disclosed in the embodiments of the present invention;

FIG. 2 is a flow chart of another string matching method disclosed in the embodiments of the present invention;

FIG. 3 is a schematic diagram of a string matching apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a string matching apparatus according to an embodiment of the present invention;

fig. 5 is a graph comparing the performance of the present invention with that of the prior art.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses a character string matching method, a device, equipment and a readable storage medium, which aim to improve the character string matching efficiency.

Referring to fig. 1, a method for matching a character string provided in an embodiment of the present invention includes:

s101, obtaining a target character string and a character string to be matched, and segmenting the target character string and the character string to be matched into sub character strings with preset lengths respectively to obtain a sub character string set;

assuming that the target character string is content and the character string to be matched is content, after the target character string and the character string to be matched are respectively segmented (the length of the sub-character string is 2), the sub-character string corresponding to the target character string comprises: co, on, nt, te, en, nt; the sub-character strings corresponding to the character strings to be matched comprise: co, on, nt, te, en, and nd, and Q represents the set of substrings, { co, on, nt, te, en, nt, co, on, nt, te, en, and nd }.

The length of the substring can be any number greater than or equal to 2, and the larger the substring is, the higher the accuracy is, but for the convenience of calculation, the larger the substring is not generally; of course, the length of the substring can be flexibly adjusted according to practical application, for example, a short substring is not suitable for a long segmentation distance.

S102, distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence;

wherein, the allocating a hash value to each substring in the substring set includes: and distributing a hash value to each seed character string in the sub character string set according to a preset increment rule.

Note that, when assigning a hash value, the type of the substring is used as a reference. Namely: the same hash value is assigned to the same substring in the set of substrings. The distribution can be carried out in the following way: traversing the sub-character string set, distributing a hash value 1 to the traversed first sub-character string co, distributing a hash value 2 to the second sub-character string on, distributing a hash value 3 to the third sub-character string nt, and repeating the steps until all sub-character strings in the sub-character string set are traversed, wherein the method can be regarded as an incremental rule. Wherein when traversing to substring co again, hash value 1 is still assigned. Of course, the increment rule can be similar to the arithmetic sequence, and the initial value (i.e. the initial term) and the size of the interval value (i.e. the tolerance) can be preset in advance and flexibly changed.

After the hash value is assigned to each substring in the substring set, the obtained initial hash sequence is {1, 2, 3, 4, 5, 6, 7}, and the corresponding relationship between each hash value and the substring is shown in table 1.

TABLE 1

Based on this, the length L of the hash sequence is represented by the largest hash value in the initial hash sequence_HI.e. L_H＝7。

S103, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;

wherein, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence, including: and updating each hash value in the initial hash sequence to be zero, and performing exclusive or operation on each updated hash value and a preset identification value to obtain the target hash sequence.

Specifically, each initial hash value is initialized to 0, and the maximum value in the initial hash values is used as the length of the initial hash sequence; and traversing the sub-character string set, carrying out XOR operation on the initial hash value corresponding to each sub-character string in the sub-character string set and a preset identification value 1, and repeatedly executing the XOR operation when a certain sub-character string appears for multiple times.

For example: for a substring "te", an initial hash value corresponding to the first appearing "te" is 4, and an exclusive-or operation is performed, that is, 4 ≦ 1, and the obtained hash value is 1, and when "te" appears for the second time, an exclusive-or operation is performed on the hash value 1 obtained for the previous time and a preset identification value 1, that is, 1 ≦ 0, and then a target hash value corresponding to the last obtained "te" is 0. By analogy, please refer to table 2 for the corresponding relationship among the strings, the initial hash value and the target hash value obtained by the xor operation and the preset identification value 1.

TABLE 2

Initializing hash sequences	0	0	0	0	0	0	0
								Target hash sequence	0	0	0	0	0	1	1

S104, calculating the similarity between the target character string and the character string to be matched according to the target hash sequence;

s105, judging whether the similarity is larger than a preset similarity threshold value or not; if yes, executing S106; if not, executing S107;

s106, determining that the target character string is matched with the character string to be matched;

and S107, determining that the target character string is not matched with the character string to be matched.

Specifically, sim (S) may be used₁,S₂) The similarity between the target character string and the character string to be matched is calculated by the formula 1-count (H)/n, wherein sim (S)₁,S₂) Representing the similarity of the target string and the string to be matched, S₁Representing a target string, S₂Representing a character string to be matched; count (H) represents the number of different sub-character strings of the target character string and the character string to be matched in the target hash sequence; n represents the length of the character string, and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched. It should be noted that the number of different sub-strings of the target string and the string to be matched may be an estimated value.

When the obtained similarity is larger than a preset similarity threshold, determining that the target character string is matched with the character string to be matched; and when the obtained similarity is not greater than a preset similarity threshold, determining that the target character string is not matched with the character string to be matched.

The method includes the steps that firstly, a target character string and a character string to be matched are segmented to obtain a sub-character string set; further distributing a hash value for each seed character string in the sub character string set, and updating each hash value; and finally, calculating the similarity of the target character string and the character string to be matched based on the updated hash value, and determining whether the target character string and the character string to be matched are matched based on the similarity. The method judges whether the target character string is matched with the character string to be matched or not based on the hash value, has low calculation complexity and less consumed computer resources, and can improve the character string matching efficiency and reduce the calculation cost.

The embodiment of the invention discloses another character string matching method, and compared with the previous embodiment, the technical scheme is further explained and optimized in the embodiment.

Referring to fig. 2, another character string matching method provided in the embodiment of the present invention includes:

s201, obtaining a target character string and a character string to be matched, and segmenting the target character string and the character string to be matched into sub character strings with preset lengths respectively to obtain a sub character string set;

s202, distributing a hash value to each seed character string in the sub character string set to obtain an initial hash sequence;

s203, updating each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;

s204, calculating the similarity between the target character string and the character string to be matched according to the target hash sequence;

s205, judging whether the similarity is larger than a preset similarity threshold value; if yes, go to S206; if not, executing S209;

s206, calculating the offset of the target character string and the character string to be matched, and executing S207;

s207, judging whether the offset is smaller than a preset offset threshold value; if yes, go to step S208; if not, executing S209;

s208, determining that the target character string is matched with the character string to be matched;

s209, determining that the target character string is not matched with the character string to be matched.

In the embodiment, in order to further improve the accuracy of character string matching, an offset is introduced for corresponding calculation. The offset may verify the order of the characters and may further distinguish whether two strings match. When the similarity between the target character string and the character string to be matched is larger than a preset similarity threshold value, calculating the offset of the target character string and the character string to be matched; when the offset is smaller than a preset offset threshold, determining that the target character string is matched with the character string to be matched; otherwise the two do not match.

Wherein, the judging whether the offset is smaller than a preset offset threshold value comprises: judging whether the ratio of the offset to the length of the character string is smaller than a preset offset threshold value or not; and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.

Specifically, in the offset calculation principle, when one or more consecutive characters are equally offset, an offset is considered to exist, and the offset is considered to be within 2. The calculation process comprises the following steps:

initializing a 2X n/2X matrix for storing the character offset for each substring in the set Q, the first row indicating that 1 offset has occurred for the bit character, the second row indicating that 2 offsets have occurred for the bit character (1 indicating right offset, -1 indicating left offset, 0 indicating no offset).

The offset in the X matrix is calculated according to the following equation:

where bv is an offset and i and j are arbitrary indices.

Examples are as follows: when calculating the offset of two character strings of levenshtein and levenshteoin, the substring corresponding to levenshtein includes: le, ev, ve, en, ns, sh, ht, te, ei, in; co, on, nt, te, en, nt; the substring corresponding to levenshteoin includes: le, ev, vn, ns, sh, ht, te, eo, oi, in; it can be seen that ns, sh, ht, and te all have been shifted, but since ns, sh, ht, and te are four consecutive substrings, it can be considered that one shift has occurred, i.e. the shift amount is 1. Wherein, the offset matrix X of the two character strings of levenshtein and levenshteoin is as follows:

after the offset of the two character strings is obtained, whether the two character strings are matched can be determined by judging whether the offset is smaller than a preset offset threshold value; specifically, whether two character strings are matched or not can be determined by judging whether the ratio of the offset to the length of the character string is smaller than a preset offset threshold or not. Namely: and comparing the offset with the length of the character string, and judging whether the ratio of the offset to the length of the character string is smaller than an offset threshold value or not. Wherein the length of the character string is half of the sum of the lengths of the two character strings.

Among them, the offset matrix can be obtained as follows. Determining substring results Q1 and Q2 corresponding to the two character strings respectively; and calculating the first row of the offset matrix according to the first row, namely judging whether the characters of the corresponding position substrings are offset leftwards, rightwards or not. For each row of the offset matrix, Q1 is assigned_iAnd Q2_i+1By comparison, the same offset is used to offset X in the matrix _,ij0, if not X_ij1 is ═ 1; then Q1 is added_iAnd Q2_i-1Comparing X in the same offset matrix _ij0, if not X_ijAnd-1, so that an offset matrix can be obtained.

Wherein, Q1_iRepresenting a certain substring in Q1, Q2_iA certain substring in Q2, for example: if Q1 ═ co, on, nt, te, en, nt, then the substrings therein can be sequentially represented as: { Q1₁、Q1₂、Q1₃、Q1₄、Q1₅、Q1₆Q2 can be expressed correspondingly by analogy, so that the description is not repeated herein.

ij denotes the rows and columns of the offset matrix, and when the X matrix is as follows, the first row elements (0001111000) may be sequentially denoted as X₁₁、X₁₂、X₁₃、X₁₄、X₁₅、X₁₆、X₁₇、X₁₈、X₁₉、X₂₀。

As can be seen, the present embodiment provides another character string matching method, in which a target character string and a character string to be matched are firstly segmented to obtain a sub-character string set; further distributing a hash value for each seed character string in the sub character string set, and updating each hash value; and finally, calculating the similarity of the target character string and the character string to be matched based on the updated hash value, and determining whether the target character string and the character string to be matched are matched based on the similarity. The method judges the similarity of the target character string and the character string to be matched based on the hash value, judges whether the target character string and the character string to be matched are matched or not by referring to the offset, has low calculation complexity and less consumed computer resources, can improve the matching efficiency and accuracy of the character strings, and reduces the calculation cost.

Based on any of the above embodiments, it should be noted that the calculating the similarity between the target character string and the character string to be matched according to the target hash sequence includes: calculating the similarity between the target character string and the character string to be matched by adopting a similarity calculation formula;

wherein, the similarity calculation formula is as follows: sim (S)₁,S₂)＝1-count(H)/n，sim(S₁,S₂) Representing similarity of the target character string and the character string to be matchedDegree, S₁Representing said target string, S₂Representing the character string to be matched; count (h) represents the number of different substrings of the target string and the string to be matched in the target hash sequence; n represents the length of the character string, and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.

Specifically, the target hash sequence may reflect the number of identical sub-strings and the number of different sub-strings of the target string and the string to be matched. But whether the two strings are similar cannot be distinguished by the difference alone; for example, if the number of different characters compared by a pair of character strings is 5, it can be clearly determined that the two character strings are not similar for the character string with the length of 8; however, if the two strings are DNA sequences consisting of millions of characters, the difference is negligible, i.e., the two strings are determined to be similar. It is determined whether there is a match between different character strings by the similarity degree with respect to the length of the character string.

Specifically, the target character string S can be calculated as follows₁And the character string S to be matched₂And judging whether the two are matched, wherein the similarity calculation formula is as follows:

where n denotes a character string length, and n ═ length (S)₁)+length(S₂) 2)/2; count (h) is the number of bits in the target hash sequence that are extracted for accumulation (i.e., the number of bits that have a value of 1).

When the judgment is carried out based on the similarity, a similarity threshold value can be preset in advance, namely: judging whether the similarity is greater than a preset similarity threshold value or not; and when the similarity is greater than a preset similarity threshold, determining that the two character strings are matched. The similarity threshold value can be set based on the requirements for similarity under different scenes.

In the following, a character string matching device provided by an embodiment of the present invention is introduced, and a character string matching device described below and a character string matching method described above may be referred to each other.

Referring to fig. 3, an embodiment of the present invention provides a character string matching apparatus, including:

the acquiring module 301 is configured to acquire a target character string and a character string to be matched, and divide the target character string and the character string to be matched into sub-character strings of preset lengths, respectively, to obtain a sub-character string set;

an allocating module 302, configured to allocate a hash value to each seed string in the substring set to obtain an initial hash sequence;

an updating module 303, configured to update each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;

a matching module 304, configured to calculate a similarity between the target character string and the character string to be matched according to the target hash sequence, and determine that the target character string and the character string to be matched are matched when the similarity is greater than a preset similarity threshold.

Wherein, still include:

Wherein, the judging module is specifically configured to:

judging whether the ratio of the offset to the length of the character string is smaller than a preset offset threshold value or not; and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.

Wherein the allocation module is specifically configured to:

Wherein the update module is specifically configured to:

Wherein the matching module is specifically configured to:

calculating the similarity between the target character string and the character string to be matched by adopting a similarity calculation formula; wherein, the similarity calculation formula is as follows: sim (S)₁,S₂)＝1-count(H)/n，sim(S₁,S₂) Representing the similarity of the target string and the string to be matched, S₁Representing said target string, S₂Representing the character string to be matched; count (h) represents the number of different substrings of the target string and the string to be matched in the target hash sequence; n represents the length of the character string, and the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.

It can be seen that this embodiment provides a character string matching device, including: the device comprises an acquisition module, a distribution module, an updating module and a matching module. Firstly, an acquisition module acquires a target character string and a character string to be matched, and the target character string and the character string to be matched are respectively segmented into sub character strings with preset lengths to obtain a sub character string set; then, the distribution module distributes a hash value to each seed character string in the sub character string set to obtain an initial hash sequence; the updating module updates each hash value in the initial hash sequence according to preset logical operation to obtain a target hash sequence; and finally, the matching module calculates the similarity between the target character string and the character string to be matched according to the target hash sequence, and when the similarity is greater than a preset similarity threshold value, the matching between the target character string and the character string to be matched is determined. Thus, all modules are in work and cooperation and take their own roles, so that the calculation complexity and the consumption of computer resources are reduced, and the character string matching efficiency and accuracy are improved.

In the following, a character string matching device provided by an embodiment of the present invention is introduced, and a character string matching device described below and a character string matching method and device described above may refer to each other.

Referring to fig. 4, an embodiment of the present invention provides a character string matching apparatus, including:

a memory 401 for storing a computer program;

a processor 402 for implementing the steps of the string matching method according to any of the above embodiments when executing the computer program.

In the following, a readable storage medium provided by an embodiment of the present invention is introduced, and a readable storage medium described below and a character string matching method, apparatus, and device described above may be referred to each other.

A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the string matching method as described in any of the above embodiments.

To further show the beneficial effects of the present invention (hash-sequence-based matching method) compared to the prior art (distance-matrix-based matching method), the following comparative experiment is now performed. See table 3 for strings to be compared.

TABLE 3

Character string to be compared	Distance threshold	Offset threshold
			that	0.4	0.4
Student	0.4	0.4

The results of the two matching methods are shown in table 4.

TABLE 4

The experimental result shows that the invention is basically consistent with the output result of the distance matrix, and the invention is superior to the prior art in the aspect of the recall ratio.

The computational complexity of distance matrix based matching methods grows exponentially as the length of the string increases, (since the distance matrix needs to fill the entire matrix). Assuming a string length of n, the computational complexity of the distance matrix based matching method is close to n ^ 2. The invention is obviously superior to a distance matrix when matching long character strings, the implementation process of the invention comprises 6 steps of splitting content, distributing hash values, calculating hash sequences, calculating deviation values, traversing character sequences and the like, and the algorithm complexity is close to 6 n.

See fig. 5 for comparison of the performance of the present invention with that of the prior art. As can be seen from fig. 5, when the length of the character string (i.e., the matching content length in fig. 5) is less than 100, there is substantially no difference between the present invention and the prior art; when the length of the character string is greater than 100, the present invention exhibits better performance than the prior art. Therefore, when a large amount of long contents are matched, the method can save computer resources and improve matching efficiency.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. a string matching method, is characterized in that, comprises:

Obtain the target string and the string to be matched, and divide the target string and the string to be matched into substrings of preset lengths to obtain a set of substrings; if a string is: T ₁ , T ₂ , T ₃ ......T _N , the preset length is i, 2≤i≤N, then the substrings are: T ₁ , T ₂ , T ₃ ...... T _i ; T ₂ , T ₃ ...... T _i , T _i+1 ; T ₃ . . . T _i , T _i+1 , T _i+2 ; … ; T _N-(i-1) …T _N-1 , T _N ;

Assign a hash value to each substring in the set of substrings to obtain an initial hash sequence;

Update each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;

Calculate the similarity between the target string and the to-be-matched character string according to the target hash sequence, and when the similarity is greater than a preset similarity threshold, determine the target string and the to-be-matched character string match;

Wherein, assigning a hash value to each substring in the substring set includes:

Allocate a hash value to each substring in the set of substrings according to a preset incrementing rule;

Update each hash value in the initial hash sequence to zero, and perform XOR operation on each updated hash value with a preset identification value to obtain the target hash sequence; When the substring appears multiple times, the XOR operation is repeatedly performed, that is, the XOR operation is performed again on the hash value obtained by the previous XOR operation and the preset identification value.

2. The string matching method according to claim 1, wherein calculating the similarity between the target character string and the character string to be matched according to the target hash sequence comprises:

Calculate the similarity between the target character string and the character string to be matched using a similarity calculation formula;

Wherein, the similarity calculation formula is: sim(S ₁ , S ₂ )=1-count(H)/n, sim(S ₁ , S ₂ ) represents the difference between the target string and the to-be-matched string similarity, S ₁ represents the target string, S ₂ represents the to-be-matched string; count(H) represents the difference between the target string in the target hash sequence and the to-be-matched string The number of substrings; n represents the length of the string, which is half of the sum of the length of the target string and the string to be matched.

3. The character string matching method according to claim 1 or 2, wherein before the determining that the target character string is matched with the character string to be matched, further comprises:

Calculate the offset between the target string and the to-be-matched string;

Determine whether the offset is less than a preset offset threshold;

If so, the step of determining that the target character string matches the to-be-matched character string is performed.

4. The string matching method according to claim 3, wherein the judging whether the offset is less than a preset offset threshold comprises:

Determine whether the ratio of the offset to the string length is less than a preset offset threshold;

Wherein, the length of the character string is half of the sum of the lengths of the target character string and the character string to be matched.

5. A string matching device, characterized in that, comprising:

The acquisition module is used to acquire the target string and the string to be matched, and divide the target string and the string to be matched into substrings of preset lengths to obtain a set of substrings; if a string is: T ₁ , T ₂ , T ₃ ......T _N , the preset length is i, 2≤i≤N, then the substrings are: T ₁ , T ₂ , T ₃ ...... T _i ; T ₂ , T ₃ ...T _i , T _i+1 ; T ₃ ...T _i , T _i+1 , T _i+2 ; ...; T _N-(i-1) ... T _N-1 , T _N ;

an assigning module, configured to assign a hash value to each substring in the set of substrings to obtain an initial hash sequence;

an update module, configured to update each hash value in the initial hash sequence according to a preset logical operation to obtain a target hash sequence;

The matching module is used to calculate the similarity between the target string and the to-be-matched string according to the target hash sequence, and when the similarity is greater than a preset similarity threshold, determine the target string and the string to be matched. the to-be-matched strings match;

Wherein, the allocation module is specifically used for:

Wherein, the update module is specifically used for:

6. The character string matching device according to claim 5, further comprising:

a calculation module for calculating the offset between the target character string and the character string to be matched;

a judgment module for judging whether the offset is less than a preset offset threshold;

An execution module, configured to determine that the target string matches the to-be-matched string when the offset is less than a preset offset threshold.

7. A string matching device, characterized in that, comprising:

memory for storing computer programs;

The processor is configured to implement the steps of the string matching method according to any one of claims 1-4 when executing the computer program.

8. A readable storage medium, characterized in that, a computer program is stored on the readable storage medium, and when the computer program is executed by a processor, the string matching according to any one of claims 1-4 is realized steps of the method.