[go: up one dir, main page]

CN119336866A - Standard text review methods, devices, equipment and media - Google Patents

Standard text review methods, devices, equipment and media Download PDF

Info

Publication number
CN119336866A
CN119336866A CN202411355837.8A CN202411355837A CN119336866A CN 119336866 A CN119336866 A CN 119336866A CN 202411355837 A CN202411355837 A CN 202411355837A CN 119336866 A CN119336866 A CN 119336866A
Authority
CN
China
Prior art keywords
matching
character block
jump
suffix
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411355837.8A
Other languages
Chinese (zh)
Inventor
王立玺
邹佳桐
祝贺
吕千千
胡晨
安淑荻
魏梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
China Electronics Standardization Institute
Original Assignee
BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
China Electronics Standardization Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD, China Electronics Standardization Institute filed Critical BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
Priority to CN202411355837.8A priority Critical patent/CN119336866A/en
Publication of CN119336866A publication Critical patent/CN119336866A/en
Pending legal-status Critical Current

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a standard text examination method, device, equipment and medium, which comprises the steps of obtaining a mode string set, wherein the mode string set comprises a plurality of mode strings, constructing a skip list and a hash list according to suffix character blocks of all mode string matching windows, constructing a prefix list according to prefix character blocks of the matching windows, constructing a displacement list according to non-suffix character blocks of the matching windows, obtaining a combined character block, constructing a bloom filter according to the combined character block, constructing the combined character block by the suffix character blocks of the current matching windows and the prefix character blocks of the next matching windows, and carrying out mobile scanning matching on the standard text through the matching windows to be checked by adopting the hash list, the skip list, the prefix list, the displacement list and the bloom filter to obtain a text examination result. The scheme can avoid the condition that accurate matching is not needed, thereby accelerating the matching process, having the advantages of large jump distance and less accurate verification times, and further improving the text examination efficiency.

Description

Standard text examination method, device, equipment and medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a standard text inspection method, apparatus, device, and medium.
Background
With the development of information technology and the explosive growth of internet content, text information has become an important way for people to communicate, learn and acquire knowledge. Wherein, the standardized file is "a file formulated by a standardized activity". The normalized file belongs to one of the normalized files, and is mainly different from other normalized files in the formation of process entropy in that it is generated in the normalization activity. In large amounts of standard text data, there may be a variety of quality problems such as grammar mistakes, spelling errors, logic inaugurations, semantic ambiguity, etc. In order to ensure the accuracy and consistency of standard documents, the study of examining standard texts is particularly important.
At present, in the related art, a Wu-Manber multi-mode matching algorithm is adopted to detect specific words or phrases in a text in the process of examining a standard text, however, the scheme has some limitations, such as small jump distance and large hash conflict, and needs to take a long time when accurately examining, so that the text examination efficiency is reduced.
Disclosure of Invention
In view of the foregoing, the present invention provides a standard text inspection method, apparatus, device and medium, which at least partially solve the problems in the prior art.
In a first aspect, an embodiment of the present application provides a standard text review method, including:
The method comprises the steps of obtaining a pattern string set, wherein the pattern string set comprises a plurality of pattern strings;
Constructing a skip list and a hash list according to the suffix character blocks of all the pattern string matching windows, constructing a prefix list according to the prefix character blocks of the matching windows, and constructing a displacement list according to the non-suffix character blocks of the matching windows;
The method comprises the steps of obtaining a combined character block, and constructing a bloom filter according to the combined character block, wherein the combined character block consists of a suffix character block of a current matching window and a prefix character block of a next matching window;
and carrying out mobile scanning matching on the standard text through a matching window to be checked by adopting the hash table, the jump table, the prefix table, the displacement table and the bloom filter to obtain a text examination result.
Optionally, constructing a displacement table according to the non-suffix character blocks of the matching window includes:
The mode strings are coded, and the number of the mode strings in the mode string set and the length value of the shortest mode string in the mode strings are determined, wherein the mode strings comprise suffix character blocks and non-suffix character blocks;
calculating a character block size value according to the number of the mode strings and the length value of the shortest mode string;
calculating the non-suffix character block and a first skip value corresponding to the non-suffix character block based on the character block size value;
and correspondingly storing the non-suffix character blocks and the first jump values corresponding to the non-suffix character blocks to construct the displacement table, wherein the first jump values corresponding to the suffix character blocks in the displacement table are zero.
Optionally, constructing a jump table according to suffix character blocks of all the pattern string matching windows includes:
When the suffix character block is in a first condition, the second jump value is m-B+4, wherein the first condition is that the suffix character block does not appear in a matching window of other mode strings, the suffix character block is a character block which is currently verified and is positioned at the rightmost end of the matching window and has a character block size of B, m is the length value of the shortest mode string, and B is the size value of the character block;
when the suffix character block is in a second condition, the second jump value is m-B, wherein the second condition is that the suffix character block only appears at the leftmost end of other mode strings;
When the suffix character block is in other cases, the second jump value is 4, wherein the other cases are the cases except the first case and the second case;
And constructing the jump table according to the second jump value and the suffix character block.
Optionally, performing mobile scanning matching on the standard text through a matching window to be checked by using the hash table, the jump table, the prefix table, the displacement table and the bloom filter to obtain a text examination result, including:
acquiring a target character block of the matching window to be verified and a first jump value of the target character block in the displacement table;
judging whether the first jump value is larger than zero or not;
When the first jump value is not greater than zero, constructing a target combined character block from the suffix character block of the matching window and the prefix character block of the next matching window;
The bloom filter is adopted to check the target combined character block in the standard text to obtain a check result, wherein the check result is used for representing check failure or check success;
and according to the verification result, adopting the prefix table, the hash table and the jump table to scan and match the standard text, and obtaining the text examination result.
Optionally, after determining whether the first jump value is greater than zero, the method further comprises:
Skipping the first jump value positions when the first jump value is greater than zero;
And continuing to search and check from the standard text until the standard text is checked, and obtaining the text examination result.
Optionally, according to the verification result, the prefix table, the hash table and the skip table are adopted to scan and match the standard text, so as to obtain the text examination result, which includes:
when the verification result represents that verification is successful, verifying the target character block of the matching window based on the prefix table and the hash table to obtain the text examination result;
and when the verification result represents verification failure, scanning and matching the standard text based on the jump table to obtain the text examination result.
Optionally, scanning and matching the standard text based on the jump table to obtain the text examination result, including:
acquiring a second jump value of the target character block in the jump table;
And skipping the positions of the second skipping value, and continuing to search and check from the standard text until the standard text is checked, so as to obtain the text examination result.
In a second aspect, the present application provides a standard text inspection apparatus, the apparatus comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a mode string set, and the mode string set comprises a plurality of mode strings;
the first construction module is used for constructing a skip list and a hash list according to the suffix character blocks of all the pattern string matching windows, constructing a prefix list according to the prefix character blocks of the matching windows, and constructing a displacement list according to the non-suffix character blocks of the matching windows;
the second construction module is used for acquiring a combined character block and constructing a bloom filter according to the combined character block, wherein the combined character block consists of a suffix character block of a current matching window and a prefix character block of a next matching window;
And the examination module is used for carrying out mobile scanning matching on the standard text through a matching window to be checked by adopting the hash table, the jump table, the prefix table, the displacement table and the bloom filter to obtain a text examination result.
In a third aspect, an embodiment of the present application provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the standard text inspection method as described above in the first aspect when the program is executed.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program for implementing the standard text inspection method of the first aspect above.
The method, the device, the equipment and the medium for inspecting the standard text provided by the embodiment of the application are characterized by acquiring a mode string set, constructing a skip list and a hash list according to suffix character blocks of all mode string matching windows, constructing a prefix list according to prefix character blocks of the matching windows, constructing a displacement list according to non-suffix character blocks of the matching windows, acquiring a combined character block, constructing a bloom filter according to the combined character block, constructing the combined character block by the suffix character block of the current matching window and the prefix character block of the next matching window, and performing mobile scanning matching on the standard text by adopting the hash list, the skip list, the prefix list, the displacement list and the bloom filter through the matching windows to be checked to obtain a text inspection result. Compared with the prior art, the technical scheme has the advantages that the displacement table is accurately constructed according to the non-suffix character blocks of the matching window, the matching jump processing can be dynamically carried out according to the data in the displacement table, so that the jump distance is increased, the jump amplitude is increased, the jump value can be preset by constructing the jump table according to the suffix character blocks of all pattern string matching windows, the time consumption of the matching stage is transferred to the preprocessing stage, the matching performance is accelerated, the bloom filter is accurately constructed based on the combined character blocks, the character blocks can be subjected to preliminary filtering processing, the condition that the accurate matching is not needed is avoided, the matching process is accelerated, and when the shortest pattern string is longer and the same prefix and suffix exist in each pattern string, the advantages of large jump distance and less accurate verification times are achieved, so that the text inspection efficiency is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a standard text review method according to an embodiment of the present application
FIG. 2 is a flowchart illustrating a method for constructing a displacement table according to non-suffix character blocks of a matching window according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a method for performing scan matching on standard text to obtain text examination results according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a structure of a hash table, a displacement table, a jump table, a prefix table and a bloom filter according to an embodiment of the present application;
FIG. 5 is a pseudo code schematic diagram of a scan matching stage according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a constructed displacement table, jump table and bloom filter according to an embodiment of the present application;
fig. 7 is a schematic diagram of a process of performing a scan matching stage on a standard text T according to an embodiment of the present application;
Fig. 8 is a schematic flow chart of a standard text censoring device according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It is noted that the following embodiments and features of the embodiments may be combined with each other without conflict, and that all other embodiments obtained by persons of ordinary skill in the art without creative efforts based on the embodiments in the present disclosure are within the scope of protection of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
At present, in the related art, a Wu-Manber multi-mode matching algorithm is adopted to detect specific words or phrases in a text in the process of examining a standard text, however, the scheme has some limitations, such as small jump distance and large hash conflict, and needs to take a long time when accurately examining, so that the text examination efficiency is reduced. The Wu-manner algorithm is mainly used for searching a plurality of modes (i.e. a plurality of sub-strings) in one text at the same time, and can complete a matching task within the time complexity of O (n), wherein n is the length of the text.
Based on the defects, compared with the prior art, the technical scheme can dynamically perform matching jump processing according to data in the displacement table by accurately constructing the displacement table according to the non-suffix character blocks of the matching window, thereby improving the jump distance, enlarging the jump amplitude, presetting the jump value by constructing the jump table according to the suffix character blocks of all pattern string matching windows, transferring the time consumption of the matching stage to the preprocessing stage, thereby accelerating the matching performance, performing preliminary filtering processing on the character blocks by accurately constructing the bloom filter based on the combined character blocks, avoiding the condition of no need of performing accurate matching, accelerating the matching process, and having the advantages of large jump distance and less accurate verification times when the shortest pattern string length is longer and the same prefix and suffix exist in each pattern string, thereby further improving the text checking efficiency.
For ease of understanding and explanation, the standard text inspection method, apparatus, device, and medium provided by the embodiments of the present application are described in detail below with reference to fig. 1 to 9.
Fig. 1 is a flow chart of a standard text review method according to an embodiment of the application, which may be performed by a computer device. As shown in fig. 1, the method includes:
s101, acquiring a pattern string set, wherein the pattern string set comprises a plurality of pattern strings.
It should be noted that, the above-mentioned pattern string set may be a chinese text to be checked, which may include a plurality of pattern strings, and the character blocks corresponding to each pattern string may be the same or different, and the chinese text may be in PDF format. For example, the pattern string set may be p= { disagreeable, wonderful, information age }, and "disagreeable", "wonderful, and" information age "are each pattern strings.
The mode string set may be obtained by importing from an external device, may be obtained from a database, or may be obtained from a blockchain, and the mode of obtaining the mode string set in the embodiment of the present application is not limited in any way.
S102, constructing a skip list and a hash list according to the suffix character blocks of all the pattern string matching windows, constructing a prefix list according to the prefix character blocks of the matching windows, and constructing a displacement list according to the non-suffix character blocks of the matching windows.
It will be appreciated that the above Jump Table may be referred to as a Jump Table, which is a data structure for accelerating the matching process, and provides different Jump distances based on a target character block H, which is typically the right-most character block of size B of the matching window. The Shift Table may be referred to as a Shift Table, which is a key part of the original Wu-Manber algorithm, to determine how many characters the window should slide forward when the target character block does not exactly match any one of the patterns. The Prefix Table may be referred to as a Prefix Table (Prefix Table) that is used to quickly determine if a complete pattern exists during the exact verification phase, and this Table is used to reduce unnecessary character comparisons once a possible match is found.
Wherein the prefix table and hash table are used in the exact match process after the bloom filter check passes to ensure that the standard text does contain a complete pattern. The displacement table and skip list table determine the distance that should be skipped when the target character block does not match.
In one embodiment, the application provides a specific implementation mode of a method for constructing a displacement table according to non-suffix character blocks of a matching window. Referring to fig. 2, the method includes:
s201, coding the pattern strings, determining the number of the pattern strings in the pattern string set and the shortest pattern string length value in the pattern strings, wherein the pattern strings comprise suffix character blocks and non-suffix character blocks.
S202, calculating a character block size value according to the number of the mode strings and the length value of the shortest mode string.
S203, calculating a non-suffix character block and a first jump value corresponding to the non-suffix character block based on the character block size value.
S204, correspondingly storing the non-suffix character blocks and the first jump values corresponding to the non-suffix character blocks to construct a displacement table, wherein the first jump values corresponding to the suffix character blocks in the displacement table are zero.
It can be understood that, because the PDF chinese text code has the characteristic that four characters form a chinese character, in the preprocessing stage, the values of the keys may be separated by four units when the SHIFT table is initialized, and the values of the keys should be separated by four units when the SHIFT hash table is initialized, so that each jump can be ensured to span a complete chinese character. The displacement table comprises character blocks and corresponding first jump values, wherein the first jump values are used for representing the jump distance. Wherein, if the character block of the current window does not appear in any pattern string, the default skip value is changed from m-b+1 to m-b+4, because each Chinese character occupies four characters, so the minimum effective skip distance should be four characters. In the matching phase, the first jump value after the exact check result is fixed to 1 in the Wu-Manber algorithm, however, the first jump value can be dynamically obtained and the minimum jump value is 4 due to the particularity of CID encoding.
Specifically, after a plurality of pattern strings are acquired, a preset encoding algorithm may be adopted to encode the pattern strings, so as to obtain an encoded result, then, according to the encoded result, a non-suffix character block and a suffix character block of a matching window are determined, a first jump value of the suffix character block is set to zero, the number of pattern strings in a pattern string set and the shortest pattern string length value in the pattern strings are determined, and then, a character block size value is calculated according to the following formula:
B=〔4×logΣ(2×m×r)〕;
Where B is the character block size value, Σ is the size of the pattern string set, m is the length value of the shortest pattern string, and r is the number of pattern strings.
After determining the character block size value, a non-suffix character block and a first skip value corresponding to the non-suffix character block can be calculated, wherein the first skip value of each character block is a multiple of 4, thereby increasing the skip amplitude. For example, if a certain character block does not appear in any pattern string, its jump value is m-b+4.
For example, assuming that the pattern string set is p= { unpredictably, wonderful, information age }, it may be subjected to encoding processing first, for example, CID encoding processing, and the result obtained after the encoding processing is p= {4e0d53ef601d8bae,59994e0d53ef8a00,4fe1606f65f64 e3}, the number of pattern strings is determined to be 3, the shortest pattern string length value is 16, and the character block size value may be determined according to the number of pattern strings and the shortest pattern string length value. Assuming that the character block size value b=8, the SHIFT displacement table constructed from the pattern string set is shown in the following table 1, and the table 1 includes each character block and the corresponding skip value:
TABLE 1
As can be seen from the above Table 1, the skip value corresponding to each character block is a multiple of 4, increasing the skip distance and thus increasing the skip amplitude.
In this embodiment, the first skip value is set to be a multiple of 4, so that the characteristics of the chinese text can be better utilized, the skip amplitude is greatly increased, the matching efficiency is improved, and unnecessary comparison operations are reduced.
In another embodiment of the present application, a specific implementation of constructing a jump table from suffix character blocks of all pattern string matching windows is provided. The method comprises the following steps:
When the suffix character block is in the first condition, the second jump value is m-B+4, the first condition is that the target character block does not appear in the matching window of other pattern strings, the suffix character block is the character block which is currently verified and is positioned at the rightmost end of the matching window and has the character block size of B, m is the shortest pattern string length value, B is the character block size value, when the target character block is in the second condition, the second jump value is m-B, the second condition is that the suffix character block only appears at the leftmost end of other pattern strings, when the suffix character block is in other conditions, the second jump value is 4, other conditions are conditions except the first condition and the second condition, and a jump table is constructed according to the second jump value and the suffix character block.
In the Wu-Manber algorithm, a character block with a size B at the rightmost end of the matching window may be used as a suffix character string and named as H. When shift h=0, the exact check is performed after checking the prefix table, and the matching window is slid only 1 distance to the right after the end. When the standard text size is large, the number of accurate verification times is large, and the sliding distance is only 1, so that the number of matching times is large, and the matching efficiency of the algorithm is low. To solve this problem, a jump table is constructed to improve the matching efficiency, in which the key is identical to the hash table, and the corresponding second jump value represents the jump value after the exact match. The skip list includes character blocks and corresponding second skip values, and may also be hash values including character blocks and corresponding second skip values.
In constructing the jump table, when the suffix character block H to be verified does not appear in the matching window of the other pattern string, the corresponding second jump value jump [ H ] =m-b+4 is to jump as large as possible those portions which are unlikely to contain the matching item, and when appearing only at the leftmost end of the other pattern string, the corresponding second jump value jump [ H ] =m-B is because if H is located only at the beginning of the pattern, we can safely skip the length of the entire pattern minus B, and otherwise the corresponding second jump value jump [ H ] =4. This value is an empirical value aimed at balancing the jump distance and avoiding the risk of excessive jumps.
In this embodiment, since the jump table including the second jump value is constructed in the preprocessing stage, the second jump value can be directly obtained from the jump table after the accurate verification is finished in the matching stage, so that unnecessary time consumption in the matching stage is transferred to the preprocessing stage, the matching speed is increased, and the matching efficiency is improved.
S103, acquiring a combined character block, and constructing a bloom filter according to the combined character block, wherein the combined character block consists of a suffix character block of a current matching window and a prefix character block of a next matching window.
It should be noted that the bloom filter may be referred to as a bloom filter, and the bloom filter is a probabilistic data structure that is used to quickly determine whether an element may exist in a set. Bloom filters (Bloom filters) are an efficient random data storage structure proposed by Bloom in 1970 that succinctly represent a set with a bitmap and quickly determine whether an element belongs to the set. The bloom filter is a data structure based on a multi-hash function mapping compression parameter space, and maps the calculated values of k hash functions into a bitmap, so that the space and time efficiency are high. In the application, the combined character block HB is preliminarily screened through the pre-constructed bloom filter so as to reduce unnecessary accurate verification.
The combined character block is a new character block with the length of B, which is formed by a suffix character block of a current matching window and a prefix character block of a next matching window, and is HB. In the preprocessing stage, the combined character blocks HB of all non-shortest pattern strings are added to the bloom filter. In the matching stage, when the possible occurrence of matching is detected, whether the combined character block HB is in a bloom filter or not is calculated before searching the prefix table and the hash table, if yes, accurate verification is carried out, otherwise, the method directly jumps according to a second jump value jump [ H ] in the jump table to search whether a matching item exists or not, and therefore a text examination result is obtained.
The bloom filter is used in the embodiment, so that the number of cases needing to be accurately checked can be greatly reduced, and the efficiency is improved. And a more flexible jump strategy is provided by means of the jump table, avoiding the situation where only one unit is moved at a time, especially if the target character block H is not present in any mode. In practical implementations, this improvement enables the Wu-manner algorithm to more efficiently handle matching problems for large-scale text and large numbers of keywords.
S104, performing mobile scanning matching on the standard text through a matching window to be checked by adopting a hash table, a jump table, a prefix table, a displacement table and a bloom filter to obtain a text examination result.
It should be noted that, the standard text is a template text which is edited in advance and accords with a preset standard, and the standard text can have an editing function, and the preset standard can include a national standard, an industry standard, a group standard, an enterprise standard, a local standard and a standardized guiding technical file. The text censoring results are used to characterize whether the matching window is in standard text.
In one embodiment, the present application provides a specific implementation manner of performing mobile scanning matching on a standard text through a matching window to be checked by adopting a hash table, a jump table, a prefix table, a displacement table and a bloom filter to obtain a text examination result, and please refer to fig. 3, the method includes:
s301, obtaining a target character block of a matching window to be checked and a first jump value of the target character block in a displacement table.
S302, judging whether the first jump value is larger than zero.
S303, when the first jump value is not greater than zero, constructing a target combined character block from the suffix character block of the matching window and the prefix character block of the next matching window.
S304, checking the target combined character block in the standard text by adopting a bloom filter to obtain a checking result, wherein the checking result is used for representing that the checking fails or is successful.
And S305, scanning and matching the standard text by adopting a prefix table, a hash table and a jump table according to the verification result to obtain a text examination result.
Specifically, after the pattern string set is obtained, the shortest pattern string length value can be determined, a jump table and a hash table are built by utilizing suffix character blocks of all pattern string matching windows, a prefix table is built according to prefix character blocks of the matching windows, a displacement table is built according to non-suffix character blocks of the matching windows, and finally a bloom filter is built according to the combined character block HB.
Optionally, after determining whether the first jump value is greater than zero, the method further includes:
And continuing to search and check from the standard text until the standard text is checked, and obtaining a text examination result.
In the process of adopting a prefix table, a hash table and a skip table to scan and match a standard text to obtain a text examination result, specifically, when the verification result represents that verification is successful, the target character block of a matching window is verified based on the prefix table and the hash table to obtain the text examination result, and when the verification result represents that verification is failed, the standard text is scanned and matched based on the skip table to obtain the text examination result.
The method for obtaining the text examination result comprises the following steps of:
and jumping the second jump value position, continuing to search and check from the standard text until the standard text is checked, and obtaining a text examination result.
Referring to fig. 4, fig. 4 is a schematic diagram of a hash table, a jump table, a prefix table, a shift table and a bloom filter constructed in a preprocessing stage provided in the embodiment of the present application, wherein the gray part represents a matching window with the length of m, the matching window comprises a target character block which is used for calculating a hash table, a jump table, a prefix table and a displacement table shifttable, index in each table represents a hash value calculated by a target character block, shift in shift table is a first skip value, skip in jump table is a second skip value, id represents a serial number of a pattern string, dark gray part in a matching window represents HB combined character block, and the bloom filter is essentially a bit array, and bit values comprise 0 and 1.
Referring to the pseudo-code illustration of figure 5,
Algorithm 1 scanning text matching algorithm
After constructing hash table, jump table, prefix table, shift table and Bloom filter, executing scanning matching stage by using the generated four hash tables and Bloom filter. And (3) always maintaining a matching window with the length m to move rightwards in the matching process, setting an initial index as m-1 on the assumption that a character block with the suffix length B of the current matching window is H, storing the text length of a standard text in a variable text_len, and circularly executing the steps of a, if an index is smaller than the text length text_len, repeatedly executing the steps of a, extracting the character block from the matching window to be checked, namely acquiring a target character block_hash with the rightmost length B of the current window, namely text [ index-B: index ]. And c, performing matching by using a bloom filter, performing bloom filter verification according to a target combined character block HB with the length of B, which consists of a suffix character block of a matching window and a prefix character block of a next matching window, and performing accurate verification based on a prefix table prefix and a hash table if the verification passes, and if the verification fails, obtaining a second jump value jump [ H ] of the target character block in the jump table, jumping the position of the second jump value jump [ H ], and continuing to search and verify from a standard text until the standard text is checked, thereby obtaining a text checking result.
In the process of performing accurate verification based on the prefix table prefix and the hash table, prefix hash value prefix_hash of the current matching window can be used for performing accurate verification by using the hash table and the prefix table, possible matching items are found, if the matching items are found, the index and ID of the matching items are output, index is updated to index+jump [ block_hash ], and then circulation is continued. When the index reaches the text length of the standard text, the standard text is verified, the circulation is stopped, and the text examination result is obtained.
In this embodiment, the optimal skip policy can be pre-calculated in the preprocessing stage, so that unnecessary character comparison times are reduced in the actual searching process, and thus time overhead in the actual matching process is significantly reduced, especially in the case of processing a large volume of text or having a large number of keywords. The method effectively transfers some work originally belonging to the matching stage to the preprocessing stage, thereby improving the overall performance.
Illustratively, assuming that the pattern string set is p= { new era, zehnder, geometric subject }, CID encoding processing is performed on each pattern string in the pattern string set, resulting in encoded results p= {079e85h679c1, 3f89463d8a09237f, 293b463d8a091c6373a2}. From this, the shortest pattern string length value m=12, the character block size value b=8, and the hash table, jump table, prefix table, shift table and Bloom filter are obtained by preprocessing calculation. Fig. 6 is a schematic structural diagram of the constructed shift table, jump table and Bloom filter.
Referring to fig. 7, after four hash tables and Bloom filters are constructed, a scan matching stage is performed using the generated four hash tables and Bloom filters. The method comprises the steps of firstly obtaining a first jump value corresponding to a target character block 463d8a09 of a matching window in a displacement table, when the first jump value is 0, not immediately carrying out accurate verification, constructing a target combined character block 8a09465c with the length of B by a suffix character block of the current matching window and a prefix character block of the next matching window, then verifying the target combined character block 8a09465c in a standard text by adopting a bloom filter to obtain a verification result, and when the verification is failed, moving the standard text backwards by a second jump value jump [463d8a09] =4 distances, and continuing to check the standard text until the standard text verification result.
The standard text examination method comprises the steps of obtaining a pattern string set, constructing a skip list and a hash list according to suffix character blocks of all pattern string matching windows, constructing a prefix list according to prefix character blocks of the matching windows, constructing a displacement list according to non-suffix character blocks of the matching windows, obtaining a combined character block, constructing a bloom filter according to the combined character block, wherein the combined character block consists of the suffix character blocks of the current matching windows and the prefix character blocks of the next matching windows, and performing mobile scanning matching on the standard text through the matching windows to be checked by adopting the hash list, the skip list, the prefix list, the displacement list and the bloom filter to obtain a text examination result. Compared with the prior art, the technical scheme has the advantages that the displacement table is accurately constructed according to the non-suffix character blocks of the matching window, the matching jump processing can be dynamically carried out according to the data in the displacement table, so that the jump distance is increased, the jump amplitude is increased, the jump value can be preset by constructing the jump table according to the suffix character blocks of all pattern string matching windows, the time consumption of the matching stage is transferred to the preprocessing stage, the matching performance is accelerated, the bloom filter is accurately constructed based on the combined character blocks, the character blocks can be subjected to preliminary filtering processing, the condition that the accurate matching is not needed is avoided, the matching process is accelerated, and when the shortest pattern string is longer and the same prefix and suffix exist in each pattern string, the advantages of large jump distance and less accurate verification times are achieved, so that the text inspection efficiency is further improved.
On the other hand, fig. 8 is a schematic structural diagram of a standard text inspection device according to an embodiment of the present application. The device may be a device in a terminal or a server, as shown in fig. 8, including:
A receiving module 710, configured to obtain a pattern string set, where the pattern string set includes a plurality of pattern strings;
a first construction module 720, configured to construct a skip table and a hash table according to the suffix character blocks of all the pattern string matching windows, construct a prefix table according to the prefix character blocks of the matching windows, and construct a displacement table according to the non-suffix character blocks of the matching windows;
a second construction module 730, configured to obtain a combined character block, and construct a bloom filter according to the combined character block, where the combined character block is composed of a suffix character block of a current matching window and a prefix character block of a next matching window;
And the examination module 740 is used for carrying out mobile scanning matching on the standard text through a matching window to be checked by adopting the hash table, the jump table, the prefix table, the displacement table and the bloom filter to obtain a text examination result.
Optionally, the first construction module 720 is specifically configured to:
the method comprises the steps of carrying out coding processing on pattern strings, and determining the number of pattern strings in a pattern string set and the length value of the shortest pattern string in the pattern strings, wherein the pattern strings comprise suffix character blocks and non-suffix character blocks;
calculating a character block size value according to the number of the mode strings and the length value of the shortest mode string;
calculating a non-suffix character block and a first skip value corresponding to the non-suffix character block based on the character block size value;
And correspondingly storing the non-suffix character blocks and the first jump values corresponding to the non-suffix character blocks to construct a displacement table, wherein the first jump values corresponding to the suffix character blocks in the displacement table are zero.
Optionally, the second construction module 730 is specifically configured to:
When the suffix character block is in the first condition, the second jump value is m-B+4, wherein the first condition is that the suffix character block does not appear in the matching window of other mode strings, the suffix character block is the character block which is currently verified and is positioned at the rightmost end of the matching window and has the character block size of B, m is the length value of the shortest mode string, and B is the character block size value;
when the suffix character block is in the second condition, the second jump value is m-B, wherein the second condition is that the suffix character block only appears at the leftmost end of other mode strings;
When the suffix character block is in other conditions, the second jump value is 4, wherein the other conditions are the conditions except the first condition and the second condition;
And constructing a jump table according to the second jump value and the suffix character block.
Optionally, the censoring module 740 is specifically configured to:
Acquiring a target character block of a matching window to be checked and a first jump value of the target character block in a displacement table;
Judging whether the first jump value is larger than zero;
When the first jump value is not greater than zero, constructing a target combined character block by the suffix character block of the matching window and the prefix character block of the next matching window;
the bloom filter is adopted to check the target combined character block in the standard text to obtain a check result, wherein the check result is used for representing that the check fails or the check is successful;
and according to the verification result, adopting a prefix table, a hash table and a jump table to scan and match the standard text, and obtaining a text examination result.
Optionally, the censoring module 740 is further configured to:
skipping the first jump value when the first jump value is greater than zero;
and continuing to search and check from the standard text until the standard text is checked, and obtaining a text examination result.
Optionally, the censoring module 740 is further configured to:
when the verification result represents successful verification, verifying the target character block of the matching window based on the prefix table and the hash table to obtain a text examination result;
and when the verification result represents that the verification fails, scanning and matching are carried out on the standard text based on the jump table, so that a text examination result is obtained.
Optionally, the censoring module 740 is further configured to:
Acquiring a second jump value of the target character block in the jump table;
And skipping the second skip value, continuing to search and check from the standard text until the standard text is checked, and obtaining a text examination result.
It may be understood that the functions of each functional module of the standard text inspection device of the present embodiment may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not repeated herein.
On the other hand, referring to fig. 9, a computer device provided in an embodiment of the present application includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the standard text inspection method as described above when executing the program.
The computer device includes a Central Processing Unit (CPU) 401, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In the RAM403, various programs and data required for the system operation are also stored. The CPU401, ROM402, and RAM403 are connected to each other by a bus 404. An input/output (I/O) interface 406 is also connected to bus 404.
Connected to the I/O interface 405 are an input section 406 including a keyboard, a mouse, and the like, an output section 407 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like, a storage section 408 including a hard disk, and the like, and a communication section 409 including a network interface card such as a LAN card, a modem, and the like. The communication section 409 performs communication processing via a network such as the internet. The drive 410 is also connected to the I/O interface 406 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 410 as needed, so that a computer program read therefrom is installed into the storage section 408 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to fig. 1 may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method of fig. 1. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 409 and/or installed from the removable medium 411.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules involved in the embodiments of the present application may be implemented in software or in hardware. The described units or modules may also be provided in the processor. For example, a processor may be described that includes an acquisition module, a first build module, a second build module, and an inspection module. The names of these units or modules do not in some way limit the units or modules themselves, and the acquisition module may also be described as "for acquiring a pattern string set including a plurality of pattern strings".
As another aspect, the present application also provides a computer-readable storage medium that may be included in the electronic device described in the above embodiment, or may exist alone without being incorporated in the electronic device. The computer-readable storage medium stores one or more programs that, when used by one or more processors, perform the standard text inspection method described in the present application:
The method comprises the steps of obtaining a pattern string set, wherein the pattern string set comprises a plurality of pattern strings;
Constructing a skip list and a hash list according to the suffix character blocks of all the pattern string matching windows, constructing a prefix list according to the prefix character blocks of the matching windows, and constructing a displacement list according to the non-suffix character blocks of the matching windows;
The method comprises the steps of obtaining a combined character block, and constructing a bloom filter according to the combined character block, wherein the combined character block consists of a suffix character block of a current matching window and a prefix character block of a next matching window;
and carrying out mobile scanning matching on the standard text through a matching window to be checked by adopting the hash table, the jump table, the prefix table, the displacement table and the bloom filter to obtain a text examination result.
In summary, the standard text examination method, device, equipment and medium provided by the embodiment of the application are characterized by acquiring a pattern string set, wherein the pattern string set comprises a plurality of pattern strings, constructing a skip list and a hash list according to suffix character blocks of all pattern string matching windows, constructing a prefix list according to prefix character blocks of the matching windows, constructing a displacement list according to non-suffix character blocks of the matching windows, acquiring a combined character block, constructing a bloom filter according to the combined character block, constructing the combined character block by the suffix character blocks of the current matching window and the prefix character blocks of the next matching window, and adopting the hash list, the skip list, the prefix list, the displacement list and the bloom filter to perform mobile scanning matching on the standard text through the matching windows to be checked to obtain a text examination result. Compared with the prior art, the technical scheme has the advantages that the displacement table is accurately constructed according to the non-suffix character blocks of the matching window, the matching jump processing can be dynamically carried out according to the data in the displacement table, so that the jump distance is increased, the jump amplitude is increased, the jump value can be preset by constructing the jump table according to the suffix character blocks of all pattern string matching windows, the time consumption of the matching stage is transferred to the preprocessing stage, the matching performance is accelerated, the bloom filter is accurately constructed based on the combined character blocks, the character blocks can be subjected to preliminary filtering processing, the condition that the accurate matching is not needed is avoided, the matching process is accelerated, and when the shortest pattern string is longer and the same prefix and suffix exist in each pattern string, the advantages of large jump distance and less accurate verification times are achieved, so that the text inspection efficiency is further improved.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc. From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware.
The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the application referred to in the present application is not limited to the specific combinations of the technical features described above, but also covers other technical features formed by any combination of the technical features described above or their equivalents without departing from the inventive concept. Such as the above-mentioned features and the technical features disclosed in the present application (but not limited to) having similar functions are replaced with each other.

Claims (10)

1. A standard text review method, the standard text review method comprising:
The method comprises the steps of obtaining a pattern string set, wherein the pattern string set comprises a plurality of pattern strings;
Constructing a skip list and a hash list according to the suffix character blocks of all the pattern string matching windows, constructing a prefix list according to the prefix character blocks of the matching windows, and constructing a displacement list according to the non-suffix character blocks of the matching windows;
The method comprises the steps of obtaining a combined character block, and constructing a bloom filter according to the combined character block, wherein the combined character block consists of a suffix character block of a current matching window and a prefix character block of a next matching window;
and carrying out mobile scanning matching on the standard text through a matching window to be checked by adopting the hash table, the jump table, the prefix table, the displacement table and the bloom filter to obtain a text examination result.
2. The method of claim 1, wherein constructing a displacement table from non-suffix character blocks of the matching window comprises:
The mode strings are coded, and the number of the mode strings in the mode string set and the length value of the shortest mode string in the mode strings are determined, wherein the mode strings comprise suffix character blocks and non-suffix character blocks;
calculating a character block size value according to the number of the mode strings and the length value of the shortest mode string;
calculating the non-suffix character block and a first skip value corresponding to the non-suffix character block based on the character block size value;
and correspondingly storing the non-suffix character blocks and the first jump values corresponding to the non-suffix character blocks to construct the displacement table, wherein the first jump values corresponding to the suffix character blocks in the displacement table are zero.
3. The method of claim 2, wherein constructing a jump table from suffix character blocks of all of the pattern string matching windows comprises:
When the suffix character block is in a first condition, the second jump value is m-B+4, wherein the first condition is that the suffix character block does not appear in a matching window of other mode strings, the suffix character block is a character block which is currently verified and is positioned at the rightmost end of the matching window and has a character block size of B, m is the length value of the shortest mode string, and B is the size value of the character block;
when the suffix character block is in a second condition, the second jump value is m-B, wherein the second condition is that the suffix character block only appears at the leftmost end of other mode strings;
When the suffix character block is in other cases, the second jump value is 4, wherein the other cases are the cases except the first case and the second case;
And constructing the jump table according to the second jump value and the suffix character block.
4. The method of claim 1, wherein performing mobile scan matching on standard text through a matching window to be checked by using the hash table, the jump table, the prefix table, the displacement table and the bloom filter to obtain a text examination result, comprising:
acquiring a target character block of the matching window to be verified and a first jump value of the target character block in the displacement table;
judging whether the first jump value is larger than zero or not;
When the first jump value is not greater than zero, constructing a target combined character block from the suffix character block of the matching window and the prefix character block of the next matching window;
The bloom filter is adopted to check the target combined character block in the standard text to obtain a check result, wherein the check result is used for representing check failure or check success;
and according to the verification result, adopting the prefix table, the hash table and the jump table to scan and match the standard text, and obtaining the text examination result.
5. The method of claim 4, wherein after determining whether the first jump value is greater than zero, the method further comprises:
Skipping the first jump value positions when the first jump value is greater than zero;
And continuing to search and check from the standard text until the standard text is checked, and obtaining the text examination result.
6. The method of claim 4, wherein scanning and matching the standard text using the prefix table, the hash table, and the skip table according to the verification result to obtain the text review result comprises:
when the verification result represents that verification is successful, verifying the target character block of the matching window based on the prefix table and the hash table to obtain the text examination result;
and when the verification result represents verification failure, scanning and matching the standard text based on the jump table to obtain the text examination result.
7. The method of claim 6, wherein performing scan matching on the standard text based on the jump table to obtain the text review result comprises:
acquiring a second jump value of the target character block in the jump table;
And skipping the positions of the second skipping value, and continuing to search and check from the standard text until the standard text is checked, so as to obtain the text examination result.
8. A standard text inspection device, the standard text inspection device comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a mode string set, and the mode string set comprises a plurality of mode strings;
the first construction module is used for constructing a skip list and a hash list according to the suffix character blocks of all the pattern string matching windows, constructing a prefix list according to the prefix character blocks of the matching windows, and constructing a displacement list according to the non-suffix character blocks of the matching windows;
the second construction module is used for acquiring a combined character block and constructing a bloom filter according to the combined character block, wherein the combined character block consists of a suffix character block of a current matching window and a prefix character block of a next matching window;
And the examination module is used for carrying out mobile scanning matching on the standard text through a matching window to be checked by adopting the hash table, the jump table, the prefix table, the displacement table and the bloom filter to obtain a text examination result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor for implementing the method according to any one of claims 1-7 when the program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program is adapted to implement the method according to any one of claims 1-7.
CN202411355837.8A 2024-09-26 2024-09-26 Standard text review methods, devices, equipment and media Pending CN119336866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411355837.8A CN119336866A (en) 2024-09-26 2024-09-26 Standard text review methods, devices, equipment and media

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411355837.8A CN119336866A (en) 2024-09-26 2024-09-26 Standard text review methods, devices, equipment and media

Publications (1)

Publication Number Publication Date
CN119336866A true CN119336866A (en) 2025-01-21

Family

ID=94269320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411355837.8A Pending CN119336866A (en) 2024-09-26 2024-09-26 Standard text review methods, devices, equipment and media

Country Status (1)

Country Link
CN (1) CN119336866A (en)

Similar Documents

Publication Publication Date Title
US6904430B1 (en) Method and system for efficiently identifying differences between large files
CN101976253B (en) Chinese variation text matching recognition method
CN113673228B (en) Text error correction method, apparatus, computer storage medium and computer program product
US10903851B2 (en) Page filtering via compression dictionary filtering
WO2012026197A1 (en) Document analysis system, document analysis method, document analysis program and recording medium
CN110210043A (en) Text translation method and device, electronic equipment and readable storage medium
US9330323B2 (en) Redigitization system and service
CN113051894A (en) Text error correction method and device
CN110147558B (en) Method and device for processing translation corpus
CN117112754A (en) Information processing method, information processing device, electronic equipment and storage medium
CN109271526A (en) Method for text detection, device, electronic equipment and computer readable storage medium
CN114742037A (en) Text error correction method and device, computer equipment and storage medium
TW202407602A (en) Store duplicate removal processing method, apparatus and device, and storage medium
US12243337B2 (en) Text extraction using optical character recognition
CN119336866A (en) Standard text review methods, devices, equipment and media
EP0847026A2 (en) Pattern encoding method
CN118245618A (en) Combined word matching detection method, system, electronic equipment and storage medium
CN117253239A (en) End-to-end document image translation method and device integrating layout information
CN113595557B (en) Data processing method and device
Beck et al. Transforming scanned zbMATH volumes to LaTeX: planning the next level digitisation
Dias et al. Optimization of image processing algorithms for character recognition in cultural typewritten documents
CN115858797A (en) Method and system for generating Chinese near-meaning words based on OCR technology
CN116070644A (en) Auxiliary translation method, device, electronic equipment and storage medium
CN114329024A (en) A kind of iconfont icon search method and system
WO2022141855A1 (en) Text regularization method and apparatus, and electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination