CN113435186A - Chinese text error correction system, method, device and computer readable storage medium - Google Patents
- Publication number: CN113435186A
- Application number: CN202110675560.7A
- Authority: CN (China)
- Legal status: Granted (an assumption, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/232 — Orthographic correction, e.g. spell checking or vowelisation
- G06F40/211 — Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/216 — Parsing using statistical methods
- G06F40/279 / G06F40/289 — Recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30 — Semantic analysis
- G06N20/00 — Machine learning

(All G06F40/2xx codes fall under G—Physics, G06F—Electric digital data processing, G06F40/00—Handling natural language data, G06F40/20—Natural language analysis; G06N20/00 falls under G06N—Computing arrangements based on specific computational models.)
Abstract
The invention relates to a Chinese text error correction system, method, and device based on a machine learning model, and a computer-readable storage medium. The system comprises a Chinese text pre-training module, a Chinese text input module, a Chinese spell-checking module, a Chinese spelling correction module, a semantic correction module, and a grammar evaluation module. The method attends to the coherence of adjacent sentences, avoiding cases where an individual phrase is correct in isolation but the wrong homophone phrase is selected, and ensuring that the semantics remain fluent and unbiased when the phrases of a whole sentence are read together.
Description
Technical Field
The invention relates to the technical field of computer text processing, and in particular to a Chinese text error correction system, method, and device based on a machine learning model, and a computer-readable storage medium.
Background
Although Chinese is the most widely spoken language in the world, its use in machine learning still faces many limitations: Chinese pronunciation, character forms, and word order are complex, so there is strong demand for Chinese spell checking and error correction wherever text is entered manually or recognized by machine.
Patent CN111639489A checks and corrects various errors in Chinese text with several machine learning methods, turning disfluent text into fluent, readable Chinese: it locates erroneous characters via perplexity, selects the correct modification using a confusion set and language-model scoring, and finally returns a correct Chinese expression. The method supports multi-threaded processing, so many sentences can be corrected concurrently with high efficiency. However, it tends to ignore the coherence of adjacent sentences, so that an individual phrase is correct while the wrong homophone phrase is selected, and errors arise when the phrases of a whole sentence are combined, producing semantic deviation. Moreover, similar phrases occurring earlier and later in the same text may be corrected in different ways, introducing inconsistency between contexts and further semantic deviation.
Disclosure of Invention
In view of the above, the present invention provides a Chinese text error correction system, method, device, and computer-readable storage medium based on a machine learning model, so as to solve the technical problems in the existing express-delivery industry of time delay, low delivery efficiency, and economic loss caused by delivery errors due to incomplete or non-standard consumer-supplied information.
To solve the above problems, the present invention provides a Chinese text error correction system based on a machine learning model. The system comprises: a Chinese text pre-training module for pre-training on Chinese text and obtaining its perplexity, confusion set, language model, and semantic model; a Chinese text input module for preprocessing the input text, deleting unused punctuation and abnormally long whitespace, and converting between Chinese and English punctuation and between encoding formats; a Chinese spell-checking module for automatically returning the position of an incorrect character when a character in the Chinese text is misspelled; a Chinese spelling correction module for locating the positions of erroneous characters via the pre-training and spell-checking modules, substituting candidate words one by one, computing a fluency result with the language model, and outputting the best spelling-corrected text as a first corrected text; a semantic correction module for computing the semantic fluency of adjacent sentences of the first corrected text with the semantic model from the pre-training module, reselecting candidate words to substitute characters one by one so that the semantics of adjacent sentences are unified, and outputting several semantically unified corrected texts as second corrected texts; and a grammar evaluation module for taking the second corrected texts as input, evaluating the total semantic score of each, sorting the scores in descending order, and outputting the highest-scoring second corrected text as the final corrected text.
A Chinese text error correction method uses the above Chinese text error correction system and comprises the following steps. S1: pre-train on Chinese text to obtain its perplexity, confusion set, language model, and semantic model. S2: preprocess the input text, delete unused punctuation and abnormally long whitespace, and convert between Chinese and English punctuation and between encoding formats. S3: treat each character or punctuation mark as a position, process the text character by character, and have the system return the position of an incorrect character when the Chinese text contains a spelling error. S4: after locating all suspected errors via error detection, substitute candidate words one by one, obtain the fluency of the resulting candidate short-text set from the language model, and finally select and output the best spelling-corrected text as the first corrected text. S5: compute the semantic fluency of adjacent sentences of the first corrected text with the semantic model in the Chinese text pre-training module, reselect candidate words to substitute characters one by one so that the semantics of adjacent sentences are unified, and output several semantically unified corrected texts as second corrected texts. S6: evaluate the total semantic score of every second corrected text, sort the scores in descending order, and output the highest-scoring second corrected text as the final corrected text.
Further, step S5 specifically includes: S51: for each error-correction candidate word in the first corrected text, computing the score of each candidate with the semantic model; S52: accumulating the score of each word with those of its adjacent words to obtain the total semantic-fluency score of the error-correction candidate; S53: sorting the total semantic-fluency scores of all first corrected texts in descending order and outputting the highest-scoring first corrected text as the second corrected text.
Further, step S6 specifically includes: S61: computing the score of the semantics of each sentence in the second corrected text with the semantic model; S62: multiplying the occurrence probability of each sentence in the second corrected text by those of its adjacent sentences to obtain the total semantic-fluency score of the second corrected text; S63: sorting the total semantic scores of all second corrected texts in descending order and outputting the highest-scoring second corrected text as the final corrected text.
Further, the semantic model scores the semantics within a sentence as follows. The semantic model is formed by classifying and processing homophone words, confusable-sound words, reversed word order, incomplete words, visually similar characters, sensitive words, common-sense errors, and redundant characters, and is given by P(S) ≈ P(w1) * P(w2|w1) * P(w3|w2) * ... * P(wn|wn-1), with P(wi|wi-1) = count(wi, wi-1) / count(wi-1); where p(w1…wn) is the probability of the sentence, n is the sentence length, P(wi|wi-1) is the conditional probability of the two words co-occurring, w denotes a word, count(wi-1) is the number of occurrences of the word wi-1 in the corpus, and count(wi, wi-1) is the number of times the words wi and wi-1 appear together. The occurrence probability p(w1…wn) of each sentence in the second corrected text is computed, and the probabilities of all sentences in the second corrected text are multiplied together to obtain the total semantic-fluency score P(S) of the second corrected text.
Further, step S4 specifically includes: S41: obtaining a candidate set of replacement characters for suspected erroneous characters, i.e., after locating all suspected errors via error detection, retrieving from the confusion set the similar-sound, similar-shape, and commonly misrecognized candidate words for every suspect character; S42: substituting candidate characters at each position, i.e., for each replaceable character, enumerating every character in its confusion set in place of the original, yielding a candidate set of short texts in which the suspect characters are replaced; S43: ranking the candidate short texts by fluency with the n-gram language model of S13 and selecting the highest-scoring sentence as the final candidate text.
Further, S43 specifically includes: S431: taking the word as the minimal unit of computation and segmenting the text with an existing Chinese word segmentation model; S432: computing the occurrence frequency of common words in a given corpus with a given language model to obtain the fluency; S433: replacing the original text if the text's fluency exceeds a predefined threshold; S434: if the fluency of the final candidate text is below the predefined threshold, judging the original text correct and retaining it.
Further, S3 specifically includes: S31: removing special symbols from the training corpus and replacing invalid characters in the text, invalid characters being any characters other than Chinese, English, digits, and common punctuation; S32: splitting long text into short texts at specified punctuation marks and spaces; S33: returning the positions of suspected incorrect characters, i.e., computing the likelihood of each character from the perplexity and the word occurrence probability and, if a character's likelihood is below the average for the text, judging it a suspected typo and returning its position in the text.
A Chinese text error correction device comprises a memory, a processor, and a Chinese text error correction program stored in the memory and executable on the processor; when executed by the processor, the program implements the steps of the Chinese text error correction method.
A computer-readable storage medium stores a Chinese text error correction program which, when executed by a processor, implements the steps of the Chinese text error correction method.
The machine-learning-based Chinese text error correction system, method, device, and computer-readable storage medium can use multi-threaded processing, handling many text sentences concurrently with high correction efficiency. They check and correct various errors in Chinese text with several machine learning methods, turn disfluent text into fluent, readable Chinese, locate erroneous characters via perplexity, select the correct modification using a confusion set and a language model, and finally return a correct Chinese expression. In particular, this approach attends to the coherence of adjacent sentences, avoiding cases where an individual phrase is correct but the wrong homophone phrase is selected, and ensuring the semantics remain fluent and unbiased when the phrases of a whole sentence are read together.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
Fig. 1 is a block diagram of a Chinese text error correction system according to an embodiment of the present invention.
Fig. 2 is a flowchart of a Chinese text error correction method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As shown in Fig. 1, the present invention provides a Chinese text error correction system 10 based on a machine learning model. The system 10 includes a Chinese text pre-training module 1, a Chinese text input module 2, a Chinese spell-checking module 3, a Chinese spelling correction module 4, a semantic correction module 5, and a grammar evaluation module 6. The Chinese text pre-training module 1 pre-trains on Chinese text and obtains its perplexity, confusion set, language model, and semantic model. The Chinese text input module 2 preprocesses the input text, deletes unused punctuation and abnormally long whitespace, and converts between Chinese and English punctuation and between encoding formats. The Chinese spell-checking module 3 automatically returns the position of an incorrect character when a character in the Chinese text is misspelled. The Chinese spelling correction module 4 locates erroneous characters via the pre-training and spell-checking modules, substitutes candidate words one by one, computes a fluency result with the language model, and outputs the best spelling-corrected text as a first corrected text. The semantic correction module 5 computes the semantic fluency of adjacent sentences of the first corrected text with the semantic model from the pre-training module, reselects candidate words to substitute characters one by one so that the semantics of adjacent sentences are unified, and outputs several semantically unified corrected texts as second corrected texts. The grammar evaluation module 6 takes the second corrected texts as input, evaluates the total semantic score of each, sorts the scores in descending order, and outputs the highest-scoring second corrected text as the final corrected text.
The implementation principle is as follows: a deep-learning BERT model and a language model judge whether a sentence contains errors and, if so, their types; the language model then detects the positions of mistyped characters; and the mistyped characters are corrected using pinyin sound-similarity features, stroke/wubi edit-distance features, and language-model perplexity features.
In the Chinese text correction task, common error types include: (1) homophone words (e.g., "with eyes" vs. "with glasses"); (2) confusable-sound words (e.g., "wandering girl" vs. "cowherd girl"); (3) reversed word order (e.g., "Wudi Allen" vs. "Allen Wudi"); (4) incomplete words (e.g., "if love happy" vs. "if love is happy"); (5) visually similar characters (e.g., "jowar" vs. "sorghum"); (6) sensitive words; (7) common-sense errors; (8) redundant characters (e.g., "the Dianzhuang Third Ring auxiliary road has a a sharp turn" vs. "the Dianzhuang Third Ring auxiliary road has a sharp turn").
A semantic model formed in the Chinese text pre-training module 1 scores the semantics of the full text. This approach attends to the coherence of adjacent sentences, avoids cases where an individual phrase is correct but the wrong homophone phrase is selected, and keeps the semantics fluent and unbiased when the phrases of a whole sentence are read together.
As shown in Fig. 2, the present invention provides a Chinese text error correction method using the above Chinese text error correction system, comprising the following steps S1-S6.
S1: pre-train on Chinese text to obtain its perplexity, confusion set, language model, and semantic model.
S2: preprocess the input text, delete abnormal punctuation and abnormally long whitespace, and convert between Chinese and English punctuation and between encoding formats.
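The cleanup in S2 can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the punctuation table, the direction of punctuation conversion, and the function names are assumptions.

```python
import re
import unicodedata

# Hypothetical mapping from common full-width Chinese punctuation to
# half-width forms; which direction a real system converts is a design choice.
FULL_TO_HALF = {"，": ",", "。": ".", "！": "!", "？": "?", "：": ":", "；": ";"}

def preprocess(text: str) -> str:
    # Unify the encoding to a canonical Unicode form; NFKC also folds
    # full-width Latin letters and digits to half-width.
    text = unicodedata.normalize("NFKC", text)
    # Convert remaining full-width punctuation.
    for full, half in FULL_TO_HALF.items():
        text = text.replace(full, half)
    # Collapse abnormally long runs of whitespace into a single space.
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()
```

A real input module would additionally drop characters outside the Chinese/English/digit/common-punctuation set, as S31 describes.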
S3: locate the positions of incorrect characters — treat each character or punctuation mark as a position, process the text character by character, and have the system return the position of an incorrect character when the Chinese text contains a spelling error.
In this embodiment, S3 specifically includes: S31: removing special symbols from the training corpus and replacing invalid characters in the text, invalid characters being any characters other than Chinese, English, digits, and common punctuation; S32: splitting long text into short texts at specified punctuation marks and spaces; S33: returning the positions of suspected incorrect characters — computing the likelihood of each character from the perplexity and the word occurrence probability and, if a character's likelihood is below the average for the text, judging it a suspected typo and returning its position in the text.
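The below-average-likelihood test in S33 can be sketched with a toy unigram model. This is an assumption-laden illustration: a real system would score with language-model perplexity rather than raw character frequencies, and `train_char_probs`, `suspect_positions`, and the floor value are invented names.

```python
from collections import Counter

def train_char_probs(corpus: str) -> dict:
    """Estimate per-character probabilities from a (toy) corpus."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}

def suspect_positions(text: str, probs: dict, floor: float = 1e-6) -> list:
    """Return positions whose likelihood is below the text's average (S33)."""
    scores = [probs.get(ch, floor) for ch in text]  # unseen chars get a floor
    avg = sum(scores) / len(scores)
    return [i for i, s in enumerate(scores) if s < avg]
```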
S4: output a first corrected text — after locating all suspected errors via error detection, substitute candidate words one by one, obtain the fluency of the resulting candidate short-text set from the language model, and finally select and output the best spelling-corrected text as the first corrected text.
In this embodiment, step S4 specifically includes: S41: obtaining a candidate set of replacement characters for suspected erroneous characters — after locating all suspected errors via error detection, retrieving from the confusion set the similar-sound, similar-shape, and commonly misrecognized candidate words for every suspect character; S42: substituting candidate characters at each position — for each replaceable character, enumerating every character in its confusion set in place of the original, yielding a candidate set of short texts in which the suspect characters are replaced; S43: ranking the candidate short texts by fluency with the n-gram language model of S13 and selecting the highest-scoring sentence as the final candidate text.
In this embodiment, S43 specifically includes: S431: taking the word as the minimal unit of computation and segmenting the text with an existing Chinese word segmentation model; S432: computing the occurrence frequency of common words in a given corpus with a given language model to obtain the fluency; S433: replacing the original text if the text's fluency exceeds a predefined threshold; S434: if the fluency of the final candidate text is below the predefined threshold, judging the original text correct and retaining it.
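The substitute-score-threshold loop of S41-S434 can be sketched as below. The fluency scorer is passed in as a plain function here; in the patent it would be the n-gram language model, and `correct_position` and its parameters are illustrative names, not the patent's API.

```python
def correct_position(text, pos, confusion_set, fluency, threshold):
    """Try each confusion-set candidate at one position (S42), keep the
    most fluent variant (S43), and retain the original unless the best
    variant clears the predefined threshold (S433/S434)."""
    best, best_score = text, fluency(text)
    for cand in confusion_set:
        variant = text[:pos] + cand + text[pos + 1:]
        score = fluency(variant)
        if score > best_score:
            best, best_score = variant, score
    # S433/S434: replace only if fluency exceeds the threshold.
    return best if best_score > threshold else text
```

For example, with a scorer that rates "cat" highly, the suspect position in "cbt" is corrected; with no good candidate, the original is retained.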
S5: form a second corrected text according to semantic fluency — compute the semantic fluency of adjacent sentences of the first corrected text with the semantic model in the Chinese text pre-training module, reselect candidate words to substitute characters one by one so that the semantics of adjacent sentences are unified, and output several semantically unified corrected texts as second corrected texts.
In this embodiment, step S5 specifically includes: S51: for each error-correction candidate word in the first corrected text, computing the score of each candidate with the semantic model; S52: accumulating the score of each word with those of its adjacent words to obtain the total semantic-fluency score of the error-correction candidate; S53: sorting the total semantic-fluency scores of all first corrected texts in descending order and outputting the highest-scoring first corrected text as the second corrected text.
S6: output the final corrected text — evaluate the total semantic score of every second corrected text, sort the scores in descending order, and output the highest-scoring second corrected text as the final corrected text.
In this embodiment, step S6 specifically includes: S61: computing the score of the semantics of each sentence in the second corrected text with the semantic model; S62: multiplying the occurrence probability of each sentence in the second corrected text by those of its adjacent sentences to obtain the total semantic-fluency score of the second corrected text; S63: sorting the total semantic scores of all second corrected texts in descending order and outputting the highest-scoring second corrected text as the final corrected text.
In this embodiment, the semantic model scores the semantics within a sentence as follows. The semantic model is formed by classifying and processing homophone words, confusable-sound words, reversed word order, incomplete words, visually similar characters, sensitive words, common-sense errors, and redundant characters, and is given by P(S) ≈ P(w1) * P(w2|w1) * P(w3|w2) * ... * P(wn|wn-1), with P(wi|wi-1) = count(wi, wi-1) / count(wi-1); where p(w1…wn) is the probability of the sentence, n is the sentence length, P(wi|wi-1) is the conditional probability of the two words co-occurring, w denotes a word, count(wi-1) is the number of occurrences of the word wi-1 in the corpus, and count(wi, wi-1) is the number of times the words wi and wi-1 appear together. The occurrence probability p(w1…wn) of each sentence in the second corrected text is computed, and the probabilities of all sentences in the second corrected text are multiplied together to obtain the total semantic-fluency score P(S) of the second corrected text.
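The bigram chain rule used by this semantic model can be sketched directly from counts. The training sentences and function names below are illustrative; a production system would train such counts over a large corpus with smoothing.

```python
from collections import Counter

def train_bigram(sentences):
    """Collect unigram and bigram counts from tokenized sentences."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi, sum(uni.values())

def sentence_prob(words, uni, bi, total):
    """P(S) ≈ P(w1) * Π P(wi | wi-1), with
    P(wi|wi-1) = count(wi-1, wi) / count(wi-1)."""
    if not words:
        return 0.0
    p = uni[words[0]] / total                 # P(w1)
    for prev, cur in zip(words, words[1:]):
        if uni[prev] == 0:
            return 0.0                        # unseen context
        p *= bi[(prev, cur)] / uni[prev]      # conditional bigram probability
    return p
```

Multiplying `sentence_prob` over every sentence of a candidate text gives the total score P(S) described above.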
In another embodiment of the present application, text correction proceeds as follows.
(1) The error correction scheme comprises two steps: first error detection, then error correction.
(2) Error detection part:
First, the corpus is divided into nine categories according to the existing linguistic data, and a classification model is fine-tuned from a BERT pre-trained model.
Second, the kenlm toolkit is used to train the n-gram model on roughly 20 GB of corpora from the People's Daily and Chinese Wikipedia.
Third, linguistic dictionaries are built for easily confused words, homophones, similar-shape words, sensitive words, and common-sense lexicons, along with a jieba user dictionary covering person names, place names, organization names, and so on.
Detection combines the BERT model and the n-gram language model. The error detection stage first segments the text with a Chinese word segmenter; because the sentence contains typos, the segmentation result always includes mis-cut words, so errors are detected at two granularities. Character granularity: a character whose likelihood under the language-model perplexity (ppl) is below the sentence average is very likely a suspected typo. Word granularity: a segmented word absent from the dictionary is very likely a suspected erroneous word. The suspected errors from both granularities are merged into a candidate set of suspected error positions.
Specifically, the perplexity of the language model determines how confusable the character or word at each position is: the higher the perplexity, the more likely a typo or wrong word. A threshold is set, and anything exceeding it is judged an error.
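The word-granularity half of the detection just described can be sketched with a toy greedy maximum-match segmenter standing in for jieba, and a toy vocabulary standing in for the trained dictionaries; `max_match` and `oov_spans` are illustrative names, not the system's real components.

```python
def max_match(text, vocab, max_len=4):
    """Greedy forward maximum-match segmentation over a toy vocabulary."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # Fall back to a single character when no dictionary word matches.
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def oov_spans(text, vocab):
    """Flag out-of-vocabulary segments as suspected error positions."""
    spans, i = [], 0
    for w in max_match(text, vocab):
        if w not in vocab:
            spans.append((i, i + len(w)))
        i += len(w)
    return spans
```

A stray character inserted into a known phrase surfaces as a one-character OOV span, which the merged candidate set would then pass to correction.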
(3) Error correction part:
First, all suspected error positions are traversed and the words at those positions are replaced using the similarity dictionaries; sentence perplexity is then computed with the language model, the results for all candidate sets are compared and ranked, and the combination with the minimum perplexity yields the best corrected word. For sensitive words and common-sense errors, corresponding lexicons must be built.
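This traversal can be sketched as a greedy loop over suspect positions, keeping any substitution that lowers the sentence's perplexity. The scorer is injected as a function here; in the described system it would be the kenlm model, and `correct` and its arguments are illustrative names.

```python
def correct(text, suspects, similar, perplexity):
    """Traverse suspected positions, try similarity-dictionary candidates,
    and greedily keep the variant with the lowest perplexity."""
    best = text
    for pos in suspects:
        for cand in similar.get(best[pos], []):
            variant = best[:pos] + cand + best[pos + 1:]
            if perplexity(variant) < perplexity(best):
                best = variant  # lower perplexity = more fluent sentence
    return best
```

A greedy pass like this fixes one position at a time; an exhaustive comparison over the full candidate set, as the text describes, would score every combination instead.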
(4) Multi-word, missing word, out-of-order part:
For extra characters, missing characters, and disordered word order, only an error prompt can be given; the error position and a modification suggestion cannot be reliably identified. The prior art uses a CRF model to locate the specific error position, but its recognition effect is poor owing to scarce corpora and label imbalance.
(5) And (3) a model part:
BERT model and the kenlm statistical language-model toolkit.
(6) And (3) semantic part:
semantic scoring is performed in the same manner as in steps S5 and S6.
A Chinese text error correction apparatus, the apparatus comprising a memory, a processor, and a Chinese text error correction processing program stored in the memory and executable on the processor; when executed by the processor, the program implements the steps of the Chinese text error correction method.
A computer-readable storage medium having stored thereon a Chinese text error correction processing program which, when executed by a processor, implements the steps of the Chinese text error correction method.
The machine-learning-based Chinese text error correction system, method, apparatus, and computer-readable storage medium support multi-threaded processing, so multiple text sentences can be corrected concurrently with high efficiency. Multiple machine-learning methods check and correct the various errors in a Chinese text, turning a disordered text into a fluent, readable one: perplexity is used to locate the positions of erroneous characters, a confusion set together with a language model selects the correct replacement, and a correct Chinese expression is finally returned. In particular, this approach attends to the coherence between adjacent sentences, avoiding the case where an individual phrase is correct but a homophonous phrase is chosen wrongly, and ensuring that the semantics remain fluent and undistorted when several phrases are joined in a whole sentence.
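The multi-threaded batch processing can be sketched as follows, assuming each sentence is corrected independently; `correct_sentence` is a placeholder for the full detect-and-correct pipeline, not the patent's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def correct_sentence(sentence: str) -> str:
    """Placeholder for the full per-sentence detection/correction pipeline."""
    return sentence.strip()

def correct_batch(sentences, workers=4):
    # Sentences are independent, so a thread pool can process them concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(correct_sentence, sentences))
```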
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only some exemplary embodiments of the present invention, and their description is relatively specific and detailed, but it should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A Chinese text error correction system, the system being based on a machine learning model, the system comprising:
the Chinese text pre-training module is used for pre-training a Chinese text to obtain the perplexity, the confusion set, the language model, and the semantic model of the Chinese text;
the Chinese text input module is used for preprocessing the input text, deleting unused punctuation marks and abnormal-length spaces, and converting between Chinese and English punctuation and between encoding formats;
the Chinese spelling check module is used for automatically returning the position of an incorrect character when the character in the Chinese text has misspelling;
the Chinese spelling correction module is used for locating the positions of erroneous characters through the Chinese text pre-training module and the Chinese spelling check module, replacing the characters one by one with candidate words, calculating a fluency result through the language model, and selecting the optimal spelling-corrected text for output to form a first corrected text;
the semantic correction module is used for calculating a semantic fluency result across adjacent sentences of the first corrected text through the semantic model in the Chinese text pre-training module, reselecting candidate words for the first corrected text to replace the characters one by one so that the semantics of adjacent sentences are unified, and outputting a plurality of corrected texts with unified semantics to form second corrected texts;
and the grammar and language evaluation module is used for receiving the second corrected texts, evaluating the total semantic score of each second corrected text, sorting the total semantic scores of all the second corrected texts from largest to smallest, and outputting the second corrected text with the highest score as the final corrected text.
2. A Chinese text error correction method using the Chinese text error correction system of claim 1, wherein the Chinese text error correction method comprises the following steps:
S1: pre-training a Chinese text to obtain the perplexity, the confusion set, the language model, and the semantic model of the Chinese text;
S2: preprocessing the input text, deleting unused punctuation marks and abnormal-length spaces, and converting between Chinese and English punctuation and between encoding formats;
S3: taking each character or punctuation mark as a position, performing residual processing character by character, and, when a spelling error exists in the Chinese text, returning the position of the incorrect character;
S4: after all suspected errors are located through error detection, replacing the characters one by one with candidate words, obtaining the fluency calculation result of the similar candidate short-text set on the basis of a language model, and finally selecting the optimal spelling-corrected text for output to form a first corrected text;
S5: calculating a semantic fluency result across adjacent sentences of the first corrected text through the semantic model in the Chinese text pre-training module, reselecting candidate words for the first corrected text to replace the characters one by one so that the semantics of adjacent sentences are unified, and outputting a plurality of corrected texts with unified semantics to form second corrected texts;
S6: evaluating the total semantic score of each second corrected text, sorting the total semantic scores of all the second corrected texts from largest to smallest, and outputting the second corrected text with the highest score as the final corrected text.
3. The Chinese text error correction method according to claim 2, wherein the step S5 specifically comprises:
S51: for each error correction candidate word in the first corrected text, calculating the score of each candidate word by using the semantic model;
S52: accumulating the score of each word with those of its adjacent words among the error correction candidate words to obtain the total semantic fluency score of the error correction candidate words;
S53: sorting the total semantic fluency scores of all the first corrected texts from largest to smallest, and outputting the first corrected text with the highest score as the second corrected text.
4. The Chinese text error correction method according to claim 2, wherein the step S6 specifically comprises:
S61: for the semantics of each sentence in the second corrected text, calculating the score of the in-sentence semantics by using the semantic model;
S62: multiplying the occurrence probability of each sentence in the second corrected text by the occurrence probabilities of its adjacent sentences to obtain the total semantic fluency score of the second corrected text;
S63: sorting the total semantic scores of all the second corrected texts from largest to smallest, and outputting the second corrected text with the highest score as the final corrected text.
5. The Chinese text error correction method according to claim 4, wherein the semantic model calculates the score of the in-sentence semantics as follows:
the semantic model is built by classifying and processing homophones, confusable-sound words, reversed word order, incomplete words, wrongly shaped characters, sensitive words, common-sense errors, and extra characters, and computes P(S) ≈ P(w1) * P(w2|w1) * P(w3|w2) * ... * P(wn|wn-1), where P(wi|wi-1) = count(wi-1, wi) / count(wi-1); here P(w1…wn) is the probability of the sentence, n is the length of the sentence, P(wi|wi-1) is the conditional probability of the co-occurrence of two words, w denotes a word, count(wi-1) is the number of occurrences of the word wi-1 in the corpus, and count(wi-1, wi) is the number of times the two words wi-1 and wi occur together;
calculating the occurrence probability P(w1…wn) of each sentence in the second corrected text, and multiplying the probabilities of all the sentences in the second corrected text together to obtain the total semantic fluency score P(S) of the second corrected text.
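The bigram computation of claim 5 can be sketched as follows; the toy corpus and the pre-tokenized word lists are illustrative assumptions, not the patent's training data:

```python
from collections import Counter

def train_counts(corpus):
    """Count unigrams and bigrams over a tokenized corpus."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        for prev, w in zip(sent, sent[1:]):
            bi[(prev, w)] += 1
        uni.update(sent)
    return uni, bi

def sentence_prob(words, uni, bi):
    """P(S) ≈ P(w1) * Π P(wi|wi-1), with P(wi|wi-1) = count(wi-1, wi) / count(wi-1)."""
    total = sum(uni.values())
    p = uni[words[0]] / total
    for prev, w in zip(words, words[1:]):
        p *= bi[(prev, w)] / uni[prev]
    return p
```

A production model would add smoothing for unseen bigrams; the raw maximum-likelihood estimate above assigns zero probability to any unseen pair.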
6. The Chinese text error correction method according to claim 2, wherein the step S4 specifically comprises:
S41: obtaining a candidate set of replacement characters for each suspected erroneous character: after all suspected errors are located through error detection, obtaining from the confusion set the phonetically similar, visually similar, and common-sense-error candidate words for all the suspected erroneous characters;
S42: replacing the characters at those positions with the candidate characters: for each replaceable character, enumerating every character of its confusion set in place of the original character, thereby obtaining a short-text candidate set of replacements for the suspected erroneous characters;
S43: obtaining a fluency ranking of the candidate short texts based on the n-gram language model of S13, and selecting the sentence with the highest fluency score as the final candidate text.
7. The Chinese text error correction method according to claim 6, wherein S43 specifically comprises:
S431: taking the word as the minimum calculation unit, and performing word segmentation with an existing Chinese word segmentation model;
S432: calculating, based on a given language model, the frequency with which common words occur in a given corpus to obtain the fluency;
S433: replacing the original text if the text fluency is greater than a predefined threshold;
S434: if the fluency of the final candidate text is less than the predefined threshold, the original text is correct and is retained.
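Steps S433 and S434 above reduce to a simple threshold decision, sketched here with a hypothetical `fluency_score` function standing in for the language-model fluency computation:

```python
def accept_correction(original, candidate, fluency_score, threshold):
    """Adopt the candidate only when its fluency clears the predefined threshold;
    otherwise the original text is presumed correct and retained."""
    if fluency_score(candidate) > threshold:
        return candidate
    return original
```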
8. The Chinese text error correction method according to claim 2, wherein S3 specifically comprises:
S31: removing special symbols from the training corpus and replacing invalid characters in the text, the invalid characters being characters other than Chinese, English, digits, and common punctuation;
S32: splitting long text into short texts according to specific punctuation marks and spaces;
S33: returning the positions of suspected incorrect characters: calculating the likelihood probability value of each character by using the perplexity and the occurrence probability of the word, and, if the likelihood probability value of a character is lower than the average probability value of the text, judging the character to be a suspected wrongly written character and returning its position in the text.
9. A Chinese text error correction apparatus, comprising a memory, a processor, and a Chinese text error correction processing program stored in the memory and executable on the processor, wherein the Chinese text error correction processing program, when executed by the processor, implements the steps of the Chinese text error correction method according to any one of claims 2 to 8.
10. A computer-readable storage medium, wherein a Chinese text error correction processing program is stored on the computer-readable storage medium, and the Chinese text error correction processing program, when executed by a processor, implements the steps of the Chinese text error correction method according to any one of claims 2 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110675560.7A CN113435186B (en) | 2021-06-18 | 2021-06-18 | Chinese text error correction system, method, device and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113435186A true CN113435186A (en) | 2021-09-24 |
CN113435186B CN113435186B (en) | 2022-05-20 |
Family
ID=77756389
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110675560.7A Active CN113435186B (en) | 2021-06-18 | 2021-06-18 | Chinese text error correction system, method, device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113435186B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140012567A1 (en) * | 2012-07-09 | 2014-01-09 | International Business Machines Corporation | Text Auto-Correction via N-Grams |
CN110852087A (en) * | 2019-09-23 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Chinese error correction method and device, storage medium and electronic device |
CN111639489A (en) * | 2020-05-15 | 2020-09-08 | 民生科技有限责任公司 | Chinese text error correction system, method, device and computer readable storage medium |
CN111723791A (en) * | 2020-06-11 | 2020-09-29 | 腾讯科技(深圳)有限公司 | Character error correction method, device, equipment and storage medium |
CN112232055A (en) * | 2020-10-28 | 2021-01-15 | 中国电子科技集团公司第二十八研究所 | Text detection and correction method based on pinyin similarity and language model |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113935313A (en) * | 2021-10-15 | 2022-01-14 | 江苏省未来网络创新研究院 | Method for realizing automatic detection and identification of multiple Chinese characters |
CN114118065A (en) * | 2021-10-28 | 2022-03-01 | 国网江苏省电力有限公司电力科学研究院 | A method, device, storage medium and computing device for Chinese text error correction in the field of electric power |
CN114328798A (en) * | 2021-11-09 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Processing method, device, equipment, storage medium and program product for searching text |
CN114328798B (en) * | 2021-11-09 | 2024-02-23 | 腾讯科技(深圳)有限公司 | Processing method, device, equipment, storage medium and program product for searching text |
CN114153971A (en) * | 2021-11-09 | 2022-03-08 | 浙江大学 | A device for error correction, recognition and classification of Chinese text containing errors |
CN114282523A (en) * | 2021-11-22 | 2022-04-05 | 北京方寸无忧科技发展有限公司 | A sentence correction method and device based on bert model and ngram model |
CN114254623A (en) * | 2021-12-14 | 2022-03-29 | 河北省讯飞人工智能研究院 | A text error correction method, device, equipment and storage medium |
CN114169315A (en) * | 2021-12-21 | 2022-03-11 | 深圳供电局有限公司 | A text error correction method, system, device and medium |
CN114386403A (en) * | 2022-01-07 | 2022-04-22 | 北京方寸无忧科技发展有限公司 | Method and system for correcting multiple same wrong words in Chinese text |
CN114078254A (en) * | 2022-01-07 | 2022-02-22 | 华中科技大学同济医学院附属协和医院 | Intelligent data acquisition system based on robot |
CN114065738A (en) * | 2022-01-11 | 2022-02-18 | 湖南达德曼宁信息技术有限公司 | Chinese spelling error correction method based on multitask learning |
CN114510925A (en) * | 2022-01-25 | 2022-05-17 | 森纵艾数(北京)科技有限公司 | Chinese text error correction method, system, terminal equipment and storage medium |
CN114611524A (en) * | 2022-02-08 | 2022-06-10 | 马上消费金融股份有限公司 | Text error correction method and device, electronic equipment and storage medium |
CN114611524B (en) * | 2022-02-08 | 2023-11-17 | 马上消费金融股份有限公司 | Text error correction method and device, electronic equipment and storage medium |
CN114510926A (en) * | 2022-02-14 | 2022-05-17 | 维沃移动通信有限公司 | Text error correction method, text error correction device and electronic equipment |
CN116090441A (en) * | 2022-12-30 | 2023-05-09 | 永中软件股份有限公司 | Chinese spelling error correction method integrating local semantic features and global semantic features |
CN116090441B (en) * | 2022-12-30 | 2023-10-20 | 永中软件股份有限公司 | Chinese spelling error correction method integrating local semantic features and global semantic features |
Also Published As
Publication number | Publication date |
---|---|
CN113435186B (en) | 2022-05-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113435186B (en) | Chinese text error correction system, method, device and computer readable storage medium | |
CN111639489A (en) | Chinese text error correction system, method, device and computer readable storage medium | |
Kissos et al. | OCR error correction using character correction and feature-based word classification | |
US9875254B2 (en) | Method for searching for, recognizing and locating a term in ink, and a corresponding device, program and language | |
JP4568774B2 (en) | How to generate templates used in handwriting recognition | |
CN109800414B (en) | Method and system for recommending language correction | |
CN110276077A (en) | Chinese error correction method, device and equipment | |
CN107045496A (en) | The error correction method and error correction device of text after speech recognition | |
US20140298168A1 (en) | System and method for spelling correction of misspelled keyword | |
Chanlekha et al. | Thai named entity extraction by incorporating maximum entropy model with simple heuristic information | |
Fahda et al. | A statistical and rule-based spelling and grammar checker for Indonesian text | |
CN110147546B (en) | Grammar correction method and device for spoken English | |
Chen et al. | Integrating natural language processing with image document analysis: what we learned from two real-world applications | |
Noaman et al. | Automatic Arabic spelling errors detection and correction based on confusion matrix-noisy channel hybrid system | |
Uthayamoorthy et al. | Ddspell-a data driven spell checker and suggestion generator for the tamil language | |
Singh et al. | Review of real-word error detection and correction methods in text documents | |
Rana et al. | Detection and correction of real-word errors in Bangla language | |
Wu et al. | Integrating dictionary and web N-grams for chinese spell checking | |
Sakuntharaj et al. | Detecting and correcting real-word errors in Tamil sentences | |
Yang et al. | Spell Checking for Chinese. | |
Mridha et al. | Semantic error detection and correction in Bangla sentence | |
Lin et al. | A study on Chinese spelling check using confusion sets and? n-gram statistics | |
Jayasuriya et al. | Learning a stochastic part of speech tagger for sinhala | |
Chiu et al. | Chinese spell checking based on noisy channel model | |
Cissé et al. | Automatic Spell Checker and Correction for Under-represented Spoken Languages: Case Study on Wolof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||