Disclosure of Invention
The present application aims to provide a text error detection method and apparatus, an electronic device, and a storage medium that can detect semantic ambiguity errors (semantically unclear sentences) in a text and thereby improve the accuracy of text error detection.
To solve the above technical problem, the present application provides a text error detection method, including:
acquiring a training text in a first language, and determining the perplexity and grammatical error information of the training text;
translating the training text into a pivot language text in a second language, and translating the pivot language text into a target text in the first language;
calculating the text similarity between the training text and the target text, and determining the perplexity of the target text;
performing a word alignment operation between the training text and the pivot language text and between the target text and the pivot language text, to obtain alignment evaluation information of the training text and the target text;
training an initial model according to the perplexity of the training text, the grammatical error information of the training text, the text similarity, the perplexity of the target text, the ratio of the perplexity of the target text to that of the training text, and the alignment evaluation information, to obtain a semantic ambiguity detection model;
and performing a text error detection operation on a sentence text to be detected by means of the semantic ambiguity detection model.
Optionally, the grammatical error information includes the average probability that each word in the training text contains a grammatical error, and the number of word replacement errors with different roots.
Optionally, performing the word alignment operation between the training text and the pivot language text and between the target text and the pivot language text, to obtain the alignment evaluation information of the training text and the target text, includes:
performing a word alignment operation on the training text and the pivot language text to obtain a first alignment result;
performing a word alignment operation on the target text and the pivot language text to obtain a second alignment result;
and determining the alignment evaluation information of the training text and the target text according to the first alignment result and the second alignment result.
Optionally, the alignment evaluation information includes:
the ratio of the number of aligned content words in the training text to the number of all content words in the training text;
the ratio of the number of aligned content words in the target text to the number of all content words in the target text;
an alignment number ratio, determined by drawing connecting lines between words in the training text and words in the target text that correspond to the same word in the pivot language text, and taking the ratio of the number of crossed connecting lines to the total number of alignments as the alignment number ratio;
a ratio of a first word alignment probability to a second word alignment probability, wherein the first word alignment probability is the probability that the words of the training text are aligned with the words of the pivot language text, and the second word alignment probability is the probability that the words of the target text are aligned with the words of the pivot language text.
Optionally, performing the text error detection operation on the sentence text to be detected by means of the semantic ambiguity detection model includes:
determining the sentence text to be detected;
and inputting the sentence text to be detected into the semantic ambiguity detection model, and judging whether a text error exists in the sentence text to be detected according to the detection result output by the semantic ambiguity detection model.
Optionally, determining the sentence text to be detected includes:
if voice information is received, converting the voice information into the sentence text to be detected in the first language.
Optionally, after the text error detection operation is performed on the sentence text to be detected by means of the semantic ambiguity detection model, the method further includes:
marking the erroneous text content in the sentence text to be detected, and generating a corrected text in the first language according to the erroneous text content.
The present application also provides a text error detection apparatus, including:
a training text processing module, configured to acquire a training text in a first language and determine the perplexity and grammatical error information of the training text;
a language translation module, configured to translate the training text into a pivot language text in a second language and translate the pivot language text into a target text in the first language;
a target text processing module, configured to calculate the text similarity between the training text and the target text and determine the perplexity of the target text;
a word alignment module, configured to perform a word alignment operation between the training text and the pivot language text and between the target text and the pivot language text, to obtain alignment evaluation information of the training text and the target text;
a model training module, configured to train an initial model according to the perplexity of the training text, the grammatical error information of the training text, the text similarity, the perplexity of the target text, the ratio of the perplexity of the target text to that of the training text, and the alignment evaluation information, to obtain a semantic ambiguity detection model;
and a detection module, configured to perform a text error detection operation on a sentence text to be detected by means of the semantic ambiguity detection model.
The present application further provides a storage medium on which a computer program is stored, wherein the computer program, when executed, implements the steps of the above text error detection method.
The present application also provides an electronic device, including a memory and a processor, wherein a computer program is stored in the memory, and the processor implements the steps of the above text error detection method when calling the computer program in the memory.
The present application provides a text error detection method, including: acquiring a training text in a first language, and determining the perplexity and grammatical error information of the training text; translating the training text into a pivot language text in a second language, and translating the pivot language text into a target text in the first language; calculating the text similarity between the training text and the target text, and determining the perplexity of the target text; performing a word alignment operation between the training text and the pivot language text and between the target text and the pivot language text, to obtain alignment evaluation information of the training text and the target text; training an initial model according to the perplexity of the training text, the grammatical error information of the training text, the text similarity, the perplexity of the target text, the ratio of the perplexity of the target text to that of the training text, and the alignment evaluation information, to obtain a semantic ambiguity detection model; and performing a text error detection operation on a sentence text to be detected by means of the semantic ambiguity detection model.
In the present application, after a training text in a first language is obtained, the perplexity and grammatical error information of the training text are determined. The training text is translated into a second language and then translated back into a target text in the first language, so that the perplexity of the target text and the text similarity between the target text and the training text can be determined. Word alignment operations are performed between the training text and the pivot language text and between the target text and the pivot language text to obtain alignment evaluation information. An initial model is then trained with this feature information about the training text and the target text to obtain a semantic ambiguity detection model, which is used to detect semantic ambiguity errors in a sentence text to be detected. The present application can therefore detect semantic ambiguity errors in a text and improve the accuracy of text error detection. The present application also provides a text error detection apparatus, an electronic device, and a storage medium, which have the same beneficial effects and are not described again here.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to FIG. 1, FIG. 1 is a flowchart of a text error detection method according to an embodiment of the present application.
The specific steps may include:
S101: acquiring a training text in a first language, and determining the perplexity and grammatical error information of the training text;
The first language and the second language mentioned in this embodiment are two different languages and may be selected according to the specific application scenario; for example, the first language may be English and the second language may be Chinese. In this step, training texts in the first language are obtained first, and the number of training texts is not limited. A training text may be a sentence entered by a user during language teaching, so the training text may contain various errors made by the user, including semantic ambiguity errors.
After the training text in the first language is obtained, the perplexity of each sentence of the training text can be calculated sentence by sentence. As a possible implementation, this embodiment may define the perplexity as the exponential of the cross entropy, calculated as follows:

$$PP(W) = 2^{H(W)} = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}, \qquad H(W) = -\frac{1}{N}\log_2 P(w_1 w_2 \ldots w_N), \qquad P(w_1 w_2 \ldots w_N) = \prod_{i=1}^{N} P(w_i \mid w_1 \ldots w_{i-1})$$

In the above formula, $PP(W)$ represents the perplexity, $H(W)$ represents the cross entropy of a sentence, $W$ represents the sentence, $w_1 w_2 \ldots w_N$ represents the $N$ words of the sentence, $N$ represents the number of words, $P(w_1 w_2 \ldots w_N)$ represents the joint probability of the sentence, and $P(w_i \mid w_1 \ldots w_{i-1})$ represents the conditional probability of $w_i$ given $w_1 \ldots w_{i-1}$. The perplexity indicates how far the training text deviates from a normal sentence: the higher the perplexity, the greater the deviation.
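As an illustrative, non-limiting sketch of the perplexity calculation described above, the following Python snippet computes the perplexity of one sentence from the conditional probability of each word. The function name `sentence_perplexity`, the use of base 2, and the example probabilities are assumptions made for illustration only. In practice the conditional probabilities would come from a language model, which this embodiment does not specify.

```python
import math
from typing import Sequence


def sentence_perplexity(cond_probs: Sequence[float]) -> float:
    """Perplexity of a sentence from P(w_i | w_1 ... w_{i-1}) for each word.

    Computes 2 ** H(W), where H(W) = -(1/N) * sum(log2 p_i).
    """
    if not cond_probs:
        raise ValueError("sentence must contain at least one word")
    if any(p <= 0.0 or p > 1.0 for p in cond_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    n = len(cond_probs)
    cross_entropy = -sum(math.log2(p) for p in cond_probs) / n
    return 2.0 ** cross_entropy


# Hypothetical conditional probabilities for two four-word sentences.
fluent = sentence_perplexity([0.5, 0.5, 0.5, 0.5])
odd = sentence_perplexity([0.5, 0.5, 0.01, 0.5])
```

A single improbable word raises the perplexity of the whole sentence, which is why the value can serve as a signal of deviation from normal usage.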
This step can also determine the grammatical error information of each sentence of the training text; specifically, the grammatical error information may be determined by a grammatical error detection model. The grammatical error information mentioned in this embodiment may include the average probability that each word in the training text contains a grammatical error, and the number of word replacement errors with different roots. Word replacement errors with different roots are errors that remain the same grammatical error after the word is replaced by a word with a different root. For example, "writing" and "write" share the root "write", whereas "reading" and "writing" have different roots, so "I like to reading" and "I like to writing" are word replacement errors with different roots.
S102: translating the training text into a pivot language text in a second language, and translating the pivot language text into a target text in the first language;
In this step, a translation tool may be used to translate the training text in the first language into the pivot language text in the second language and then translate the pivot language text into the target text in the first language, which is equivalent to performing back-translation on the training text to obtain the target text. The pivot language text serves as the bridge between the training text and the target text.
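The round trip through the pivot language can be sketched as follows. This is only an illustration: the `back_translate` helper and the two lookup tables that stand in for real translation engines are hypothetical, and the embodiment does not prescribe any particular translator.

```python
from typing import Callable, Tuple

Translator = Callable[[str], str]


def back_translate(training_text: str,
                   to_pivot: Translator,
                   from_pivot: Translator) -> Tuple[str, str]:
    """Return (pivot_text, target_text) for one training sentence."""
    pivot_text = to_pivot(training_text)    # first language -> second language
    target_text = from_pivot(pivot_text)    # second language -> first language
    return pivot_text, target_text


# Toy stand-ins for real translation engines (hypothetical lookup tables).
_EN_ZH = {"I like apples": "我喜欢苹果"}
_ZH_EN = {"我喜欢苹果": "I like apples"}

pivot, target = back_translate(
    "I like apples",
    to_pivot=lambda s: _EN_ZH.get(s, s),
    from_pivot=lambda s: _ZH_EN.get(s, s),
)
```

Passing the translators in as callables keeps the sketch independent of any specific translation service.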
S103: calculating the text similarity between the training text and the target text, and determining the perplexity of the target text;
After the training text and the target text are obtained, this step calculates the text similarity between them. As a feasible implementation, this embodiment may calculate the text similarity based on the Word Mover's Distance. This embodiment can also calculate the perplexity of the target text using the same method as for the perplexity of the training text.
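The exact Word Mover's Distance requires solving an optimal transport problem. The sketch below instead uses the relaxed Word Mover's Distance, a known lower bound in which each word moves all of its weight to the nearest word of the other sentence. This substitution, the two-dimensional toy embeddings, and the mapping from distance to similarity `1 / (1 + d)` are all assumptions made for illustration and are not stated in the embodiment.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _one_sided_cost(src: List[str], dst: List[str], emb: Dict[str, Vector]) -> float:
    """Each source word sends all of its weight to its nearest destination word."""
    counts = Counter(src)
    total = sum(counts.values())
    cost = 0.0
    for word, count in counts.items():
        nearest = min(_euclidean(emb[word], emb[other]) for other in set(dst))
        cost += (count / total) * nearest
    return cost


def relaxed_wmd(a: List[str], b: List[str], emb: Dict[str, Vector]) -> float:
    """Relaxed Word Mover's Distance: a lower bound on the exact distance."""
    a = [w for w in a if w in emb]
    b = [w for w in b if w in emb]
    if not a or not b:
        raise ValueError("both sentences need at least one in-vocabulary word")
    return max(_one_sided_cost(a, b, emb), _one_sided_cost(b, a, emb))


def similarity(a: List[str], b: List[str], emb: Dict[str, Vector]) -> float:
    """Map a distance in [0, inf) to a similarity in (0, 1] (assumed mapping)."""
    return 1.0 / (1.0 + relaxed_wmd(a, b, emb))


# Hypothetical 2-D word embeddings; real embeddings would be pretrained vectors.
toy_emb = {
    "mother": (1.0, 0.0),
    "mom": (1.0, 0.1),
    "apple": (0.0, 1.0),
    "bought": (0.5, 0.5),
}
same = similarity(["mother", "bought", "apple"], ["mother", "bought", "apple"], toy_emb)
close = similarity(["mother", "bought", "apple"], ["mom", "bought", "apple"], toy_emb)
```

Identical sentences give a similarity of 1, and replacing a word with a near synonym lowers the similarity only slightly.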
S104: performing a word alignment operation between the training text and the pivot language text and between the target text and the pivot language text, to obtain alignment evaluation information of the training text and the target text;
In this step, the training text may be aligned with the pivot language text, and the target text may be aligned with the pivot language text, to obtain the alignment evaluation information of the training text and the target text. In this embodiment, the fast_align method may be used to perform word alignment between the training text and the pivot language text and between the target text and the pivot language text.
Specifically, the process of obtaining the alignment evaluation information of the training text and the target text may include: performing a word alignment operation on the training text and the pivot language text to obtain a first alignment result; performing a word alignment operation on the target text and the pivot language text to obtain a second alignment result; and determining the alignment evaluation information of the training text and the target text according to the first alignment result and the second alignment result. The alignment evaluation information includes: the ratio of the number of aligned content words in the training text to the number of all content words in the training text; the ratio of the number of aligned content words in the target text to the number of all content words in the target text; the alignment number ratio; and the ratio of the first word alignment probability to the second word alignment probability. The alignment number ratio is determined by drawing connecting lines between words in the training text and words in the target text that correspond to the same word in the pivot language text, and taking the ratio of the number of crossed connecting lines to the total number of alignments as the alignment number ratio. The first word alignment probability is the probability that the words of the training text are aligned with the words of the pivot language text, and the second word alignment probability is the probability that the words of the target text are aligned with the words of the pivot language text. Content words may include verbs, nouns, adjectives, and adverbs.
Referring to FIG. 2, FIG. 2 is a schematic diagram illustrating how the alignment number ratio is determined in this embodiment of the present application. In FIG. 2, the training text is "thin app waves other than button for me", the pivot language text is "That apple is purchased by mom", and the target text obtained by translation is "thin app waves bottom for me by my button". The words in the training text and the target text that correspond to the same word in the pivot language text are "thin" and "thin", "applet" and "applet", "wa" and "wa", "other" and "other", "button" and "bottom", "for" and "for", and "me" and "me". After these word pairs are connected, three connecting lines cross, so the ratio of the number of crossed connecting lines to the total number of alignments (i.e., the alignment number ratio) shown in FIG. 2 is 3/7.
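The alignment number ratio can be sketched in Python as follows. The helper names, the index-pair representation of alignments, and the example alignments are hypothetical. The sketch also makes one interpretive assumption: it counts each pair of connecting lines that cross as one crossing. Counting the lines that take part in at least one crossing would be another possible reading of the text.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Link = Tuple[int, int]  # (word index in one text, word index in another text)


def links_via_pivot(train_align: List[Link], target_align: List[Link]) -> List[Link]:
    """Pair training and target word indices aligned to the same pivot word.

    train_align holds (training index, pivot index) pairs and
    target_align holds (target index, pivot index) pairs.
    """
    by_pivot: Dict[int, List[int]] = {}
    for target_idx, pivot_idx in target_align:
        by_pivot.setdefault(pivot_idx, []).append(target_idx)
    return [
        (train_idx, target_idx)
        for train_idx, pivot_idx in train_align
        for target_idx in by_pivot.get(pivot_idx, [])
    ]


def alignment_number_ratio(links: List[Link]) -> float:
    """Ratio of crossing link pairs to the total number of links."""
    if not links:
        return 0.0
    crossings = sum(
        1
        for (i1, j1), (i2, j2) in combinations(links, 2)
        if (i1 - i2) * (j1 - j2) < 0  # opposite word order means the lines cross
    )
    return crossings / len(links)


# Hypothetical alignments: seven shared words, one of which moves to the end
# of the target sentence and therefore crosses three other connecting lines.
train_align = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 6), (5, 4), (6, 5)]
target_align = [(0, 0), (1, 1), (2, 2), (8, 3), (3, 6), (4, 4), (5, 5)]
links = links_via_pivot(train_align, target_align)
ratio = alignment_number_ratio(links)
```

With these hypothetical alignments there are seven connecting lines and three crossings, so the ratio is 3/7.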
S105: training an initial model according to the perplexity of the training text, the grammatical error information of the training text, the text similarity, the perplexity of the target text, the ratio of the perplexity of the target text to that of the training text, and the alignment evaluation information, to obtain a semantic ambiguity detection model;
After the operations of S101 to S104 have been performed, the perplexity of the training text, the grammatical error information of the training text, the text similarity, the perplexity of the target text, the ratio of the perplexity of the target text to that of the training text, and the alignment evaluation information are used as feature information for training the initial model, so as to obtain a semantic ambiguity detection model capable of detecting semantically unclear text.
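One possible way to assemble the feature information listed above into a single input vector is sketched below. The class name, field names, field order, and example values are assumptions made for illustration; the embodiment does not prescribe a data layout.

```python
from dataclasses import astuple, dataclass
from typing import List


@dataclass
class SentenceFeatures:
    train_perplexity: float
    avg_grammar_error_prob: float
    diff_root_replacement_errors: int
    text_similarity: float
    target_perplexity: float
    train_content_aligned_ratio: float
    target_content_aligned_ratio: float
    alignment_number_ratio: float
    alignment_prob_ratio: float

    def to_vector(self) -> List[float]:
        """Flatten the fields and append the target/training perplexity ratio."""
        values = [float(v) for v in astuple(self)]
        values.append(self.target_perplexity / self.train_perplexity)
        return values


# Hypothetical feature values for one training sentence.
features = SentenceFeatures(120.0, 0.2, 1, 0.8, 60.0, 0.75, 1.0, 3 / 7, 0.9)
vector = features.to_vector()
```

The perplexity ratio is derived from two stored fields rather than stored separately, so it cannot drift out of sync with them.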
S106: performing a text error detection operation on a sentence text to be detected by means of the semantic ambiguity detection model.
After the semantic ambiguity detection model is obtained, the sentence text to be detected can be input into the model, and whether the sentence contains a semantic ambiguity error is judged according to the output of the model.
In this embodiment, after a training text in a first language is obtained, the perplexity and grammatical error information of the training text are determined. The training text is translated into a second language and then translated back into a target text in the first language, so that the perplexity of the target text and the text similarity between the target text and the training text can be determined. This embodiment also performs word alignment operations between the training text and the pivot language text and between the target text and the pivot language text to obtain alignment evaluation information, trains an initial model with this feature information about the training text and the target text to obtain a semantic ambiguity detection model, and uses the model to detect semantic ambiguity errors in the sentence text to be detected. This embodiment can therefore detect semantic ambiguity errors in a text and improve the accuracy of text error detection.
As a further description of the embodiment corresponding to FIG. 1, the process of performing the text error detection operation by means of the semantic ambiguity detection model in S106 may include: determining the sentence text to be detected; inputting the sentence text to be detected into the semantic ambiguity detection model; and judging whether a text error exists in the sentence text to be detected according to the detection result output by the semantic ambiguity detection model.
In a specific implementation scenario, this embodiment can be used for spoken language detection: if voice information is received, the voice information is converted into the sentence text to be detected in the first language, and the text error detection operation is then performed on that sentence text by means of the semantic ambiguity detection model.
Further, after the text error detection operation is performed on the sentence text to be detected by means of the semantic ambiguity detection model, the erroneous text content in the sentence text can be marked, and a corrected text in the first language can be generated according to the erroneous text content.
The flow described in the above embodiment is illustrated below by an embodiment in a practical application. This embodiment provides a method for recognizing semantically unclear sentences written by second-language learners. The method determines whether a sentence in English writing is semantically unclear because of an incomplete sentence structure, Chinese-style expression, or misspelled or misremembered words or phrases. The method constructs effective text features using several tools and strategies and trains the semantic ambiguity detection model on these features, so that it can recognize whether a sentence is semantically unclear and give timely feedback to the user, meeting the user's need to express ideas correctly. This embodiment may include the following steps:
Step 1: extracting text features;
Specifically, this embodiment may extract several different text features based on the training text. For example, the text features may include:
(1) The perplexity of each training text;
(2) The grammatical error information of each training text;
The grammatical error information of each training text may include the average probability that each word contains a grammatical error and the number of word replacement errors with different roots.
(3) The text similarity, the perplexity of the target text, and the ratio of the perplexity of the target text to that of the training text;
In this embodiment, the training text can be used as the input sentence, and the target text is obtained by translating the input sentence into the pivot language (Chinese) and then translating it back into English. The text similarity between the training text and the target text is calculated based on the Word Mover's Distance. The perplexity of the target text is calculated, and the ratio of the perplexity of the target text to the perplexity of the training text is determined.
(4) The alignment evaluation information of the training text and the target text;
The fast_align method is used to perform word alignment between the training text and the pivot language (Chinese) sentence and between the target text and the pivot language (Chinese) sentence. The following are then calculated: the ratio of the number of aligned content words in the training text to the number of all content words in the training text; the ratio of the number of aligned content words in the target text to the number of all content words in the target text; the ratio of the number of crossed connecting lines to the total number of alignments, where the connecting lines join words in the training text and the target text that correspond to the same Chinese word; and the ratio of the word alignment score of the training text to that of the target text. Accordingly, the alignment evaluation information may include the two aligned content word ratios, the alignment number ratio, and the ratio of the first word alignment probability to the second word alignment probability (i.e., the ratio of the word alignment scores of the training text and the target text).
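The aligned content word ratio in item (4) can be sketched as follows. The sketch assumes that alignments are available in the `i-j` index-pair text format that fast_align writes, and that each token carries a part-of-speech tag from a tagger that is not specified here. The tag names and helper names are illustrative assumptions.

```python
from typing import Iterable, List, Set, Tuple

# Content words: verbs, nouns, adjectives, and adverbs (assumed tag names).
CONTENT_POS = {"VERB", "NOUN", "ADJ", "ADV"}


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse 'i-j' pairs such as '0-0 1-1 3-2' into (source, pivot) index pairs."""
    pairs = []
    for item in line.split():
        src, dst = item.split("-")
        pairs.append((int(src), int(dst)))
    return pairs


def aligned_content_ratio(pos_tags: List[str], aligned: Iterable[int]) -> float:
    """Share of content words in a sentence that received an alignment."""
    aligned_set: Set[int] = set(aligned)
    content = [i for i, tag in enumerate(pos_tags) if tag in CONTENT_POS]
    if not content:
        return 0.0
    return sum(1 for i in content if i in aligned_set) / len(content)


# Hypothetical six-token sentence with two content words (indices 1 and 3).
pos_tags = ["DET", "NOUN", "AUX", "VERB", "ADP", "PRON"]
full = aligned_content_ratio(pos_tags, [s for s, _ in parse_alignment("0-0 1-1 3-2")])
half = aligned_content_ratio(pos_tags, [s for s, _ in parse_alignment("0-0 1-1")])
```

The same function can be applied to the training text and to the target text, each with its own alignment against the pivot sentence.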
Step 2: fitting a nonlinear classifier to the text features of the training samples and saving the model, where the label of each sample indicates whether the sample is semantically unclear.
The specific fitting process may include: applying min-max normalization to the text features from Step 1 so that each feature falls within the range [0, 1]; dividing the training samples evenly into N parts; and, in each of N rounds, using N-1 parts as the training set and the remaining part as the validation set to fit a binary classifier, thereby obtaining N models (for example, 5 models when N = 5).
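The normalization and data-splitting part of the fitting process can be sketched in plain Python as follows. The helper names and the round-robin assignment of samples to the N parts are assumptions made for illustration; the classifier itself is not shown because the embodiment does not specify which nonlinear classifier is used.

```python
from typing import Iterator, List, Sequence, Tuple

Row = Sequence[float]


def fit_min_max(rows: List[Row]) -> Tuple[List[float], List[float]]:
    """Per-feature minimum and maximum over the training samples."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(row: Row, lows: List[float], highs: List[float]) -> List[float]:
    """Scale one sample to [0, 1] per feature; constant features map to 0."""
    scaled = []
    for value, lo, hi in zip(row, lows, highs):
        scaled.append(0.0 if hi == lo else (value - lo) / (hi - lo))
    return scaled


def n_fold_splits(n_samples: int, n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, validation indices) for each of the N rounds."""
    folds = [list(range(k, n_samples, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


# Hypothetical two-feature samples.
rows = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
lows, highs = fit_min_max(rows)
scaled = apply_min_max([3.0, 20.0], lows, highs)
splits = list(n_fold_splits(6, 3))
```

Each round would fit one binary classifier on the training indices and evaluate it on the validation indices, giving N models in total.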
Step 3: predicting a new sample using the trained nonlinear classifiers;
The specific prediction process may include: extracting the text features of Step 1 from the new sample; normalizing these features using the maximum and minimum values of the training set; predicting the normalized new sample with the N trained binary classification models to obtain N predicted values; and taking the result predicted by the majority of the N models as the prediction result of the new sample.
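The majority decision over the N predicted values can be sketched as follows. The `majority_vote` helper and the threshold functions standing in for trained models are hypothetical, and the feature vector is assumed to have already been normalized with the training-set minimum and maximum. With an even number of models a tie is possible, and this sketch does not define a tie-breaking rule.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A model maps a normalized feature vector to 1 (semantically unclear) or 0.
Model = Callable[[Sequence[float]], int]


def majority_vote(models: List[Model], features: Sequence[float]) -> int:
    """Return the label predicted by most of the models."""
    votes = Counter(model(features) for model in models)
    return votes.most_common(1)[0][0]


# Three hypothetical stand-in models that threshold the first feature.
models: List[Model] = [
    lambda f: int(f[0] > 0.3),
    lambda f: int(f[0] > 0.5),
    lambda f: int(f[0] > 0.7),
]
unclear = majority_vote(models, [0.6])
clear = majority_vote(models, [0.4])
```

Using an odd N, such as the 5 models mentioned in Step 2, avoids ties for a binary label.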
This embodiment provides a general scheme for predicting semantically unclear sentences. It addresses the problem of semantically unclear expression by English learners in writing or oral assessment, and by training models on multiple text features it can obtain reasonable results for semantic ambiguity recognition.
Referring to FIG. 3, FIG. 3 is a schematic structural diagram of a text error detection apparatus according to an embodiment of the present application.
the apparatus may include:
a training text processing module 100, configured to acquire a training text in a first language and determine the perplexity and grammatical error information of the training text;
a language translation module 200, configured to translate the training text into a pivot language text in a second language and translate the pivot language text into a target text in the first language;
a target text processing module 300, configured to calculate the text similarity between the training text and the target text and determine the perplexity of the target text;
a word alignment module 400, configured to perform a word alignment operation between the training text and the pivot language text and between the target text and the pivot language text, to obtain alignment evaluation information of the training text and the target text;
a model training module 500, configured to train an initial model according to the perplexity of the training text, the grammatical error information of the training text, the text similarity, the perplexity of the target text, the ratio of the perplexity of the target text to that of the training text, and the alignment evaluation information, to obtain a semantic ambiguity detection model;
a detection module 600, configured to perform a text error detection operation on a sentence text to be detected by means of the semantic ambiguity detection model.
In this embodiment, after a training text in a first language is obtained, the perplexity and grammatical error information of the training text are determined. The training text is translated into a second language and then translated back into a target text in the first language, so that the perplexity of the target text and the text similarity between the target text and the training text can be determined. This embodiment also performs word alignment operations between the training text and the pivot language text and between the target text and the pivot language text to obtain alignment evaluation information, trains an initial model with this feature information about the training text and the target text to obtain a semantic ambiguity detection model, and uses the model to detect semantic ambiguity errors in the sentence text to be detected. This embodiment can therefore detect semantic ambiguity errors in a text and improve the accuracy of text error detection.
Further, the grammatical error information includes the average probability that each word in the training text contains a grammatical error, and the number of word replacement errors with different roots.
Further, the word alignment module 400 is configured to perform a word alignment operation on the training text and the pivot language text to obtain a first alignment result; to perform a word alignment operation on the target text and the pivot language text to obtain a second alignment result; and to determine the alignment evaluation information of the training text and the target text according to the first alignment result and the second alignment result.
Further, the alignment evaluation information includes:
the ratio of the number of aligned content words in the training text to the number of all content words in the training text;
the ratio of the number of aligned content words in the target text to the number of all content words in the target text;
an alignment number ratio, determined by drawing connecting lines between words in the training text and words in the target text that correspond to the same word in the pivot language text, and taking the ratio of the number of crossed connecting lines to the total number of alignments as the alignment number ratio;
a ratio of a first word alignment probability to a second word alignment probability, wherein the first word alignment probability is the probability that the words of the training text are aligned with the words of the pivot language text, and the second word alignment probability is the probability that the words of the target text are aligned with the words of the pivot language text.
Further, the detection module 600 includes:
a text determining unit, configured to determine the sentence text to be detected;
and a detection unit, configured to input the sentence text to be detected into the semantic ambiguity detection model and judge whether a text error exists in the sentence text to be detected according to the detection result output by the semantic ambiguity detection model.
Further, the text determining unit is configured to convert voice information into the sentence text to be detected in the first language if the voice information is received.
Further, the apparatus also includes:
an error correction module, configured to mark the erroneous text content in the sentence text to be detected after the text error detection operation is performed on the sentence text to be detected by means of the semantic ambiguity detection model, and to generate a corrected text in the first language according to the erroneous text content.
Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.
The present application also provides a storage medium on which a computer program is stored, wherein the computer program, when executed, may implement the steps provided by the above embodiments. The storage medium may include any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The application further provides an electronic device, which may include a memory and a processor, where the memory stores a computer program, and the processor may implement the steps provided by the foregoing embodiments when calling the computer program in the memory. Of course, the electronic device may also include various network interfaces, power supplies, and the like.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method. It should be noted that those skilled in the art can make several improvements and modifications to the present application without departing from its principle, and such improvements and modifications also fall within the protection scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.