Disclosure of Invention
The embodiment of the invention provides a character string identification verification method and device, which can correct an identification result by using domain terms and can effectively improve the identification accuracy.
A first aspect of an embodiment of the present invention provides a method for identifying and checking a character string, including:
Creating a text library of domain terms, wherein each domain term in the text library has a corresponding index;
searching a domain term corresponding to the character string to be corrected in the text library based on a preset similarity algorithm of adjacent words.
Optionally, the method further comprises: establishing the index of each domain term according to the pinyin of its Chinese characters; or establishing the index of each domain term according to the pinyin and the positions of its Chinese characters.
Optionally, the method further comprises: setting a word frequency probability for each domain term.
Optionally, the searching the text library for the domain term corresponding to the character string to be corrected based on the similarity algorithm of the preset adjacent words includes:
searching the domain term corresponding to the character string to be corrected through the following algorithm:
decomposing the character string to be corrected into a set of binary groups of adjacent words;
Searching each binary group in the set in the text library to obtain a searching set corresponding to each binary group;
calculating the similarity of each domain term in each search set;
determining, for each binary group, the domain terms with the highest similarity, and determining the domain term with the highest similarity overall as the domain term corresponding to the character string to be corrected.
Optionally, after calculating the similarity of each domain term in each search set, the method further includes: if several domain terms have equal similarity, determining the one with the highest word frequency probability as the domain term corresponding to the character string to be corrected.
A second aspect of an embodiment of the present invention provides a device for identifying and verifying a character string, including:
a creation module, configured to create a text library of domain terms, wherein each domain term in the text library has a corresponding index;
a searching module, configured to search the text library for the domain term corresponding to the character string to be corrected, based on a preset similarity algorithm of adjacent words.
Optionally, the device further comprises:
an establishing module, configured to establish the index of each domain term according to the pinyin of its Chinese characters; or to establish the index of each domain term according to the pinyin and the positions of its Chinese characters.
Optionally, the device further comprises:
a setting module, configured to set a word frequency probability for each domain term.
Optionally, the searching module is specifically configured to search a domain term corresponding to the character string to be corrected by using the following algorithm:
decomposing the character string to be corrected into a set of binary groups of adjacent words;
Searching each binary group in the set in the text library to obtain a searching set corresponding to each binary group;
calculating the similarity of each domain term in each search set;
determining, for each binary group, the domain terms with the highest similarity, and determining the domain term with the highest similarity overall as the domain term corresponding to the character string to be corrected.
Optionally, the device further comprises:
a determining module, configured to determine, if several domain terms have equal similarity, the one with the highest word frequency probability as the domain term corresponding to the character string to be corrected.
A third aspect of an embodiment of the present invention provides an electronic device, including at least one processor;
And a memory communicatively coupled to the at least one processor;
Wherein the memory stores a program of instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
A fourth aspect of an embodiment of the present invention provides a computer program product for use in a device for identification verification of a character string, the computer program product comprising a functional module as defined in any one of the preceding claims.
From the above technical solutions, the embodiment of the present invention has the following advantages: creating a text library of domain terms, wherein each domain term in the text library has a corresponding index; searching a domain term corresponding to the character string to be corrected in the text library based on a preset similarity algorithm of adjacent words. Therefore, the recognition result can be corrected by using the field terms, and the recognition accuracy can be effectively improved.
Detailed Description
The embodiment of the invention provides a character string identification and verification method and device. The domain terms can be utilized to correct the recognition result, so that the recognition accuracy can be effectively improved.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms first and second in the description and claims of the invention and in the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Because of the limitations of OCR technology, photos taken with a mobile phone often suffer from shadows and blur, so the recognition rate is not high. The embodiment of the invention therefore compares the recognized characters with the domain terms in a pre-established text library of domain terms (in the specific scenario considered here, the characters are recognized from photos of hospital test documents), obtains the domain term closest to the recognition result, and uses it to correct the wrongly recognized Chinese characters, thereby obtaining the correct recognition result. The premise, of course, is that only one or two characters in a domain term are misrecognized; the probability that most characters in a domain term are misrecognized is very small. For example, let Ps be the probability that every Chinese character in a recognized phrase is wrong, and let pi be the recognition error rate of the i-th Chinese character in the phrase. Because of how the recognition algorithm works, the recognition error rates of the individual characters are independent and do not affect each other, so the calculation formula is as follows:
Ps = p1 × p2 × … × pi × … × pn, where n is the number of characters in the phrase.
Assume each pi is 0.4 (i.e., the recognition error rate of a single Chinese character is 40%; because the input is a mobile phone photo, the error rate is much higher than for printed Chinese characters) and the phrase consists of 5 Chinese characters. The probability that the whole phrase is wrong is then Ps = 0.4 × 0.4 × 0.4 × 0.4 × 0.4 = 0.01024.
It can be seen that the probability of occurrence of such a result is very small, and in general, one or two words are wrong, and the result is very easily corrected by comparing domain terms in the medical word stock.
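The arithmetic above can be checked with a minimal Python sketch. The function name all_wrong_probability is hypothetical and is introduced only for illustration; the sketch assumes, as the text does, that the per-character errors are independent.

```python
from math import prod


def all_wrong_probability(error_rates):
    """Probability that every character of a phrase is misrecognized,
    assuming the per-character recognition errors are independent."""
    return prod(error_rates)


# Five characters, each with a 40% recognition error rate (values from the text).
ps = all_wrong_probability([0.4] * 5)
print(round(ps, 5))  # prints 0.01024
```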
Moreover, the non-Chinese characters in medical terms are difficult to recognize in practice, because some Greek letters look very similar to English letters: for example, a and α, B and β, E and Ε, y and γ. Distinguishing them by improving the accuracy of the recognition algorithm is essentially impossible, so they are instead corrected by matching against the domain terms in the medical word stock.
Referring to fig. 1, fig. 1 is a schematic diagram showing a method for identifying and checking a character string according to an embodiment of the present invention, which includes the following steps:
S10, creating a text library of domain terms, wherein each domain term in the text library has a corresponding index;
In this embodiment, considering that existing OCR technology still recognizes mobile phone photos relatively poorly, a text library of domain terms is established for a specific field, for example for recognizing the text of hospital examination reports in the medical field. Correcting the recognition result with these domain terms can effectively improve the recognition accuracy.
Taking the medical field as an example, some medical terms are awkward to read; some are transliterations from English, e.g., penicillin; some are combined from different terms according to the molecular formula, e.g., serum|asparaginyl|transferase|. Medical terms are sometimes long, and comparing each Chinese character one by one would make the search very time-consuming. It is therefore worthwhile to define a set of indexing rules, so that each domain term in the text library has a corresponding index.
The index of each domain term can be established according to the pinyin initials of its Chinese characters. For example, the term penicillin above may be indexed as PNXL. The search over Chinese characters thereby becomes a search over English letters, which greatly speeds up retrieval.
Meanwhile, to preserve the position information of the Chinese characters, the index of each domain term can be established according to the pinyin initial and the position code of each Chinese character. For example, the term penicillin above may be indexed as P1N2X3L4.
Chinese terms often contain English letters, numbers, symbols, Greek letters, and the like. Such characters are not converted to pinyin initials; the position information is appended to them directly. For example, the term serum gamma-glutamyl transferase assay may be indexed as X1Q2γ3-4G5A6X7J8Z9Y10M11.
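The two indexing rules can be sketched in Python as follows. This is only an illustration under stated assumptions: the table PINYIN_INITIALS is a hypothetical, hand-written lookup covering just the example characters, the Chinese spellings of the two example terms (盘尼西林 for penicillin and 血清γ-谷氨酰基转移酶 for the serum term) are assumed, and the function names are not taken from the embodiment.

```python
# Hypothetical lookup table: Chinese character -> pinyin initial.
# It covers only the characters of the two example terms; a real text
# library would need a table for every character it contains.
PINYIN_INITIALS = {
    "盘": "P", "尼": "N", "西": "X", "林": "L",
    "血": "X", "清": "Q", "谷": "G", "氨": "A", "酰": "X",
    "基": "J", "转": "Z", "移": "Y", "酶": "M",
}


def initial_of(ch):
    """Pinyin initial of a known Chinese character. Any character missing
    from the table (letter, digit, symbol, Greek letter) is kept unchanged."""
    return PINYIN_INITIALS.get(ch, ch)


def initials_index(term):
    """Index built from the pinyin initials only."""
    return "".join(initial_of(ch) for ch in term)


def positional_index(term):
    """Index built from the pinyin initial plus the 1-based position of each character."""
    return "".join(f"{initial_of(ch)}{pos}" for pos, ch in enumerate(term, start=1))


print(initials_index("盘尼西林"))    # prints PNXL
print(positional_index("盘尼西林"))  # prints P1N2X3L4
print(positional_index("血清γ-谷氨酰基转移酶"))
```

In this sketch the Greek letter and the hyphen pass through unchanged and simply receive their position numbers, as the text describes for symbols.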
Of course, many different Chinese terms are likely to share the same pinyin initials, so an index lookup may return a set of terms rather than a single term. If the index must retrieve a unique term, the four-corner code can be introduced. The four-corner method is one of the common look-up methods of Chinese dictionaries; it classifies Chinese characters using at most 5 Arabic numerals.
If the target field is the medical field, the number of terms is not large, so even if the result retrieved by the pinyin index is not unique, this matters little; the retrieval speed is, after all, greatly improved.
In addition, a word frequency probability can be set for each domain term. If the similarities calculated from the adjacent words are close, the word frequency probability of the terms can be consulted: the higher the word frequency probability, the more likely the term.
S20, searching a field term corresponding to the character string to be corrected in the text library based on a preset similarity algorithm of adjacent words.
The preset similarity algorithm of adjacent words provided in this embodiment is modeled on the n-gram algorithm. The n-gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese, it is called the Chinese Language Model (CLM). The Chinese language model uses the collocation information between adjacent words in the context: when continuous, unspaced pinyin, strokes, or digits representing letters or strokes must be converted into a Chinese character string (i.e., a sentence), the sentence with the highest probability can be calculated. Conversion to Chinese characters is thus automatic, the user does not need to select manually, and the problem of many Chinese characters sharing the same pinyin (or stroke string or digit string) is avoided.
The language model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and on no other word, and that the probability of the whole sentence is the product of the occurrence probabilities of the individual words.
For example, for a sequence of m words (i.e., a sentence), the probability P(w1,w2,…,wm) can be obtained from the chain rule:
P(w1,w2,…,wm)=P(w1)P(w2|w1)P(w3|w1,w2)…P(wm|w1,…,wm-1);
This probability is obviously hard to compute. The Markov chain assumption can be used instead, i.e., the current word depends only on a limited number of preceding words, so there is no need to trace back to the first word and the above formula becomes much shorter. That is,
P(wi|w1,…,wi-1)=P(wi|wi-n+1,…,wi-1);
In particular, for the case where n takes a small value:
when n=1, the unigram model is P(w1,w2,…,wm) = ∏(i=1..m) P(wi);
when n=2, the bigram model is P(w1,w2,…,wm) = ∏(i=1..m) P(wi|wi-1);
when n=3, the trigram model is P(w1,w2,…,wm) = ∏(i=1..m) P(wi|wi-2,wi-1).
The parameters can then be estimated by the maximum likelihood method, so that the probability of the training samples is maximized.
For the unigram model, where C(w1,…,wn) denotes the number of occurrences of the n-gram w1,…,wn in the training corpus and M is the total number of words in the corpus (e.g., M=5 for "yes no no no yes"),
P(wi)=C(wi)/M;
For the bigram model,
P(wi|wi-1)=C(wi-1wi)/C(wi-1);
For the n-gram model,
P(wi|wi-n+1,…,wi-1)=C(wi-n+1,…,wi)/C(wi-n+1,…,wi-1).
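The count-based estimates can be illustrated with a short Python sketch for the unigram and bigram cases. The five-word toy corpus and all function names are assumptions made for illustration; scoring the first word of a sentence with its unigram probability is likewise an assumption of this sketch, since the text does not specify a start-of-sentence convention.

```python
from collections import Counter


def train(tokens):
    """Count the unigrams and the adjacent-word bigrams of a training corpus."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return unigram, bigram


def p_unigram(w, unigram):
    """P(wi) = C(wi) / M, where M is the total number of words in the corpus."""
    return unigram[w] / sum(unigram.values())


def p_bigram(w, prev, unigram, bigram):
    """P(wi | wi-1) = C(wi-1 wi) / C(wi-1); 0.0 if wi-1 never occurred."""
    if unigram[prev] == 0:
        return 0.0
    return bigram[(prev, w)] / unigram[prev]


def sentence_probability(sentence, unigram, bigram):
    """Bigram probability of a sentence; the first word is scored with its
    unigram probability (an assumption of this sketch)."""
    if not sentence:
        return 1.0
    prob = p_unigram(sentence[0], unigram)
    for prev, w in zip(sentence, sentence[1:]):
        prob *= p_bigram(w, prev, unigram, bigram)
    return prob


unigram, bigram = train("yes no no no yes".split())  # toy corpus, M = 5
print(p_unigram("yes", unigram))                # C(yes)/M = 2/5
print(p_bigram("no", "yes", unigram, bigram))   # C(yes no)/C(yes) = 1/2
print(sentence_probability(["yes", "no", "yes"], unigram, bigram))
```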
The n-gram technique is widely used for word segmentation, semantic analysis, text compression, spelling error checking, accelerating string search, and identifying the language of documents. Those application scenarios differ from the one considered here, so the algorithm of this embodiment borrows the similarity calculation rule of the n-gram and is derived as follows:
First, it is based on one scenario: what is recognized is the core part of a hospital laboratory sheet, where the spacing between columns is large, so word segmentation has already been performed during recognition. In the correction process, the length of the character string is therefore accurate, but whether each Chinese character is correct remains to be determined.
Secondly, a bigram model is adopted, with the following definitions:
Definition 1: adjacent word doublet (nb): a binary group formed from 2 adjacent characters of the recognition result (its position must be recorded for retrieval). For example, for the four-character term 尿酸检测 (uric acid detection), the adjacent word doublets are 尿酸 ("uric acid"), 酸检 ("acid detection") and 检测 ("detection"), and the set of these three binary groups is NB. The number of elements of NB is |NB| = n - 1, where n is the length of the recognition result string.
Definition 2: binary search result: the recognition result is searched in the term library one adjacent word doublet at a time; the result may be a set of terms (terms whose length does not match are removed during the search) and is denoted Rnb (nb ∈ NB), where each element (i.e., a term) is denoted r.
Definition 3: the search total Unb is the collection of all Rnb:
Unb = Rnb1 + … + Rnbi + … + Rnbm (m = |NB|).
Definition 4: when the search results are aggregated, the number of times each element r is repeated is recorded; this count is the similarity Sr.
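Definitions 1 to 4 can be sketched in Python as follows. This is a minimal illustration, not the embodiment's implementation: it compares characters directly instead of going through the pinyin index, it assumes that a doublet matches a term only at the same recorded position, and the small term library, the misrecognized input 尿酸检侧 and all function names are hypothetical.

```python
from collections import Counter


def adjacent_doublets(text):
    """Definition 1: (position, doublet) pairs; a string of length n yields n - 1 of them."""
    return [(i, text[i:i + 2]) for i in range(len(text) - 1)]


def search_doublet(position, doublet, length, library):
    """Definition 2: terms of matching length that contain the doublet at the same position."""
    return {t for t in library
            if len(t) == length and t[position:position + 2] == doublet}


def similarity_counts(text, library):
    """Definitions 3 and 4: pool the search results of all doublets and count
    how many times each term recurs (its similarity Sr)."""
    counts = Counter()
    for position, doublet in adjacent_doublets(text):
        counts.update(search_doublet(position, doublet, len(text), library))
    return counts


# Hypothetical term library and a recognition result whose last character is wrong.
library = ["尿酸检测", "尿酸测定", "尿素检测", "血糖检测"]
print(adjacent_doublets("尿酸检测"))
print(similarity_counts("尿酸检侧", library))
```

With this input, two of the three doublets hit 尿酸检测 and one hits 尿酸测定, so the correctly spelled term receives the highest similarity even though one character was misrecognized.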
The similarity algorithm for adjacent words is thus as follows:
decomposing the character string to be corrected into a set of binary groups of adjacent words;
Searching each binary group in the set in the text library to obtain a searching set corresponding to each binary group;
calculating the similarity of each domain term in each search set;
determining, for each binary group, the domain terms with the highest similarity, and determining the domain term with the highest similarity overall as the domain term corresponding to the character string to be corrected.
In addition, if several domain terms have equal similarity, the one with the highest word frequency probability is determined as the domain term corresponding to the character string to be corrected.
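The final selection step, including the tie-break on word frequency probability, might look like the following Python sketch. The function name pick_best_term, the example similarities and the word frequency probabilities are all hypothetical values chosen for illustration.

```python
def pick_best_term(similarity, word_frequency):
    """Return the term with the highest similarity. Ties are broken by the
    higher word frequency probability; terms without a recorded frequency
    count as 0.0. Returns None when there is no candidate."""
    if not similarity:
        return None
    return max(similarity,
               key=lambda term: (similarity[term], word_frequency.get(term, 0.0)))


# Hypothetical case: three candidate terms with equal similarity.
similarity = {"尿酸检测": 1, "尿素检测": 1, "血糖检测": 1}
word_frequency = {"尿酸检测": 0.02, "尿素检测": 0.03, "血糖检测": 0.05}
print(pick_best_term(similarity, word_frequency))  # prints 血糖检测
```

If two terms tie on both similarity and word frequency probability, this sketch returns whichever comes first in the dictionary; the text does not specify that case.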
In this embodiment, a text library of domain terms is created, where each domain term in the text library has a corresponding index; searching a domain term corresponding to the character string to be corrected in the text library based on a preset similarity algorithm of adjacent words. Therefore, the recognition result can be corrected by using the field terms, and the recognition accuracy can be effectively improved.
The embodiment of the invention also provides a device for identifying and checking the character string, as shown in fig. 2, which comprises:
A creating module 10, configured to create a text library of domain terms, where each domain term in the text library has a corresponding index;
The searching module 20 is configured to search the text library for a domain term corresponding to the character string to be corrected based on a preset similarity algorithm of adjacent words.
Further, the device may further include: an establishing module, configured to establish the index of each domain term according to the pinyin of its Chinese characters; or to establish the index of each domain term according to the pinyin and the positions of its Chinese characters.
Further, the device may further include: a setting module, configured to set a word frequency probability for each domain term.
Further, the searching module 20 is specifically configured to search the domain term corresponding to the character string to be corrected by the following algorithm:
decomposing the character string to be corrected into a set of binary groups of adjacent words;
Searching each binary group in the set in the text library to obtain a searching set corresponding to each binary group;
calculating the similarity of each domain term in each search set;
determining, for each binary group, the domain terms with the highest similarity, and determining the domain term with the highest similarity overall as the domain term corresponding to the character string to be corrected.
Further, the device further comprises:
a determining module, configured to determine, if several domain terms have equal similarity, the one with the highest word frequency probability as the domain term corresponding to the character string to be corrected.
The indexing rules, the word frequency probability, and the similarity algorithm of adjacent words are the same as in the method embodiment above and are not described here again.
In this embodiment, a text library of domain terms is created, where each domain term in the text library has a corresponding index; searching a domain term corresponding to the character string to be corrected in the text library based on a preset similarity algorithm of adjacent words. Therefore, the recognition result can be corrected by using the field terms, and the recognition accuracy can be effectively improved.
Fig. 3 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application. The device includes one or more processors 301 and a memory 302; one processor 301 is taken as an example in fig. 3. The processor 301 and the memory 302 may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 3.
The memory 302, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the device for identifying and verifying a character string in the embodiment of the present invention. By running the non-volatile software programs, instructions, and modules stored in the memory 302, the processor 301 executes the various functional applications and data processing of the server, that is, implements the method for identifying and verifying a character string of the above method embodiment.
Memory 302 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the identification verification device of the character string, or the like. In addition, memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 302 may optionally include memory located remotely from processor 301, which may be connected to the identification verification means of the character string via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic equipment can execute the device or the method provided by the embodiment of the application and has the corresponding functional modules and beneficial effects of executing the device or the method. Technical details not described in detail in this embodiment may be referred to the apparatus or method provided in the embodiments of the present application.
Also, the system embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general-purpose hardware platform, or by hardware. Based on this understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.