
CN108564086B - Character string identification and verification method and device - Google Patents

Character string identification and verification method and device

Info

Publication number
CN108564086B
Authority
CN
China
Prior art keywords
terms
domain
character string
corrected
searching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810221541.5A
Other languages
Chinese (zh)
Other versions
CN108564086A (en
Inventor
祝安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Kedu Medical Technology Co ltd
Original Assignee
Shanghai Kedu Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Kedu Medical Technology Co ltd filed Critical Shanghai Kedu Medical Technology Co ltd
Priority to CN201810221541.5A priority Critical patent/CN108564086B/en
Publication of CN108564086A publication Critical patent/CN108564086A/en
Application granted granted Critical
Publication of CN108564086B publication Critical patent/CN108564086B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)

Abstract

The embodiment of the invention discloses a character string identification verification method and device, which can correct an identification result by using domain terms and can effectively improve the identification accuracy. The method of the embodiment of the invention comprises the following steps: creating a text library of domain terms, wherein each domain term in the text library has a corresponding index; searching a domain term corresponding to the character string to be corrected in the text library based on a preset similarity algorithm of adjacent words.

Description

Character string identification and verification method and device
Technical Field
The present invention relates to the field of text recognition, and in particular, to a method and apparatus for recognizing and verifying a character string.
Background
With the spread of the mobile internet, internet-based medical care is becoming an emerging industry in the development of medical informatization. For reasons of information security, however, hospitals do not place their information on the internet to achieve information interconnection and sharing. From the patient's point of view, though, the patient can photograph a hospital examination receipt and run character recognition on the photo to obtain the medical information in electronic form, which facilitates the organization and collection of data and the structuring of electronic medical records.
However, owing to the limitations of OCR technology, and because photos taken with a mobile phone suffer from shadows and blur, the recognition rate is not high.
Disclosure of Invention
The embodiment of the invention provides a character string identification verification method and device, which can correct an identification result by using domain terms and can effectively improve the identification accuracy.
A first aspect of an embodiment of the present invention provides a method for identifying and checking a character string, including:
Creating a text library of domain terms, wherein each domain term in the text library has a corresponding index;
searching a domain term corresponding to the character string to be corrected in the text library based on a preset similarity algorithm of adjacent words.
Optionally, the method further comprises: establishing the index of each domain term according to the pinyin of its Chinese characters; or, establishing the index of each domain term according to the pinyin and the positions of its Chinese characters.
Optionally, the method further comprises: setting a word frequency probability for each domain term.
Optionally, the searching of the text library for the domain term corresponding to the character string to be corrected, based on the preset similarity algorithm of adjacent words, includes:
searching for the domain term corresponding to the character string to be corrected through the following algorithm:
decomposing the character string to be corrected into a set of two-tuples of adjacent words;
searching the text library for each two-tuple in the set to obtain a search set corresponding to each two-tuple;
calculating the similarity of each domain term in the search sets;
determining, for each two-tuple, the domain term with the highest similarity, and taking the domain term with the highest similarity among them as the domain term corresponding to the character string to be corrected.
Optionally, after calculating the similarity of each domain term in the search sets, the method further includes: if the similarities of several domain terms are equal, determining the one with the highest word frequency probability among them as the domain term corresponding to the character string to be corrected.
A second aspect of an embodiment of the present invention provides a device for identifying and verifying a character string, including:
a creation module, configured to create a text library of domain terms, wherein each domain term in the text library has a corresponding index;
a searching module, configured to search the text library for the domain term corresponding to the character string to be corrected based on a preset similarity algorithm of adjacent words.
Optionally, the device further comprises:
an establishing module, configured to establish the index of each domain term according to the pinyin of its Chinese characters; or, to establish the index of each domain term according to the pinyin and the positions of its Chinese characters.
Optionally, the device further comprises:
a setting module, configured to set a word frequency probability for each domain term.
Optionally, the searching module is specifically configured to search for the domain term corresponding to the character string to be corrected by using the following algorithm:
decomposing the character string to be corrected into a set of two-tuples of adjacent words;
searching the text library for each two-tuple in the set to obtain a search set corresponding to each two-tuple;
calculating the similarity of each domain term in the search sets;
determining, for each two-tuple, the domain term with the highest similarity, and taking the domain term with the highest similarity among them as the domain term corresponding to the character string to be corrected.
Optionally, the device further comprises:
a determining module, configured to determine, if the similarities of several domain terms are equal, the one with the highest word frequency probability among them as the domain term corresponding to the character string to be corrected.
A third aspect of an embodiment of the present invention provides an electronic device, including at least one processor;
And a memory communicatively coupled to the at least one processor;
Wherein the memory stores a program of instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
A fourth aspect of an embodiment of the present invention provides a computer program product for use in a device for identification verification of a character string, the computer program product comprising a functional module as defined in any one of the preceding claims.
From the above technical solutions, the embodiment of the present invention has the following advantages: creating a text library of domain terms, wherein each domain term in the text library has a corresponding index; searching a domain term corresponding to the character string to be corrected in the text library based on a preset similarity algorithm of adjacent words. Therefore, the recognition result can be corrected by using the field terms, and the recognition accuracy can be effectively improved.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a method for recognizing and verifying a character string according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an embodiment of a device for recognizing and verifying character strings according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an embodiment of an electronic device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a character string identification and verification method and device. The domain terms can be utilized to correct the recognition result, so that the recognition accuracy can be effectively improved.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms first and second in the description and claims of the invention and in the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Because of the limitations of OCR technology, and because photos taken with a mobile phone suffer from shadows and blur, the recognition rate is not high. The embodiment of the invention therefore uses a pre-established text library of domain terms: the recognized characters (recognition here targets photos of hospital test documents, a specific scenario) are compared with the domain terms in the library, and the domain term closest to the recognition result is used to correct the wrongly recognized Chinese characters, yielding the correct recognition result. The premise, of course, is that only one or two characters of a domain term are misrecognized; the probability that most characters of a domain term are misrecognized is very small. For example, let Ps be the probability that every character of a Chinese phrase is recognized wrongly, and let pi be the recognition error rate of the i-th Chinese character in the phrase. Given how the recognition algorithm works, the recognition errors of the individual characters are independent of one another, so the calculation formula is:
Ps=p1×p2×…×pi×…×pn, where n is the number of characters in the phrase.
Assume pi is 0.4 for every character (i.e., the recognition error rate of a single Chinese character is 40%; because the input is a mobile phone photo, the error rate is much higher than for printed Chinese characters) and the phrase consists of 5 Chinese characters. The probability that the whole phrase is wrong is then Ps=0.4×0.4×0.4×0.4×0.4=0.01024.
It can be seen that the probability of such a result is very small. In general only one or two characters are wrong, and such a result is easily corrected by comparing it with the domain terms in the medical term library.
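The arithmetic above can be checked with a short sketch. It assumes, as the text does, that per-character recognition errors are independent; the function name `all_wrong_probability` is illustrative and not part of the patent.

```python
from math import prod

def all_wrong_probability(char_error_rates):
    """Probability that every character of a phrase is misrecognized,
    assuming the per-character errors are independent (Ps = p1 x ... x pn)."""
    return prod(char_error_rates)

# Five characters, each with a 40% error rate, as in the example in the text.
ps = all_wrong_probability([0.4] * 5)
print(round(ps, 5))  # 0.01024
```

Because every factor is below 1, the product shrinks quickly as the phrase gets longer, which is why a completely wrong phrase is rare even at a high per-character error rate.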
Moreover, the non-Chinese characters in medical terms are in practice hard to recognize, and Greek letters are sometimes very similar to English letters, for example a and α, B and β, E and E, y and γ. Telling these apart by improving the accuracy of the recognition algorithm is essentially impossible, so the approach chosen is to correct them by matching against the domain terms in the medical term library.
Referring to fig. 1, fig. 1 is a schematic diagram showing a method for identifying and checking a character string according to an embodiment of the present invention, which includes the following steps:
s10, creating a text library of domain terms, wherein each domain term in the text library has a corresponding index;
In this embodiment, considering that the recognition degree of the existing OCR technology for mobile phone photographing is still relatively poor, a text library of domain terms in a specific field is established, for example, in the text recognition of a hospital examination receipt in a medical field, the recognition result is corrected by using the domain terms, so that the recognition accuracy can be effectively improved.
Taking the medical field as an example, some medical terms are awkward to read. Some are transliterations from English, e.g., penicillin; some are composed of several terms according to the molecular formula, e.g., serum|asparaginyl|transferase|. Medical terms are sometimes long, and comparing them Chinese character by Chinese character would make searching very time-consuming, so it is worthwhile to define a set of indexing rules, that is, each domain term in the text library has a corresponding index.
The index of each domain term can be established from the pinyin initials of its Chinese characters. For example, the term penicillin from the above example may be indexed as PNXL. Searching Chinese characters then becomes searching English letters, which greatly speeds up the search.
Meanwhile, in order to keep the position information of the Chinese characters, the index of each domain term can be established from the pinyin initials and the position codes of its Chinese characters. For example, the term penicillin from the above example may be indexed as P1N2X3L4.
Chinese terms often also contain English letters, numbers, symbols, Greek letters, and the like. Such symbols are not converted to pinyin initials; the position information is attached to them directly. For example, the serum γ-glutamyl transferase assay may be indexed as X1Q2γ3-4G5A6X7J8Z9Y10M11.
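The two indexing rules can be sketched as follows. This is a minimal illustration, not the patent's implementation: the table `PINYIN_INITIAL` is a hypothetical hand-written lookup covering only the demo terms, and the Chinese spellings 盘尼西林 (penicillin) and 血清γ-谷氨酰基转移酶 (serum γ-glutamyl transferase) are assumptions chosen because their pinyin initials agree with the index strings in the text.

```python
# Hypothetical, hand-written pinyin-initial table; it covers only the demo terms below.
PINYIN_INITIAL = {
    "盘": "P", "尼": "N", "西": "X", "林": "L",
    "血": "X", "清": "Q", "谷": "G", "氨": "A", "酰": "X",
    "基": "J", "转": "Z", "移": "Y", "酶": "M",
}

def initial_of(ch: str) -> str:
    """Pinyin initial of a Chinese character found in the table.
    Letters, digits, Greek letters and other symbols are kept unchanged
    (so is any Chinese character missing from this small table)."""
    return PINYIN_INITIAL.get(ch, ch)

def initials_index(term: str) -> str:
    """Index built from pinyin initials only."""
    return "".join(initial_of(ch) for ch in term)

def positional_index(term: str) -> str:
    """Index built from pinyin initials plus 1-based position codes."""
    return "".join(f"{initial_of(ch)}{pos}" for pos, ch in enumerate(term, start=1))

print(initials_index("盘尼西林"))                  # PNXL
print(positional_index("盘尼西林"))                # P1N2X3L4
print(positional_index("血清γ-谷氨酰基转移酶"))    # X1Q2γ3-4G5A6X7J8Z9Y10M11
```

A real system would replace the hand-written table with a full pinyin dictionary; the sketch only shows how the position code is attached to each initial or symbol.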
Of course, many terms consist of different Chinese characters whose pinyin initials are nevertheless the same, so an index lookup may return a set of terms rather than a single term. If index lookups must be unique, four-corner codes can be introduced. The four-corner method is one of the common character lookup methods of Chinese dictionaries; it classifies Chinese characters using at most 5 Arabic digits.
If the target domain is the medical field, this matters little: the number of medical terms is not large, so even if the result retrieved through the pinyin index is not unique, the retrieval speed is still greatly improved.
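A non-unique index can be represented as a mapping from index key to a list of candidate terms. The sketch below is an assumption about one possible data structure, not something the patent specifies; `build_index` is a hypothetical name, and the demo terms 血糖 (blood glucose) and 血酮 (blood ketone) were picked only because both have the pinyin initials XT.

```python
from collections import defaultdict

def build_index(keyed_terms):
    """Group terms by index key; one key may map to several candidate terms."""
    index = defaultdict(list)
    for key, term in keyed_terms:
        index[key].append(term)
    return dict(index)

# Keys are pinyin-initial indexes computed beforehand (hypothetical demo data).
index = build_index([("XT", "血糖"), ("XT", "血酮"), ("PNXL", "盘尼西林")])

candidates = index.get("XT", [])
print(candidates)  # ['血糖', '血酮']
```

The lookup narrows the search to a short candidate list, which is then compared character by character, so a collision costs only a few extra comparisons.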
In addition, a word frequency probability can be set for each domain term. If the similarities calculated from adjacent words are close, the word frequency probabilities of the terms can be consulted: the higher the word frequency probability, the more likely the term is the correct one.
S20, searching a field term corresponding to the character string to be corrected in the text library based on a preset similarity algorithm of adjacent words.
The preset similarity algorithm of adjacent words provided in this embodiment is modeled on the n-gram algorithm. The n-gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM). The Chinese language model uses collocation information between adjacent words in the context: when continuous, unspaced pinyin, strokes, or digits representing letters or strokes must be converted into a Chinese character string (i.e., a sentence), the sentence with the highest probability can be computed. This realizes automatic conversion to Chinese characters without manual selection by the user and avoids the duplicate-code problem of many Chinese characters corresponding to the same pinyin (or stroke string or digit string).
The language model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and on no other word, and that the probability of the whole sentence is the product of the occurrence probabilities of the individual words.
For example, for a sequence of m words (a sentence), the probability P(w1,w2,…,wm) can be obtained from the chain rule:
P(w1,w2,…,wm)=P(w1)P(w2|w1)P(w3|w1,w2)…P(wm|w1,…,wm-1);
This probability is clearly hard to compute, so the Markov assumption is used: the current word depends only on a limited number of preceding words, so there is no need to trace back to the first word, and the above expression is greatly shortened. That is,
P(wi|w1,…,wi-1)=P(wi|wi-n+1,…,wi-1);
In particular, for small values of n:
when n=1, the unigram model is P(w1,w2,…,wm)=∏(i=1..m) P(wi);
when n=2, the bigram model is P(w1,w2,…,wm)=∏(i=1..m) P(wi|wi-1);
when n=3, the trigram model is P(w1,w2,…,wm)=∏(i=1..m) P(wi|wi-2,wi-1).
The parameters can then be estimated by the maximum likelihood method, so that the probability of the training samples is maximized.
Here C(w1,…,wn) denotes the number of occurrences of the n-gram w1,…,wn in the training corpus, and M is the total number of words in the corpus (e.g., M=5 for "yes no no no yes").
For the unigram model,
P(wi)=C(wi)/M;
For the bigram model,
P(wi|wi-1)=C(wi-1,wi)/C(wi-1);
For the n-gram model,
P(wi|wi-n+1,…,wi-1)=C(wi-n+1,…,wi)/C(wi-n+1,…,wi-1).
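The maximum likelihood estimates for the unigram and bigram cases reduce to counting. The following is a minimal sketch under the assumption that the corpus is a plain list of tokens; the function names `train`, `p_unigram`, and `p_bigram` are illustrative, and the five-token corpus is toy data.

```python
from collections import Counter

def train(tokens):
    """Count unigrams and adjacent bigrams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(tokens)

def p_unigram(w, unigrams, total):
    """P(wi) = C(wi) / M."""
    return unigrams[w] / total

def p_bigram(w, prev, unigrams, bigrams):
    """P(wi | wi-1) = C(wi-1, wi) / C(wi-1); 0.0 if wi-1 was never seen."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

tokens = "yes no no no yes".split()   # toy corpus with M = 5
unigrams, bigrams, total = train(tokens)

print(p_unigram("yes", unigrams, total))        # 0.4
print(p_bigram("no", "yes", unigrams, bigrams)) # 0.5
```

No smoothing is applied here, so unseen pairs get probability zero; practical language models usually smooth these counts.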
The n-gram technique is widely used for word segmentation, semantic analysis, text compression, spelling checking, accelerating string searching, and identifying the language of documents. Those application scenarios differ from the one here, so the algorithm here borrows the similarity calculation rule of the n-gram to derive its own calculation method:
First, the scenario: what is recognized is the core part of a hospital laboratory sheet, where the spacing between columns is large, so segmentation has already been done during recognition. During correction, the length of the character string is therefore accurate, while whether each Chinese character is correct is still undetermined.
Second, a bigram model is adopted, with the following definitions:
Definition 1: adjacent word two-tuple (nb): a two-tuple formed by 2 adjacent words of the recognition result (its position must be recorded for retrieval). For example, the four-character term for uric acid detection yields three adjacent word two-tuples: characters 1-2 (uric acid), characters 2-3, and characters 3-4 (detection). The set of these three two-tuples is NB. The number of elements of NB is |NB|=n-1, where n is the length of the recognition result string.
Definition 2: two-tuple search result: the two-tuples of adjacent words of the recognition result are searched one by one in the term library. The result of each search may be a set (terms whose length does not match are discarded during the search) and is denoted Rnb (nb∈NB), where each element (i.e., a term) is r.
Definition 3: the search total Unb is the collection of all Rnb:
Unb=Rnb1+…+Rnbi+…+Rnbm (m is |NB|).
Definition 4: when the elements are collected, the number of repetitions of each element is recorded; this is its similarity Sr.
The similarity algorithm of adjacent words is thus as follows:
decomposing the character string to be corrected into a set of two-tuples of adjacent words;
searching the text library for each two-tuple in the set to obtain a search set corresponding to each two-tuple;
calculating the similarity of each domain term in the search sets;
determining, for each two-tuple, the domain term with the highest similarity, and taking the domain term with the highest similarity among them as the domain term corresponding to the character string to be corrected.
In addition, if the similarities of several domain terms are equal, the one with the highest word frequency probability among them is determined as the domain term corresponding to the character string to be corrected.
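The steps above can be sketched in a few lines. This is one possible reading, not the patent's implementation, and it rests on several assumptions: each pair is matched at its recorded position, terms of a different length are discarded, the term library is a list of unique strings scanned linearly (no pinyin index), and the function names and Latin-letter demo terms are illustrative only.

```python
from collections import Counter

def adjacent_pairs(s):
    """Definition 1: (position, two-character pair) for each adjacent pair, n - 1 in total."""
    return [(i, s[i:i + 2]) for i in range(len(s) - 1)]

def search_pair(pos, pair, terms, length):
    """Definition 2: terms of matching length that contain `pair` at position `pos`."""
    return [t for t in terms if len(t) == length and t[pos:pos + 2] == pair]

def correct(recognized, terms, word_freq=None):
    """Return the term with the highest similarity, breaking ties by word frequency.
    Returns None when no pair of the recognized string matches any term."""
    word_freq = word_freq or {}
    similarity = Counter()  # Definitions 3 and 4: repetition count over all search sets
    for pos, pair in adjacent_pairs(recognized):
        similarity.update(search_pair(pos, pair, terms, len(recognized)))
    if not similarity:
        return None
    return max(similarity, key=lambda t: (similarity[t], word_freq.get(t, 0.0)))

terms = ["glucose", "sucrose", "lactose"]
print(correct("glucoze", terms))  # glucose

# Equal similarity: the term with the higher word frequency probability wins.
print(correct("abzz", ["abcd", "abxy"], {"abcd": 0.2, "abxy": 0.7}))  # abxy
```

Sorting by the pair (similarity, word frequency) applies the frequency only when similarities are exactly equal, which matches the tie-break rule stated above; using frequency when similarities are merely close would need a threshold that the text does not specify.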
In this embodiment, a text library of domain terms is created, where each domain term in the text library has a corresponding index; searching a domain term corresponding to the character string to be corrected in the text library based on a preset similarity algorithm of adjacent words. Therefore, the recognition result can be corrected by using the field terms, and the recognition accuracy can be effectively improved.
The embodiment of the invention also provides a device for identifying and checking the character string, as shown in fig. 2, which comprises:
A creating module 10, configured to create a text library of domain terms, where each domain term in the text library has a corresponding index;
The searching module 20 is configured to search the text library for a domain term corresponding to the character string to be corrected based on a preset similarity algorithm of adjacent words.
Further, the device may further include: an establishing module, configured to establish the index of each domain term according to the pinyin of its Chinese characters; or, to establish the index of each domain term according to the pinyin and the positions of its Chinese characters.
Further, the device may further include: a setting module, configured to set a word frequency probability for each domain term.
Further, the searching module 20 is specifically configured to search for the domain term corresponding to the character string to be corrected by the following algorithm:
decomposing the character string to be corrected into a set of two-tuples of adjacent words;
searching the text library for each two-tuple in the set to obtain a search set corresponding to each two-tuple;
calculating the similarity of each domain term in the search sets;
determining, for each two-tuple, the domain term with the highest similarity, and taking the domain term with the highest similarity among them as the domain term corresponding to the character string to be corrected.
Further, the device further comprises:
a determining module, configured to determine, if the similarities of several domain terms are equal, the one with the highest word frequency probability among them as the domain term corresponding to the character string to be corrected.
The index rules, the word frequency probabilities, and the similarity algorithm of adjacent words are the same as described in the method embodiment above and will not be described here again.
In this embodiment, a text library of domain terms is created, where each domain term in the text library has a corresponding index; searching a domain term corresponding to the character string to be corrected in the text library based on a preset similarity algorithm of adjacent words. Therefore, the recognition result can be corrected by using the field terms, and the recognition accuracy can be effectively improved.
Fig. 3 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application. The device includes one or more processors 301 and a memory 302; one processor 301 is taken as an example in Fig. 3. The processor 301 and the memory 302 may be connected by a bus or by other means; connection by a bus is taken as an example in Fig. 3.
The memory 302, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the character string identification and verification device in the embodiment of the present invention. By running the non-volatile software programs, instructions, and modules stored in the memory 302, the processor 301 executes the various functional applications and data processing of the server, that is, implements the character string identification and verification method of the above method embodiment.
The memory 302 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created through use of the character string identification and verification device, and the like. In addition, the memory 302 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 302 may optionally include memory located remotely from the processor 301, and such remote memory may be connected to the character string identification and verification device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device can carry out the device or method provided by the embodiments of the present application, and has the corresponding functional modules and beneficial effects. For technical details not described in this embodiment, refer to the device or method provided in the embodiments of the present application.
Also, the system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a general-purpose hardware platform, or by hardware. Based on this understanding, the technical solution of the present invention may be embodied, in essence or in whole or in part, in the form of a software product stored in a storage medium, the software product including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (2)

1. A character string identification and verification method, characterized in that it is applied to the medical field and comprises the following steps:
creating a text library of domain terms, wherein each domain term in the text library has a corresponding index; and
searching the text library, based on a preset similarity algorithm of adjacent words, for the domain term corresponding to the character string to be corrected;
wherein creating the text library of domain terms comprises: establishing the index of each domain term according to the pinyin of its Chinese characters; or establishing the index of each domain term according to the pinyin and positions of its Chinese characters;
wherein establishing the index of each domain term comprises: setting a word frequency probability for each domain term;
wherein searching the text library, based on the preset similarity algorithm of adjacent words, for the domain term corresponding to the character string to be corrected comprises:
searching for the domain term corresponding to the character string to be corrected by the following algorithm:
decomposing the character string to be corrected into a set of two-tuples of adjacent words;
searching the text library for each two-tuple in the set to obtain a search set corresponding to each two-tuple;
calculating, as the similarity, the number of times each domain term is repeated across the search sets; and
determining, from the search sets corresponding to the two-tuples, the domain term with the highest similarity, and determining that domain term as the domain term corresponding to the character string to be corrected;
wherein calculating the similarity of each domain term in the search sets further comprises: if domain terms have equal similarity, determining the domain term with the highest word frequency probability among them as the domain term corresponding to the character string to be corrected.
2. A character string identification and verification device, characterized in that it is applied to the medical field and comprises:
a creation module, configured to create a text library of domain terms, wherein each domain term in the text library has a corresponding index; and
a searching module, configured to search the text library, based on a preset similarity algorithm of adjacent words, for the domain term corresponding to the character string to be corrected;
wherein the creation module comprises:
an establishing module, configured to establish the index of each domain term according to the pinyin of its Chinese characters, or to establish the index of each domain term according to the pinyin and positions of its Chinese characters;
wherein the establishing module comprises:
a setting module, configured to set a word frequency probability for each domain term;
wherein the searching module is specifically configured to search for the domain term corresponding to the character string to be corrected by the following algorithm:
decomposing the character string to be corrected into a set of two-tuples of adjacent words;
searching the text library for each two-tuple in the set to obtain a search set corresponding to each two-tuple;
calculating, as the similarity, the number of times each domain term is repeated across the search sets; and
determining, from the search sets corresponding to the two-tuples, the domain term with the highest similarity, and determining that domain term as the domain term corresponding to the character string to be corrected;
the device being characterized by further comprising:
a determining module, configured to determine, if domain terms have equal similarity, the domain term with the highest word frequency probability among them as the domain term corresponding to the character string to be corrected.
CN201810221541.5A 2018-03-17 2018-03-17 Character string identification and verification method and device Active CN108564086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810221541.5A CN108564086B (en) 2018-03-17 2018-03-17 Character string identification and verification method and device

Publications (2)

Publication Number Publication Date
CN108564086A CN108564086A (en) 2018-09-21
CN108564086B true CN108564086B (en) 2024-05-10

Family

ID=63532966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810221541.5A Active CN108564086B (en) 2018-03-17 2018-03-17 Character string identification and verification method and device

Country Status (1)

Country Link
CN (1) CN108564086B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109413504B (en) * 2018-09-30 2021-04-09 武汉斗鱼网络科技有限公司 Bullet screen checking method, device, terminal and storage medium based on character string replacement
CN111898612A (en) * 2020-06-30 2020-11-06 北京来也网络科技有限公司 OCR identification method and device, equipment and medium combining RPA and AI

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003331214A (en) * 2002-05-15 2003-11-21 Nippon Telegr & Teleph Corp <Ntt> Character recognition error correction method, device and program
CN102324233A (en) * 2011-08-03 2012-01-18 中国科学院计算技术研究所 An Automatic Correction Method for Repeated Word Recognition Errors in Chinese Speech Recognition
CN102375807A (en) * 2010-08-27 2012-03-14 汉王科技股份有限公司 Method and device for proofing characters
CN103530840A (en) * 2013-10-10 2014-01-22 中国中医科学院 Accurate and quick electronic medical record type-in system
CN103870575A (en) * 2014-03-19 2014-06-18 北京百度网讯科技有限公司 Method and device for extracting domain keywords
US9037967B1 (en) * 2014-02-18 2015-05-19 King Fahd University Of Petroleum And Minerals Arabic spell checking technique
CN105512110A (en) * 2015-12-15 2016-04-20 江苏科技大学 Wrong word knowledge base construction method based on fuzzy matching and statistics
CN105550173A (en) * 2016-02-06 2016-05-04 北京京东尚科信息技术有限公司 Text correction method and device
CN106127265A (en) * 2016-06-22 2016-11-16 北京邮电大学 A kind of text in picture identification error correction method based on activating force model
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
CN106528846A (en) * 2016-11-21 2017-03-22 广州华多网络科技有限公司 Retrieval method and device
CN106682397A (en) * 2016-12-09 2017-05-17 江西中科九峰智慧医疗科技有限公司 Knowledge-based electronic medical record quality control method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005352888A (en) * 2004-06-11 2005-12-22 Hitachi Ltd Notation shaking correspondence dictionary creation system
US9361531B2 (en) * 2014-07-21 2016-06-07 Optum, Inc. Targeted optical character recognition (OCR) for medical terminology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于OCR技术的化验单识别方法研究;王宸敏;《中国优秀硕士学位论文全文数据库 信息科技辑》;20180115(第01期);I138-1250 *
基于统计和特征相结合的查询纠错方法;段建勇 等;《现代图书情报技术》;20160228(第2期);第34-42页 *

Also Published As

Publication number Publication date
CN108564086A (en) 2018-09-21

Similar Documents

Publication Publication Date Title
CN110502621B (en) Question answering method, question answering device, computer equipment and storage medium
US8881005B2 (en) Methods and systems for large-scale statistical misspelling correction
US8606559B2 (en) Method and apparatus for detecting errors in machine translation using parallel corpus
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
Chanlekha et al. Thai named entity extraction by incorporating maximum entropy model with simple heuristic information
US8725497B2 (en) System and method for detecting and correcting mismatched Chinese character
CN112231451B (en) Reference word recovery method and device, conversation robot and storage medium
US9575957B2 (en) Recognizing chemical names in a chinese document
CN111046660B (en) Method and device for identifying text professional terms
CN112381038B (en) Text recognition method, system and medium based on image
Fahda et al. A statistical and rule-based spelling and grammar checker for Indonesian text
Chen et al. Integrating natural language processing with image document analysis: what we learned from two real-world applications
CN108564086B (en) Character string identification and verification method and device
CN111782892B (en) Similar character recognition method, device, apparatus and storage medium based on prefix tree
CN110866390B (en) Method and device for recognizing Chinese grammar error, computer equipment and storage medium
CN113033185A (en) Standard text error correction method and device, electronic equipment and storage medium
Yang et al. Spell Checking for Chinese.
Gholami-Dastgerdi et al. Part of speech tagging using part of speech sequence graph
Mittra et al. A bangla spell checking technique to facilitate error correction in text entry environment
Kang et al. Two approaches for the resolution of word mismatch problem caused by English words and foreign words in Korean information retrieval
Mohapatra et al. Spell checker for OCR
CN113987135B (en) A bank product problem retrieval method and device
Hladek et al. Unsupervised spelling correction for Slovak
KS et al. Automatic error detection and correction in malayalam
CN109086272B (en) Sentence pattern recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240403

Address after: 200333, Block E, East Side, 3rd Floor, Building 3, No. 14, Lane 172, Jinshajiang Road, Putuo District, Shanghai

Applicant after: SHANGHAI KEDU MEDICAL TECHNOLOGY CO.,LTD.

Country or region after: China

Address before: 518000, B202-33, 2nd Floor, Building 3, Yu'anju Tongjian Building, Xin'an Street, Bao'an District, Shenzhen City, Guangdong Province

Applicant before: SHENZHEN JIKE SISUO TECHNOLOGY CO.,LTD.

Country or region before: China

GR01 Patent grant