
CN107463928A - Word sequence error correction algorithm, system and device based on OCR and bidirectional LSTM - Google Patents

Word sequence error correction algorithm, system and device based on OCR and bidirectional LSTM

Info

Publication number
CN107463928A
CN107463928A (application CN201710630581.0A)
Authority
CN
China
Prior art keywords
ocr
bidirectional lstm
context
error correction
context vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710630581.0A
Other languages
Chinese (zh)
Inventor
王志成
邝展豪
高磊
刘志欣
王亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SF Technology Co Ltd
SF Tech Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SF Technology Co Ltd filed Critical SF Technology Co Ltd
Priority to CN201710630581.0A priority Critical patent/CN107463928A/en
Publication of CN107463928A publication Critical patent/CN107463928A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/98: Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Character Discrimination (AREA)

Abstract

A word sequence error correction algorithm, system and device based on OCR and bidirectional LSTM. The method comprises: S1: acquire a character image; S2: preprocess the character image with OCR to obtain a first sequence set X = {x_0, x_1, ..., x_m}; S3: feed the forward sequence {x_0, x_1, ..., x_m} and the reversed sequence {x_m, x_{m-1}, ..., x_0} into an encoder built from a bidirectional LSTM to obtain a context vector c; S4: decode the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y. The system comprises an image acquisition module, an OCR processing module, an encoder built from a bidirectional LSTM and a decoder built from a bidirectional LSTM. The device carries an executable program that performs the method.

Description

Word sequence error correction algorithm, system and device based on OCR and bidirectional LSTM
Technical field
The present invention relates to the field of machine translation within image-based character recognition, and more particularly to a word sequence error correction algorithm, system and device based on OCR and bidirectional LSTM.
Background technology
In recent years, with the rapid development of machine learning, machine translation algorithms have emerged in large numbers, among which OCR text recognition algorithms are widely used. OCR (Optical Character Recognition) refers to the process by which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and bright areas, and then translates those shapes into computer text using character recognition methods. That is, for printed characters, the text on a paper document is optically converted into a black-and-white dot-matrix image file, and recognition software converts the text in the image into a text format that word processing software can further edit.
However, due to the influence of factors such as illumination and shooting angle, the accuracy of OCR text recognition rarely reaches the desired level.
Summary of the invention
To solve the above technical problem, the present invention proposes a word sequence error correction algorithm, system and device based on OCR and bidirectional LSTM, which can effectively improve the accuracy of word sequence recognition.
To achieve this goal, the technical solution of the present invention is as follows:
A word sequence error correction algorithm based on OCR and bidirectional LSTM, suitable for recognizing text in images, comprising the steps of:
S1: acquire a character image;
S2: preprocess the character image with OCR to obtain a first sequence set X = {x_0, x_1, ..., x_m};
S3: feed the forward sequence {x_0, x_1, ..., x_m} and the reversed sequence {x_m, x_{m-1}, ..., x_0} into an encoder built from a bidirectional LSTM to obtain a context vector c;
S4: decode the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y.
The context vector c in step S3 is:
c = Φ({h_1, h_2, ..., h_TS});
h_t = f(x_t, h_{t-1}).
The second sequence set Y in step S4 is:
Y = (y_0, y_1, ..., y_n);
s_t = f(y_{t-1}, s_{t-1}, c);
p(y_t | y_{<t}, X) = g(y_{t-1}, s_t, c).
The character image in step S1 is an express waybill image.
The threshold used in the OCR preprocessing in step S2 is the minimum confidence threshold allowed by the system. An illustrative sketch of steps S1 to S4 follows.
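For illustration, the four steps can be sketched as a short Python pipeline. The object interfaces used below (ocr_engine.recognize, encoder.encode, decoder.decode) are placeholders assumed for this sketch, not interfaces defined by the invention:

# Sketch of steps S1-S4; the helper objects are assumptions for illustration only.
def correct_word_sequence(image, ocr_engine, encoder, decoder):
    # S1/S2: OCR preprocessing of the character image yields the first
    # sequence set X = {x_0, x_1, ..., x_m}, here a list of characters.
    X = ocr_engine.recognize(image)
    # S3: the forward sequence and its reverse are fed to the encoder
    # built from a bidirectional LSTM, which returns the context vector c.
    c = encoder.encode(X, list(reversed(X)))
    # S4: the decoder built from a bidirectional LSTM turns c into the
    # second sequence set Y, the corrected word sequence.
    Y = decoder.decode(c)
    return Y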
A word sequence error correction system based on OCR and bidirectional LSTM, comprising:
an image acquisition module for acquiring a character image;
an OCR processing module for performing OCR preprocessing on the character image to obtain a first sequence set X = {x_0, x_1, ..., x_m};
an encoder built from a bidirectional LSTM for encoding the forward sequence {x_0, x_1, ..., x_m} and the reversed sequence {x_m, x_{m-1}, ..., x_0} to obtain a context vector c;
a decoder built from a bidirectional LSTM for decoding the context vector c to obtain a second sequence set Y.
A word sequence error correction device based on OCR and bidirectional LSTM, comprising a computer-readable medium storing a computer program which, when run, performs:
S1: acquire a character image;
S2: preprocess the character image with OCR to obtain a first sequence set X = {x_0, x_1, ..., x_m};
S3: feed the forward sequence {x_0, x_1, ..., x_m} and the reversed sequence {x_m, x_{m-1}, ..., x_0} into an encoder built from a bidirectional LSTM to obtain a context vector c;
S4: decode the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y.
The beneficial effect of the invention is as follows: by using OCR and bidirectional LSTM algorithms in combination, the accuracy of text recognition is improved.
Brief description of the drawings
Fig. 1 shows a flow chart according to an embodiment of the present application.
Fig. 2 shows an operational flow chart of the bidirectional LSTM according to an embodiment of the present application.
Fig. 3 shows an encoding flow chart of the bidirectional LSTM according to an embodiment of the present application.
Detailed description of the embodiments
To better understand the technical solution, the invention is further described below with reference to Figs. 1-3.
As shown in Fig. 1, the word sequence error correction algorithm based on OCR and bidirectional LSTM is suitable for recognizing text in images. It combines artificial intelligence and big data to process the input text queue in real time, realizing real-time processing and application of text information. The algorithm includes the following steps.
Obtain character image and carry out OCR pretreatments.
It is originally inputted as express delivery single image information, is pre-processed via Text region OCR, obtained OCR result queue, OCR Input of the result queue as Language Model, and combine mass text dictionary, obtain desired output sequence.
To compensate for the relatively low accuracy of OCR word sequence recognition (29.65% in sample statistics), the algorithm sets a minimum confidence threshold: characters whose confidence exceeds this threshold are taken as the final OCR output character queue and fed into the language model for processing.
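A minimal sketch of this confidence filter follows, assuming the OCR engine returns (character, confidence) pairs; the concrete threshold value of 0.6 is an illustrative assumption, since the patent does not fix a number:

MIN_CONFIDENCE = 0.6  # illustrative value for the minimum confidence threshold allowed by the system

def filter_ocr_results(ocr_results, threshold=MIN_CONFIDENCE):
    # Keep only characters whose recognition confidence exceeds the threshold;
    # the surviving characters form the OCR output queue fed to the language model.
    return [char for char, confidence in ocr_results if confidence > threshold]

# Example with assumed per-character confidences from a waybill field:
ocr_queue = filter_ocr_results([("S", 0.95), ("F", 0.92), ("8", 0.41), ("6", 0.88)])
# ocr_queue == ["S", "F", "6"]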
Because a standard recurrent neural network (RNN) can only access contextual information within a limited range, the influence of the network input on its output weakens as the recurrence proceeds. As shown in Fig. 2, to solve this problem a bidirectional LSTM model (long short-term memory network) is used to map one sequence, serving as the input, onto another sequence, serving as the output; this process consists of two stages, encoding the input and decoding the output. For example, an existing sequence {x_0, x_1, ..., x_m}, after being passed into the model element by element, is mapped to the output (y_0, y_1, ..., y_n).
The core framework of the bidirectional LSTM is the Encoder-Decoder. Put simply, after the input sequence is passed into the model, the encoder first compresses it into a vector of fixed length, i.e. the context vector. Once encoding is complete, the context variable enters the decoder to be decoded; a locally optimal decoding algorithm selects a candidate and the dictionary is retrieved before the device produces its output, so as to obtain the optimal choice.
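One possible reading of the locally optimal selection combined with dictionary retrieval is sketched below: at each decoding step the highest-probability candidate that also appears in the reference dictionary is chosen. Restricting candidates to dictionary entries is an assumption made for illustration; the patent only states that the dictionary is consulted before the device outputs:

def greedy_step(step_probs, vocab, dictionary):
    # Locally optimal (greedy) choice: rank candidates by probability and,
    # where possible, consider only characters found in the reference dictionary.
    candidates = [(p, ch) for p, ch in zip(step_probs, vocab) if ch in dictionary]
    if not candidates:
        candidates = list(zip(step_probs, vocab))
    return max(candidates)[1]

vocab = ["A", "B", "C", "D"]
dictionary = {"A", "B", "D"}
print(greedy_step([0.10, 0.20, 0.45, 0.25], vocab, dictionary))  # "D", since "C" is not in the dictionary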
More specifically, for a given input first sequence set X, the Encoder-Decoder framework generates the expected target second sequence set Y. X and Y are each composed of their respective sequences:
X = {x_0, x_1, ..., x_m}, whose order is the order of the character string itself;
Y = (y_0, y_1, ..., y_n).
Here m and n are positive integers: m is the length of the input sequence minus 1 and n is the length of the output sequence minus 1; m and n are not necessarily equal, and output stops when the decoder emits the end-of-sequence symbol. First, as shown in the equations below, the input sequence {x_0, x_1, ..., x_m} and the reversed sequence {x_m, x_{m-1}, ..., x_0} are passed through the encoder built from a bidirectional LSTM, which recurrently produces the hidden nodes h_t one by one; the weighted sum of the hidden nodes h_t is the context vector c. The concept of a hidden node is as follows: in a neural network, all nodes other than the input and output nodes may be called hidden nodes; more precisely, here they should be understood as the context vectors produced at each time step. Fig. 3 shows the process by which the bidirectional LSTM encodes to obtain c1 and c2, where c1 and c2 are two context vectors representing the forward and backward directions respectively.
h_t = f(x_t, h_{t-1})
c = Φ({h_1, h_2, ..., h_TS})
Here h denotes the context vector output by the encoder at each time step, and TS denotes the last time step. Φ denotes the process of stacking and fusing the h of all time steps on the encoder. f denotes the function (process) by which the encoder produces the context vector of the current time step from the context vector of the previous time step and the current input.
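A minimal PyTorch sketch of this encoder is given below. The embedding size, hidden size and the use of mean pooling as the fusion Φ are illustrative assumptions; the patent only requires that the per-step hidden states h_t be fused into a context vector c. Note that with bidirectional=True the per-step outputs already concatenate the forward and backward hidden states, which matches the fusion of c1 and c2 by direct concatenation described below:

import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True processes {x_0, ..., x_m} and {x_m, ..., x_0}
        # in one module, producing forward and backward hidden states.
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, x_ids):
        # x_ids: (batch, m + 1) integer ids of the OCR output characters.
        h_all, _ = self.lstm(self.embed(x_ids))   # h_t for every step, both directions
        # Phi: fuse the per-step hidden states into one context vector c.
        return h_all.mean(dim=1)                  # (batch, 2 * hidden_dim)

encoder = BiLSTMEncoder(vocab_size=5000)
c = encoder(torch.randint(0, 5000, (1, 12)))      # context vector for a 12-character sequence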
After encoding is completed, the context vectors c1 and c2 produced by the forward and backward encoding passes are fused (usually by direct concatenation) and fed to the decoder as the encoder's final context vector, yielding the final sequence set Y, which is the required output sequence.
s_t = f(y_{t-1}, s_{t-1}, c);
p(y_t | y_{<t}, X) = g(y_{t-1}, s_t, c).
Here s denotes the context vector produced by the decoder at each time step. f denotes the function (process) by which the decoder constructs the context vector of the current time step from the decoder context vector and output of the previous time step and the final context vector output by the encoder. g denotes the process by which the decoder produces the current output from the decoder context vector of the current time step, the decoder output of the previous time step and the final context vector output by the encoder. p denotes the probability of producing the next output given all previous inputs; X denotes the input dictionary vectors received by the encoder at each time step. The parameter t above is the time step, with 0 ≤ t ≤ m in the encoder and 0 ≤ t ≤ n in the decoder.
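The decoder recurrences can likewise be sketched in PyTorch. Using a single unidirectional LSTMCell, injecting c by concatenation and picking the argmax at each step are simplifications assumed for this sketch; the patent only fixes the dependencies s_t = f(y_{t-1}, s_{t-1}, c) and p(y_t | y_{<t}, X) = g(y_{t-1}, s_t, c), with the decoder built from a bidirectional LSTM:

import torch
import torch.nn as nn

class ContextDecoder(nn.Module):
    def __init__(self, vocab_size, context_dim, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim + context_dim, hidden_dim)            # plays the role of f
        self.out = nn.Linear(hidden_dim + context_dim + embed_dim, vocab_size)  # plays the role of g

    def forward(self, c, bos_id, eos_id, max_len=32):
        batch = c.size(0)
        y_prev = torch.full((batch,), bos_id, dtype=torch.long)
        s = (torch.zeros(batch, self.cell.hidden_size),
             torch.zeros(batch, self.cell.hidden_size))
        outputs = []
        for _ in range(max_len):
            e_prev = self.embed(y_prev)
            s = self.cell(torch.cat([e_prev, c], dim=-1), s)         # s_t = f(y_{t-1}, s_{t-1}, c)
            logits = self.out(torch.cat([s[0], c, e_prev], dim=-1))  # g(y_{t-1}, s_t, c)
            y_prev = logits.argmax(dim=-1)                           # locally optimal (greedy) choice
            if (y_prev == eos_id).all():                             # stop at the end-of-sequence symbol
                break
            outputs.append(y_prev)
        return outputs

decoder = ContextDecoder(vocab_size=5000, context_dim=256)
Y = decoder(torch.randn(1, 256), bos_id=1, eos_id=2)                 # list of predicted token ids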
A word sequence error correction system based on OCR and bidirectional LSTM, comprising:
an image acquisition module for acquiring a character image;
an OCR processing module for performing OCR preprocessing on the character image to obtain a first sequence set X = {x_0, x_1, ..., x_m};
an encoder built from a bidirectional LSTM for encoding the forward sequence {x_0, x_1, ..., x_m} and the reversed sequence {x_m, x_{m-1}, ..., x_0} to obtain a context vector c;
a decoder built from a bidirectional LSTM for decoding the context vector c to obtain a second sequence set Y.
A word sequence error correction device based on OCR and bidirectional LSTM, comprising a computer-readable medium storing a computer program which, when run, performs:
S1: acquire a character image;
S2: preprocess the character image with OCR to obtain a first sequence set X = {x_0, x_1, ..., x_m};
S3: feed the forward sequence {x_0, x_1, ..., x_m} and the reversed sequence {x_m, x_{m-1}, ..., x_0} into an encoder built from a bidirectional LSTM to obtain a context vector c;
S4: decode the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y.
The above description is only a preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by combining the above technical features or their equivalents without departing from the inventive concept, for example technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.

Claims (7)

1. A word sequence error correction algorithm based on OCR and bidirectional LSTM, suitable for recognizing text in images, characterized by comprising the steps of:
S1: acquire a character image;
S2: preprocess the character image with OCR to obtain a first sequence set X = {x_0, x_1, ..., x_m};
S3: feed the forward sequence {x_0, x_1, ..., x_m} and the reversed sequence {x_m, x_{m-1}, ..., x_0} into an encoder built from a bidirectional LSTM to obtain a context vector c;
S4: decode the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y.
2. The word sequence error correction algorithm based on OCR and bidirectional LSTM according to claim 1, characterized in that the context vector c in step S3 is:
c = Φ({h_1, h_2, ..., h_TS});
h_t = f(x_t, h_{t-1}).
3. The word sequence error correction algorithm based on OCR and bidirectional LSTM according to claim 1, characterized in that the second sequence set Y in step S4 is:
Y = (y_0, y_1, ..., y_n);
s_t = f(y_{t-1}, s_{t-1}, c);
p(y_t | y_{<t}, X) = g(y_{t-1}, s_t, c).
4. The word sequence error correction algorithm based on OCR and bidirectional LSTM according to claim 2 or 3, characterized in that the character image in step S1 is an express waybill image.
5. The word sequence error correction algorithm based on OCR and bidirectional LSTM according to claim 2 or 3, characterized in that the threshold used in the OCR preprocessing in step S2 is the minimum confidence threshold allowed by the system.
6. A word sequence error correction system based on OCR and bidirectional LSTM, characterized by comprising:
an image acquisition module for acquiring a character image;
an OCR processing module for performing OCR preprocessing on the character image to obtain a first sequence set X = {x_0, x_1, ..., x_m};
an encoder built from a bidirectional LSTM for encoding the forward sequence {x_0, x_1, ..., x_m} and the reversed sequence {x_m, x_{m-1}, ..., x_0} to obtain a context vector c;
a decoder built from a bidirectional LSTM for decoding the context vector c to obtain a second sequence set Y.
7. A word sequence error correction device based on OCR and bidirectional LSTM, comprising a computer-readable medium storing a computer program, characterized in that the program, when run, performs:
S1: acquire a character image;
S2: preprocess the character image with OCR to obtain a first sequence set X = {x_0, x_1, ..., x_m};
S3: feed the forward sequence {x_0, x_1, ..., x_m} and the reversed sequence {x_m, x_{m-1}, ..., x_0} into an encoder built from a bidirectional LSTM to obtain a context vector c;
S4: decode the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y.
CN201710630581.0A 2017-07-28 2017-07-28 Word sequence error correction algorithm, system and device based on OCR and bidirectional LSTM Pending CN107463928A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710630581.0A CN107463928A (en) 2017-07-28 2017-07-28 Word sequence error correction algorithm, system and device based on OCR and bidirectional LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710630581.0A CN107463928A (en) 2017-07-28 2017-07-28 Word sequence error correction algorithm, system and device based on OCR and bidirectional LSTM

Publications (1)

Publication Number Publication Date
CN107463928A true CN107463928A (en) 2017-12-12

Family

ID=60547822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710630581.0A Pending CN107463928A (en) 2017-07-28 2017-07-28 Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM

Country Status (1)

Country Link
CN (1) CN107463928A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150161991A1 (en) * 2013-12-10 2015-06-11 Google Inc. Generating representations of acoustic sequences using projection layers
CN105046289A (en) * 2015-08-07 2015-11-11 北京旷视科技有限公司 Text field type identification method and text field type identification system
CN105512692A (en) * 2015-11-30 2016-04-20 华南理工大学 BLSTM-based online handwritten mathematical expression symbol recognition method
CN105740226A (en) * 2016-01-15 2016-07-06 南京大学 Method for implementing Chinese segmentation by using tree neural network and bilateral neural network
CN105930314A (en) * 2016-04-14 2016-09-07 清华大学 Text summarization generation system and method based on coding-decoding deep neural networks
CN106604125A (en) * 2016-12-29 2017-04-26 北京奇艺世纪科技有限公司 Video subtitle determining method and video subtitle determining device
CN106960206A (en) * 2017-02-08 2017-07-18 北京捷通华声科技股份有限公司 Character identifying method and character recognition system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
商俊蓓: "Online handwritten digit and formula character recognition based on a bidirectional long short-term memory recurrent neural network" (基于双向长短时记忆递归神经网络的联机手写数字公式字符识别), China Master's Theses Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416349A (en) * 2018-01-30 2018-08-17 顺丰科技有限公司 Identification and correction system and method
CN109711412A (en) * 2018-12-27 2019-05-03 信雅达系统工程股份有限公司 A kind of optical character identification error correction method based on dictionary
CN110377591A (en) * 2019-06-12 2019-10-25 北京百度网讯科技有限公司 Training data cleaning method, device, computer equipment and storage medium
WO2021164310A1 (en) * 2020-02-21 2021-08-26 华为技术有限公司 Text error correction method and apparatus, and terminal device and computer storage medium
CN112507080A (en) * 2020-12-16 2021-03-16 北京信息科技大学 Character recognition and correction method
US11842524B2 (en) 2021-04-30 2023-12-12 International Business Machines Corporation Multi-modal learning based intelligent enhancement of post optical character recognition error correction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2017-12-12)