CN113837168A - Image text detection and OCR recognition method, device and storage medium - Google Patents
- Publication number
- CN113837168A (application CN202111118174.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- training
- image
- segmentation
- region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4007—Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
Abstract
The invention relates to the technical field of data recognition, in particular to a method, a device and a storage medium for image text detection and OCR recognition. The method comprises the following steps: preprocessing the picture to obtain training data; extracting preliminary features of the training data to obtain a return result and building a training network according to the return result; calling the training network with the training model to train on the training data and obtain a plurality of text segmentation examples; and post-processing the text segmentation examples with a watershed segmentation method to complete detection and recognition. Through these steps, post-processing the text segmentation examples with the watershed segmentation method effectively reduces the algorithm time complexity to O(N). This solves the problem that the breadth-first algorithm in the PSENet pipeline, which performs a pixel-by-pixel four-neighborhood search and merge for each text segmentation example, drives the time complexity of the detection stage to O(N²), making detection slow and inefficient; the image processing speed and efficiency are thereby improved.
Description
Technical Field
The invention relates to the technical field of data recognition, in particular to a method, a device and a storage medium for image text detection and OCR recognition.
Background
The core idea of deep-learning OCR methods is basically a deep object-detection algorithm strategy. The progressive scale expansion network PSENet is a method based on instance segmentation: image feature extraction is carried out with a CNN-based backbone, then a series of feature down-sampling, feature fusion and up-sampling operations are carried out on the feature maps with a spatial-pyramid-like network to obtain a group of text segmentation examples of a predefined number, and finally the text examples are connected into regions with a breadth-first algorithm.
The patent CN110008950A, "A method for detecting text in shape-robust natural scenes" (application published 2019.07.12), discloses a method for detecting text in shape-robust natural scenes, comprising the following steps: step 1, preprocessing the training pictures in a text data set; step 2, building a PSENet progressive scale growth network, and using it to complete feature extraction, feature fusion and segmentation prediction of a training picture, obtaining segmentation results at a plurality of prediction scales; step 3, performing supervised training on the PSENet progressive scale growth network built in step 2 to obtain a detector model; step 4, detecting the picture to be detected; and step 5, obtaining the final detection result with a scale growth algorithm.
However, for images with many text detection targets and with misaligned and overlapping text regions, the breadth-first algorithm in the PSENet pipeline performs a pixel-by-pixel four-neighborhood search and merge for each text segmentation example, which can drive the time complexity of the detection stage to O(N²); detection is slow and efficiency is low.
Disclosure of Invention
In order to solve the problem that the breadth-first algorithm in the PSENet pipeline, performing a pixel-by-pixel four-neighborhood search and merge for each text segmentation example, drives the time complexity of the detection stage to O(N²), with slow detection and low efficiency:
The invention provides an image text detection and OCR recognition method, which comprises the following steps:
preprocessing the picture to obtain training data;
extracting the preliminary features of the training data to obtain a return result and building a training network according to the return result;
the training model calls the training network to train the training data to obtain a plurality of text segmentation examples;
and processing a plurality of text segmentation examples by a watershed segmentation method to finish detection and identification.
Further, in a preferred embodiment, a text region of the picture is labeled, and the picture labeled with the text region carries the original text coordinate labels; the original text coordinate labels are processed to generate a plurality of text segmentation kernels with similar shapes, the same center point and different sizes as training data for the training network.
Further, in a preferred embodiment, the training network is a PSENet forward network;
and extracting preliminary features of the training data by loading a feature extraction model to obtain a return result, inputting the return result into the PSENet forward network, and building the PSENet forward network with a feature pyramid network in a top-down manner.
Further, in a preferred embodiment, the training model invoking the training network to train the training data to obtain a plurality of text segmentation instances includes the following steps:
training preparation: setting a hyper-parameter, selecting an optimizer, and setting a mode for reading the training data into the training model;
training process: calling the PSENet forward network, computing the current loss by comparing the predictions with the real labels through the loss function, computing and updating the network parameter gradients with the optimizer, training iteratively until the desired precision is reached, and persisting the model;
and outputting a plurality of text segmentation examples after training is completed.
Further, in a preferred embodiment, a dice coefficient is used to define the loss function; samples with a poor detection effect are screened out according to the loss of the training data passed into the model, and the screened hard samples are then extracted, combined and trained with stochastic gradient descent.
Further, in a preferred embodiment, processing a plurality of the text segmentation instances by a watershed segmentation method to determine a final text line region and a final background region, includes the following steps:
acquiring a foreground image mark, a background image mark and an uncertain region;
and operating a watershed segmentation algorithm to process the uncertain area to obtain a final text line area and a final background area.
Further, in a preferred embodiment, the obtaining of the foreground image mark, the background image mark and the uncertain region comprises the following steps:
marking pixels inside the minimum text segmentation example as a foreground area, and setting the pixel value of the area to be 255;
marking pixels outside the maximum text segmentation instance as a background region and setting the pixel value of the region to 128;
the region between the minimum text segmentation instance and the maximum text segmentation instance is taken as an uncertain region, and the pixel value of the region is set to 0.
Further, in a preferred embodiment, the step of operating the watershed segmentation algorithm to process the uncertain region to obtain the final text line region and the final background region comprises the following steps:
sorting the pixels in the gradient image of the uncertain region to obtain the geodesic distance threshold of the watershed segmentation algorithm, and marking the minimum value of the uncertain region as the lowest point;
continuously increasing the geodesic distance and screening out the pixels smaller than the current geodesic distance value; if the distance from a screened pixel to the lowest point is smaller than the geodesic distance threshold, the pixel is submerged; otherwise, the gray value of the screened pixel is taken as a local threshold, i.e. a dam is built, completing the classification of the local region into text and non-text regions;
and the geodesic distance keeps increasing up to the maximum gray value, so that the separation of the text regions from the background is completed and the classification of all pixels is determined.
The invention also provides an image text detection and OCR recognition device, which comprises
a preprocessing module: used for preprocessing the picture to obtain training data;
a training network building module: used for extracting preliminary features of the training data to obtain a return result and building a training network according to the return result;
a training module: used for calling the training network with the training model to train on the training data and obtain a plurality of text segmentation examples;
a processing module: used for processing a plurality of the text segmentation examples with a watershed segmentation algorithm to complete detection and recognition.
The invention also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement any one of the image text detection and OCR recognition methods described above.
Compared with the prior art, in the image text detection and OCR recognition method provided by the invention, through the above steps the watershed segmentation method replaces the breadth-first search (BFS) algorithm of the original PSENet algorithm for post-processing the text segmentation examples, effectively reducing the algorithm time complexity to O(N). This solves the problem that the breadth-first algorithm in the PSENet pipeline, performing a pixel-by-pixel four-neighborhood search and merge for each text segmentation example, drives the time complexity of the detection stage to O(N²), with slow detection and low efficiency, thereby improving the image processing speed and efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a step diagram of an image text detection and OCR recognition method provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc., indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Specific examples are given below:
In the invention, medical bill images are taken as an example of images with many text detection targets and with misaligned and overlapping text regions. Because a large amount of medical bill data has been accumulated, retraining can start directly rather than migrating the parameters of a pre-trained model for further training, so the method can be used for model training in a train-from-scratch mode.
An image text detection and OCR recognition method comprises the following steps:
preprocessing the picture to obtain training data;
extracting the preliminary features of the training data to obtain a return result and building a training network according to the return result;
the training model calls the training network to train the training data to obtain a plurality of text segmentation examples;
and processing a plurality of text segmentation examples by a watershed segmentation method to finish detection and identification.
Compared with the prior art, in the image text detection and OCR recognition method provided by the invention, through the above steps the watershed segmentation method replaces the breadth-first search (BFS) algorithm of the original PSENet algorithm for post-processing the text segmentation examples, effectively reducing the algorithm time complexity to O(N). This solves the problem that the breadth-first algorithm in the PSENet pipeline, performing a pixel-by-pixel four-neighborhood search and merge for each text segmentation example, drives the time complexity of the detection stage to O(N²), with slow detection and low efficiency, thereby improving the image processing speed and efficiency.
Specifically, the picture is preprocessed to obtain training data. The picture may be one shot in a natural scene; a text region of the picture is labeled, and the picture labeled with text regions carries the original text coordinate labels. A text region is a region containing text; the labeling may be done manually or by computer, and the labels are polygon coordinates, which may be the coordinates of the four corner points of a rectangular box;
according to the requirement of progressive scale expansion, the original text coordinate labels are processed with the Vatti clipping algorithm to generate a plurality of text segmentation kernels with similar shapes, the same center point and different sizes as training data for the training network.
Specifically, shrinking the original text coordinate labels with the Vatti clipping algorithm to obtain the text segmentation kernels comprises the following steps:
Calculating the shrink distance from the area and perimeter of the largest text segmentation kernel and the shrink ratio:
d_i = Area(p_1) × (1 − r_i²) / Perimeter(p_1)
In implementation, the original text coordinate label is shrunk to obtain a sequence of text segmentation kernels p_1, p_2, …, p_n, where the largest text segmentation kernel (i.e. the original one) is p_1, r_i is the shrink ratio of any text segmentation kernel p_i relative to the largest kernel p_1, d_i is the corresponding shrink distance, and Area and Perimeter are the area and perimeter of the largest text segmentation kernel;
calculating the shrink ratio from the number of text segmentation kernels and the reduction scale:
r_i = 1 − (1 − m) × (i − 1) / (n − 1)
where m is the reduction scale with range (0, 1) and n is the number of text segmentation examples, i.e. the number of text segmentation kernels; both m and n are hyper-parameters of the PSENet algorithm;
calculating the shrunk versions of the original text coordinate labels through the shrink formulas above to obtain a plurality of text segmentation kernels, which serve as the original input training data of the training network; the shrink formulas are the formulas for the shrink distance and the shrink ratio given above.
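As a small sketch, the shrink ratio and shrink distance can be computed directly in Python. The formulas follow the PSENet paper, adapted here under the assumption that p_1 is the original, largest kernel; the actual polygon shrinking is then performed with the Vatti clipping algorithm, e.g. via a polygon-clipping library.

```python
def shrink_ratios(m, n):
    """Shrink ratio r_i for kernels p_1..p_n: r_1 = 1 (the original,
    largest kernel) and r_n = m, decreasing in equal steps."""
    return [1 - (1 - m) * (i - 1) / (n - 1) for i in range(1, n + 1)]

def shrink_distance(area, perimeter, r_i):
    """Shrink distance d_i = Area * (1 - r_i**2) / Perimeter, where Area
    and Perimeter belong to the largest kernel p_1."""
    return area * (1 - r_i ** 2) / perimeter
```

With m = 0.5 and n = 6 (the values used later in this document), the ratios run from 1.0 down to 0.5 in equal steps of 0.1.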
Specifically, in the step of extracting the preliminary features of the training data to obtain a return result, and building a training network according to the return result:
the training network is a PSENet forward network; the feature extraction model is, but is not limited to, ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152, VGG-16, VGG-19, ShuffleNet or MobileNet. Preferably, the ResNet-152 model is selected: ResNet-152 is a network with a deeper structure that can extract more effective features and has better precision;
the preliminary features of the training data are extracted by loading a ResNet-152 model in PyTorch to obtain a return result, the return result is input into the PSENet forward network, and the PSENet forward network is built with a feature pyramid network in a top-down manner. The process by which the ResNet-152 model extracts the preliminary features of the training data and obtains the returned result is prior art and is not described again.
Specifically, inputting the returned results [c2, c3, c4, c5] into the PSENet forward network and building the PSENet training network with a feature pyramid network in a top-down manner comprises the following steps:
(1) p5 top-layer processing:
c5 → p5: 3×3 convolution, BN, ReLU activation;
(2) p4 upsampling:
c4 → c4l: 2×2 convolution, BN, ReLU activation;
[p5, c4l] → p4: bilinear_interpolation(p5) + c4l;
(3) p4 smoothing:
p4 → p4: size-preserving convolution, BN, ReLU activation;
(4) p3 upsampling and smoothing:
c3 → c3l: 1×1 convolution, BN, ReLU activation;
[p4, c3l] → p3: bilinear_interpolation(p4) + c3l;
the smoothing is the same as for p4;
(5) p2 upsampling and smoothing:
c2 → c2l: size-preserving convolution, BN, ReLU activation;
[p3, c2l] → p2: bilinear_interpolation(p3) + c2l;
the smoothing is the same as for p4;
(6) upsampling and concatenation:
taking the size of p2 as the reference, p3 to p5 are bilinearly interpolated to the size of p2, and p2 to p5 are then concatenated along the channel dimension with a concatenate operation, completing the construction of the PSENet forward network.
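The top-down merge and final concatenation can be sketched with plain NumPy as a simplified stand-in for the framework ops; the function names `upsample_add` and `fuse` are illustrative, and the real network additionally applies convolution, BN and ReLU at each step.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinearly resize a (C, H, W) feature map (align-corners style)."""
    c, h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]  # vertical interpolation weights
    wx = (xs - x0)[None, None, :]  # horizontal interpolation weights
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def upsample_add(p_coarse, c_lateral):
    """[p, cl] -> bilinear_interpolation(p) + cl, as in steps (2), (4), (5)."""
    _, h, w = c_lateral.shape
    return bilinear_resize(p_coarse, h, w) + c_lateral

def fuse(p2, p3, p4, p5):
    """Step (6): resize p3..p5 to p2's size, concatenate on channels."""
    _, h, w = p2.shape
    return np.concatenate(
        [p2] + [bilinear_resize(p, h, w) for p in (p3, p4, p5)], axis=0)
```

On a batch of real feature maps the same sequence is applied per image; here a constant map upsampled and added to a lateral map simply sums the two constants, which makes the arithmetic easy to check.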
Specifically, the training model calls a training network to train training data to obtain a plurality of text segmentation examples, and the method comprises the following steps:
training preparation: setting a hyper-parameter, selecting an optimizer, and setting a mode of reading training data into a training model;
the optimization method comprises the following steps that hyper-parameters comprise completion of learning rate and decay tasks, segmentation examples, batch _ size and epoch, an optimizer selects but is not limited to SGD and Adam, Adam is selected, Adam has the advantages of being capable of dynamically adjusting learning rate and the like, and training data are read into a training model through a generator function batch;
training process: calling the PSENet forward network, computing the current loss by comparing the model predictions with the real labels through the loss function, computing and updating the network parameter gradients with the optimizer, training iteratively until the desired precision is reached, and persisting the model:
Specifically, training proceeds in units of epochs; each epoch trains once over all the data, sub-batch by sub-batch (ignoring boundary issues). Each batch of data is passed into the model, the PSENet forward network is called, the predictions are compared with the real labels, the current loss is computed by the loss function, the network parameter gradients are computed and updated with the Adam optimizer, and training iterates until the desired precision is reached, after which the model is persisted. Through continuous model iteration, each model prediction is compared with the real labels; if the prediction basically agrees with the real labels, for example when the prediction precision reaches 95%, the model parameters at that moment are saved, i.e. stored persistently.
In the text detection of medical bills, the negative-sample area is far larger than the positive-sample area, so the loss function is defined with the dice coefficient; samples with a poor detection effect are screened out according to the loss of the training data passed into the model, and the screened hard samples are extracted, combined and trained with stochastic gradient descent. The dice coefficient is:
D(S, G) = 2 Σ_{x,y} (S_{x,y} G_{x,y}) / (Σ_{x,y} S_{x,y}² + Σ_{x,y} G_{x,y}²)
where S_{x,y} is the value of the predicted pixel and G_{x,y} is the value of the pixel in the real label.
The loss function is defined as L = λL_C + (1 − λ)L_S, where L_C is the classification loss of the text regions and L_S is the loss of the shrunk text examples, with
L_C = 1 − D(S_n × M, G_n × M)
where M is a 0/1 byte mask generated by an online hard example mining algorithm; samples with a poor detection effect are screened out according to the loss of the training data passed into the model, and the screened samples are then extracted, combined and trained with Adam.
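The dice coefficient and the composite loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the training implementation: the function names are illustrative, the shrunk-kernel loss L_S is supplied by the caller, and λ = 0.7 is the balance value used in the PSENet paper, assumed here.

```python
import numpy as np

def dice_coefficient(S, G, eps=1e-6):
    """D(S, G) = 2 * sum(S*G) / (sum(S^2) + sum(G^2)), summed over pixels."""
    S = S.astype(np.float64)
    G = G.astype(np.float64)
    return 2.0 * (S * G).sum() / ((S ** 2).sum() + (G ** 2).sum() + eps)

def psenet_style_loss(S_n, G_n, M, L_s, lam=0.7):
    """L = lam * L_C + (1 - lam) * L_S, with L_C = 1 - D(S_n*M, G_n*M).
    M is the 0/1 OHEM mask restricting the classification loss to the
    pixels kept by online hard example mining."""
    L_c = 1.0 - dice_coefficient(S_n * M, G_n * M)
    return lam * L_c + (1.0 - lam) * L_s
```

A perfect prediction gives D = 1 and thus L_C = 0; a completely disjoint prediction gives D = 0 and L_C = 1.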
Specifically, the method for processing a plurality of text segmentation examples by a watershed segmentation method to determine a final text line region and a final background region comprises the following steps:
firstly, obtaining a foreground image mark, specifically marking pixels inside a minimum text segmentation example as a foreground area, and setting the pixel value of the area to be 255; acquiring a background image mark, specifically, marking a pixel outside a maximum text segmentation example as a background area, and setting a pixel value of the area as 128; and acquiring an uncertain region, specifically, taking a region between the minimum text segmentation example and the maximum text segmentation example as the uncertain region, and setting the pixel value of the region to be 0.
Secondly, operating a watershed segmentation algorithm to process the uncertain region to obtain a final text line region and a final background region, specifically comprising the following steps:
sorting the pixels in the gradient image of the uncertain region to obtain the geodesic distance threshold of the watershed segmentation algorithm, and marking the minimum value of the uncertain region as the lowest point; specifically, the geodesic distance threshold of the watershed algorithm is obtained by running the OTSU algorithm;
continuously increasing the geodesic distance and screening out the pixels smaller than the current geodesic distance value; if the distance from a screened pixel to the lowest point is smaller than the geodesic distance threshold, the pixel is submerged; otherwise, the gray value of the pixel is taken as a local threshold, i.e. a dam is built, completing the classification of the local region into text and non-text regions;
the geodesic distance keeps increasing up to the maximum gray value, all regions meet at the watershed lines, the separation of the text regions from the background is completed, the classification of all pixels is determined, and the final text line regions and the final background region are obtained.
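In practice an optimized implementation such as OpenCV's `cv2.watershed` would run this flooding; as an illustration only, the procedure can be reduced to a marker-driven priority flood in pure Python (explicit dam pixels are omitted and labels simply propagate in order of increasing gray value):

```python
import heapq

def watershed_flood(gray, markers):
    """gray: 2-D list of gray values; markers: 2-D list with 0 for
    uncertain pixels and positive integers for seeded regions.
    Floods outward from the seeds in order of increasing gray value
    and returns the label assigned to every pixel."""
    h, w = len(gray), len(gray[0])
    labels = [row[:] for row in markers]
    heap, counter = [], 0  # counter breaks ties deterministically
    for y in range(h):
        for x in range(w):
            if labels[y][x] != 0:
                heapq.heappush(heap, (gray[y][x], counter, y, x))
                counter += 1
    while heap:
        _, _, y, x = heapq.heappop(heap)
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # 4-neighborhood
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] == 0:
                labels[ny][nx] = labels[y][x]  # this basin submerges it
                heapq.heappush(heap, (gray[ny][nx], counter, ny, nx))
                counter += 1
    return labels
```

On a one-row image with a gray-value ridge in the middle, the two seeded basins flood toward the ridge and split the uncertain pixels between them.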
According to the content of the invention, in implementation m is set to 0.5 and n to 6, ResNet-152 is selected as the feature extraction network, M is set to 3, the training data are read into the model for training through a generator function in batches, and the Adam optimizer is adopted. During training, the input image dimensions are [B, 3, H, W], corresponding respectively to batch_size, the number of image channels, and the height and width of the image;
with the number of text segmentation examples set to 6, the batch of training image feature maps goes through image down-sampling, feature fusion and image up-sampling, and a batch with the same size as the original images, namely [B, 6, H, W], is output; for each text line of each image, 6 text segmentation results S_1, S_2, …, S_6 are generated;
Test experiments were carried out with medical bills (including outpatient invoices and hospitalization invoices), 1000 pictures of each type; the graphics card of the test equipment was a Tesla V100 with 32 GB of video memory. In the experiments, all pictures were limited to 1000 pixels on the shortest side.
Under the same conditions, the original BFS algorithm of PSENet is replaced by the watershed segmentation algorithm: the union of the minimum segmentation results S_6 of all text regions is taken as the watershed confident-foreground marker image, the negation of the maximum segmentation result S_1 is taken as the watershed confident-background marker image, and S_2, S_3, …, S_5 are treated as the uncertain region of the algorithm:
For the original PSENet algorithm, the accuracy reaches 92.37% and the FPS (the number of pictures processed by the model per second, including data pre-processing and post-processing) reaches 11; the accuracy of the present method reaches 92.51% and its FPS reaches 48. Obviously, while preserving precision, the processing speed of the present method is more than 4 times that of the original PSENet algorithm. Compared with the prior art, in the image text detection and OCR recognition method provided by the invention, replacing the breadth-first search (BFS) algorithm of the original PSENet algorithm with the watershed segmentation method for post-processing the text segmentation examples effectively reduces the algorithm time complexity to O(N); this solves the problem that the breadth-first algorithm in the PSENet pipeline, performing a pixel-by-pixel four-neighborhood search and merge for each text segmentation example, drives the time complexity of the detection stage to O(N²), with slow detection and low efficiency, thereby improving the image processing speed and efficiency.
The invention also provides an image text detection and OCR recognition system, which comprises a preprocessing module, used for preprocessing the picture to obtain training data; a training network building module, used for extracting preliminary features of the training data to obtain a return result and building a training network according to the return result; a training module, used for calling the training network with the training model to train on the training data and obtain a plurality of text segmentation examples; and a processing module, used for processing a plurality of the text segmentation examples with a watershed segmentation algorithm to complete detection and recognition. The image text detection and OCR recognition system provided by the invention improves the image processing speed and efficiency.
The present invention also provides a computer-readable storage medium having computer instructions stored thereon which, when executed by a processor, implement an image text detection and OCR recognition method as described in any of the above.
In specific implementation, the computer-readable storage medium is a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the computer-readable storage medium may also include a combination of the above kinds of memory.
Although terms such as training data, preliminary features, training network, training model, text segmentation example, and watershed segmentation are used frequently herein, the possibility of using other terms is not excluded. These terms are used merely to describe and explain the nature of the present invention more conveniently; construing them as imposing any additional limitation would be contrary to the spirit of the present invention.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention and not to limit it; while the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced, and such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. An image text detection and OCR recognition method, characterized in that it comprises the following steps:
preprocessing the picture to obtain training data;
extracting the preliminary features of the training data to obtain a return result and building a training network according to the return result;
the training model calls the training network to train the training data to obtain a plurality of text segmentation examples;
and processing a plurality of text segmentation examples by a watershed segmentation method to finish detection and identification.
2. The image text detection and OCR recognition method according to claim 1, characterized in that: a text region of the picture is marked, the picture marked with the text region being the original text coordinate label; and the original text coordinate label is processed to generate a plurality of text segmentation kernels with similar shapes, the same center point and different sizes, as the training data of the training network.
3. The image text detection and OCR recognition method according to claim 1, characterized in that: the training network is a PSENet forward network;
and the preliminary features of the training data are extracted by loading a feature extraction model to obtain a return result, the return result is input into the PSENet forward network, and the PSENet forward network is built with a feature pyramid network (FPN) in a top-down manner.
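The top-down construction referred to in claim 3 can be illustrated with a minimal sketch of FPN-style feature fusion. This is a deliberately simplified toy (plain nested lists stand in for feature maps, and the 1x1 lateral and 3x3 smoothing convolutions of a real FPN are omitted), not the claimed network itself:

```python
def fpn_top_down(features):
    """Toy FPN-style top-down merge: repeatedly upsample the coarser
    feature map 2x (nearest neighbour) and add it to the next finer map.
    `features` is ordered fine-to-coarse; lateral and smoothing
    convolutions are omitted for brevity."""
    merged = [features[-1]]  # start from the coarsest (topmost) map
    for fine in reversed(features[:-1]):
        coarse = merged[0]
        # nearest-neighbour 2x upsample: duplicate every row and column
        up = [[v for v in row for _ in (0, 1)]
              for row in coarse for _ in (0, 1)]
        fused = [[fine[y][x] + up[y][x] for x in range(len(fine[0]))]
                 for y in range(len(fine))]
        merged.insert(0, fused)
    return merged
```

In the real network each merged map would additionally pass through a smoothing convolution before the segmentation heads are attached.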
4. The image text detection and OCR recognition method according to claim 3, characterized in that: the training model calling the training network to train the training data to obtain a plurality of text segmentation examples comprises the following steps:
training preparation: setting the hyper-parameters, selecting an optimizer, and setting the mode for reading the training data into the training model;
training process: calling the PSENet forward network, calculating the current loss by comparing its output with the real labels through the loss function, using the optimizer to calculate and update the network parameter gradients, iterating the training until the desired precision is reached, and persisting the model;
and outputting a plurality of text segmentation examples after training is completed.
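The training loop of claim 4 can be sketched generically. The model below is a deliberately trivial stand-in (a single weight `w` fitted by gradient descent on a squared error); the learning rate, stopping precision, and pickle-based persistence are illustrative assumptions, not the invention's actual optimizer or network:

```python
import pickle
import random

def train_model(w, data, lr=0.1, target_loss=0.01, max_iters=1000,
                path="model.pkl"):
    """Generic sketch of the claimed loop: compute the loss against the
    real labels, update the parameter by its gradient, iterate until the
    desired precision is reached, then persist the model."""
    for _ in range(max_iters):
        x, label = random.choice(data)       # read training data
        loss = (w * x - label) ** 2          # compare with the real label
        if loss < target_loss:               # desired precision reached
            break
        w -= lr * 2 * (w * x - label) * x    # optimizer step (gradient descent)
    with open(path, "wb") as f:              # persist the trained model
        pickle.dump(w, f)
    return w
```

A real run would swap in the PSENet forward pass, the dice-based loss, and a framework optimizer, but the control flow is the same.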
5. The image text detection and OCR recognition method according to claim 4, characterized in that: the loss function is defined using the dice coefficient; samples with poor detection results are screened out according to the loss of the training data fed into the model; and the screened hard samples are extracted, combined, and trained with stochastic gradient descent.
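The dice-coefficient loss of claim 5 can be written out directly; the sketch below works on flattened probability maps and adds a small epsilon for numerical stability (the epsilon and the flattened-list interface are illustrative assumptions):

```python
def dice_loss(pred, target, eps=1e-6):
    """Dice loss: 1 - 2|P ∩ T| / (|P| + |T|). Because it normalises by
    the total mass of prediction and target, it is robust to the
    foreground/background imbalance typical of text segmentation maps."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)
```

Per-sample losses computed this way can then be ranked to screen out the worst-detected samples for the stochastic-gradient-descent step described in the claim.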
6. The image text detection and OCR recognition method according to claim 1, characterized in that: processing the plurality of text segmentation examples by the watershed segmentation method to determine the final text line region and the final background region comprises the following steps:
acquiring a foreground image mark, a background image mark and an uncertain region;
and operating a watershed segmentation algorithm to process the uncertain area to obtain a final text line area and a final background area.
7. The image text detection and OCR recognition method according to claim 6, characterized in that: acquiring the foreground image mark, the background image mark and the uncertain region comprises the following steps:
marking pixels inside the minimum text segmentation example as a foreground area, and setting the pixel value of the area to be 255;
marking pixels outside the maximum text segmentation instance as a background region and setting the pixel value of the region to 128;
the region between the minimum text segmentation instance and the maximum text segmentation instance is taken as an uncertain region, and the pixel value of the region is set to 0.
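The three-level marker map of claim 7 is straightforward to build; the sketch below assumes the kernels are given as nested 0/1 lists of equal size:

```python
def build_markers(min_kernel, max_kernel):
    """Build the marker map of claim 7:
    255 = definite text (inside the smallest segmentation kernel),
    128 = definite background (outside the largest kernel),
      0 = the uncertain ring in between, left for the watershed."""
    h, w = len(min_kernel), len(min_kernel[0])
    markers = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if min_kernel[y][x]:
                markers[y][x] = 255      # foreground mark
            elif not max_kernel[y][x]:
                markers[y][x] = 128      # background mark
    return markers
```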
8. The image text detection and OCR recognition method according to claim 6, characterized in that: running the watershed segmentation algorithm to process the uncertain region to obtain the final text line region and the final background region comprises the following steps:
sorting the pixels in the gradient image of the uncertain region to obtain the geodesic distance threshold of the watershed segmentation algorithm, and marking the minimum value of the uncertain region as the lowest point;
continuously increasing the geodesic distance and screening out the pixels smaller than the current geodesic distance value; if the distance from a screened pixel to the lowest point is smaller than the geodesic distance threshold, the pixel is submerged; otherwise, the gray value of the screened pixel is taken as a local threshold, i.e. a dam is built, completing the classification of the local region into text and non-text;
and continuing to increase the geodesic distance up to the maximum gray value, thereby completing the separation of the text region from the background and the classification attribution of all pixels.
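The flooding described in claims 6-8 can be sketched with a priority queue: pixels are claimed in order of increasing gray value, which is the standard marker-based watershed formulation. This is a simplified stand-in (integer seed labels instead of the 255/128 marker values, gray values as the geodesic-distance surrogate, and implicit dams, since each uncertain pixel is claimed by exactly one basin):

```python
import heapq

def watershed_flood(seed_labels, text_mask, gray):
    """Marker-based watershed on nested lists: grow each non-zero seed
    label over the uncertain pixels of `text_mask` (value 1, label 0),
    visiting pixels in order of increasing gray value so every uncertain
    pixel is claimed by the basin that floods it first."""
    h, w = len(gray), len(gray[0])
    labels = [row[:] for row in seed_labels]
    heap = [(gray[y][x], y, x)
            for y in range(h) for x in range(w) if labels[y][x]]
    heapq.heapify(heap)
    while heap:
        g, y, x = heapq.heappop(heap)
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w \
                    and text_mask[ny][nx] and not labels[ny][nx]:
                labels[ny][nx] = labels[y][x]   # submerged by this basin
                # keep flooding from the higher of the two gray levels
                heapq.heappush(heap, (max(g, gray[ny][nx]), ny, nx))
    return labels
```

Because every pixel enters the queue at most once, the pass is O(N log N) with a binary heap and effectively linear with the classical sorted-bucket (Vincent-Soille) variant, which is the complexity reduction claimed over the pixel-by-pixel BFS merge.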
9. An image text detection and OCR recognition device, characterized in that it comprises:
a preprocessing module, used for preprocessing the picture to obtain training data;
a training network building module, used for extracting the preliminary features of the training data to obtain a return result and building a training network according to the return result;
a training module, used for calling the training network with a training model to train the training data to obtain a plurality of text segmentation examples; and
a processing module, used for processing the plurality of text segmentation examples through a watershed segmentation algorithm to complete detection and recognition.
10. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the image text detection and OCR recognition method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111118174.4A CN113837168A (en) | 2021-09-22 | 2021-09-22 | Image text detection and OCR recognition method, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111118174.4A CN113837168A (en) | 2021-09-22 | 2021-09-22 | Image text detection and OCR recognition method, device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113837168A true CN113837168A (en) | 2021-12-24 |
Family
ID=78969694
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111118174.4A Pending CN113837168A (en) | 2021-09-22 | 2021-09-22 | Image text detection and OCR recognition method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113837168A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116630755A (en) * | 2023-04-10 | 2023-08-22 | 雄安创新研究院 | Method, system and storage medium for detecting text position in scene image |
CN116863482A (en) * | 2023-09-05 | 2023-10-10 | 华立科技股份有限公司 | A transformer detection method, device, equipment and storage medium |
CN116935394A (en) * | 2023-07-27 | 2023-10-24 | 南京邮电大学 | Train carriage number positioning method based on PSENT region segmentation |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011128070A (en) * | 2009-12-18 | 2011-06-30 | Hitachi High-Technologies Corp | Image processing device, measuring/testing system, and program |
CN102725773A (en) * | 2009-12-02 | 2012-10-10 | 惠普发展公司,有限责任合伙企业 | System and method of foreground-background segmentation of digitized images |
US20150078648A1 (en) * | 2013-09-13 | 2015-03-19 | National Cheng Kung University | Cell image segmentation method and a nuclear-to-cytoplasmic ratio evaluation method using the same |
CN110008950A (en) * | 2019-03-13 | 2019-07-12 | 南京大学 | A Shape-Robust Approach for Text Detection in Natural Scenes |
WO2019192397A1 (en) * | 2018-04-04 | 2019-10-10 | 华中科技大学 | End-to-end recognition method for scene text in any shape |
CN110766008A (en) * | 2019-10-29 | 2020-02-07 | 北京华宇信息技术有限公司 | Text detection method facing any direction and shape |
CN111145209A (en) * | 2019-12-26 | 2020-05-12 | 北京推想科技有限公司 | Medical image segmentation method, device, equipment and storage medium |
CN111738256A (en) * | 2020-06-02 | 2020-10-02 | 上海交通大学 | Composite CT image segmentation method based on improved watershed algorithm |
CN111798480A (en) * | 2020-07-23 | 2020-10-20 | 北京思图场景数据科技服务有限公司 | Character detection method and device based on single character and character connection relation prediction |
US20210034700A1 (en) * | 2019-07-29 | 2021-02-04 | Intuit Inc. | Region proposal networks for automated bounding box detection and text segmentation |
- 2021-09-22 CN CN202111118174.4A patent/CN113837168A/en active Pending
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102725773A (en) * | 2009-12-02 | 2012-10-10 | 惠普发展公司,有限责任合伙企业 | System and method of foreground-background segmentation of digitized images |
JP2011128070A (en) * | 2009-12-18 | 2011-06-30 | Hitachi High-Technologies Corp | Image processing device, measuring/testing system, and program |
US20150078648A1 (en) * | 2013-09-13 | 2015-03-19 | National Cheng Kung University | Cell image segmentation method and a nuclear-to-cytoplasmic ratio evaluation method using the same |
WO2019192397A1 (en) * | 2018-04-04 | 2019-10-10 | 华中科技大学 | End-to-end recognition method for scene text in any shape |
CN110008950A (en) * | 2019-03-13 | 2019-07-12 | 南京大学 | A Shape-Robust Approach for Text Detection in Natural Scenes |
US20210034700A1 (en) * | 2019-07-29 | 2021-02-04 | Intuit Inc. | Region proposal networks for automated bounding box detection and text segmentation |
CN110766008A (en) * | 2019-10-29 | 2020-02-07 | 北京华宇信息技术有限公司 | Text detection method facing any direction and shape |
CN111145209A (en) * | 2019-12-26 | 2020-05-12 | 北京推想科技有限公司 | Medical image segmentation method, device, equipment and storage medium |
CN111738256A (en) * | 2020-06-02 | 2020-10-02 | 上海交通大学 | Composite CT image segmentation method based on improved watershed algorithm |
CN111798480A (en) * | 2020-07-23 | 2020-10-20 | 北京思图场景数据科技服务有限公司 | Character detection method and device based on single character and character connection relation prediction |
Non-Patent Citations (3)
Title |
---|
WENHAI WANG等: "Shape Robust Text Detection with Progressive Scale Expansion Network", 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), pages 9328 - 9337 * |
程序员阿德 [Programmer Ade]: "Classic image segmentation algorithms: the watershed algorithm", pages 1 - 7, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/67741538?utm_id=0> (Zhihu) *
运动小爽 [Yundong Xiaoshuang]: "Using watershed as the post-processing of PSENet", page 1, Retrieved from the Internet <URL:https://www.jianshu.com/p/ed750a1c488c?utm_campaign=maleskine&utm_content=note&utm_medium=seo_notes&utm_source=recommendation> (Jianshu) *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116630755A (en) * | 2023-04-10 | 2023-08-22 | 雄安创新研究院 | Method, system and storage medium for detecting text position in scene image |
CN116630755B (en) * | 2023-04-10 | 2024-04-02 | 雄安创新研究院 | Method, system and storage medium for detecting text position in scene image |
CN116935394A (en) * | 2023-07-27 | 2023-10-24 | 南京邮电大学 | Train carriage number positioning method based on PSENT region segmentation |
CN116935394B (en) * | 2023-07-27 | 2024-01-02 | 南京邮电大学 | Train carriage number positioning method based on PSENT region segmentation |
CN116863482A (en) * | 2023-09-05 | 2023-10-10 | 华立科技股份有限公司 | A transformer detection method, device, equipment and storage medium |
CN116863482B (en) * | 2023-09-05 | 2023-12-19 | 华立科技股份有限公司 | A transformer detection method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110930397B (en) | Magnetic resonance image segmentation method and device, terminal equipment and storage medium | |
EP3620979B1 (en) | Learning method, learning device for detecting object using edge image and testing method, testing device using the same | |
CN110428428B (en) | An image semantic segmentation method, electronic device and readable storage medium | |
Abdollahi et al. | Improving road semantic segmentation using generative adversarial network | |
CN113837168A (en) | Image text detection and OCR recognition method, device and storage medium | |
CN113111871B (en) | Training method and device of text recognition model, text recognition method and device | |
CN107480726A (en) | A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon | |
CN113139543B (en) | Training method of target object detection model, target object detection method and equipment | |
CN114419570B (en) | Point cloud data identification method and device, electronic equipment and storage medium | |
CN108280455B (en) | Human body key point detection method and apparatus, electronic device, program, and medium | |
EP3813661A1 (en) | Human pose analysis system and method | |
CN111369581A (en) | Image processing method, device, equipment and storage medium | |
CN108009554A (en) | A kind of image processing method and device | |
CN114821778A (en) | A method and device for dynamic recognition of underwater fish body posture | |
CN112991280B (en) | Visual detection method, visual detection system and electronic equipment | |
CN111899259A (en) | Prostate cancer tissue microarray classification method based on convolutional neural network | |
CN116152171A (en) | Intelligent construction target counting method, electronic equipment and storage medium | |
CN112991281B (en) | Visual detection method, system, electronic equipment and medium | |
CN112241736A (en) | Text detection method and device | |
CN111967408B (en) | Low-resolution pedestrian re-identification method and system based on prediction-recovery-identification | |
CN114511702A (en) | Remote sensing image segmentation method and system based on multi-scale weighted attention | |
Samudrala et al. | Semantic segmentation in medical image based on hybrid Dlinknet and UNet | |
CN116823761A (en) | Information processing methods, devices, equipment and storage media based on cell segmentation | |
CN117218481A (en) | Fish identification method, device, equipment and storage medium | |
CN116310832A (en) | Remote sensing image processing method, device, equipment, medium and product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||