CN112507782A - Text image recognition method and device
- Publication number: CN112507782A
- Application number: CN202011138322.4A
- Authority: CN (China)
- Prior art keywords: text, image, bounding box, boundary, box
- Legal status: Granted
Classifications
- G06V30/414: Extracting the geometrical structure, e.g. layout tree; block segmentation, e.g. bounding boxes for graphics or text
- G06F18/23: Clustering techniques
- G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
- G06N3/045: Combinations of networks
- G06N3/08: Learning methods
- G06T5/20: Image enhancement or restoration using local operators
- G06T5/70: Denoising; smoothing
- G06T5/90: Dynamic range modification of images or parts thereof
- G06T7/13: Edge detection
- G06V10/267: Segmentation of patterns in the image field by performing operations on regions
- G06V10/28: Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
- G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images
- G06V30/10: Character recognition
- G06V30/153: Segmentation of character regions using recognition of characters or words
- G06T2207/10024: Color image
- G06T2207/20024: Filtering details
- G06T2207/20081: Training; learning
- G06T2207/20084: Artificial neural networks [ANN]
- G06T2207/30176: Document
Abstract
The invention discloses a text image recognition method and device. The method comprises: preprocessing a text image bearing a target text to obtain a model input image; inputting the model input image into a predetermined character detection model for analysis to obtain a character detection result of the text image; clustering all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set; inputting the image in the target image region framed by each bounding box into a predetermined character recognition model for analysis to obtain the character recognition result of that bounding box; and determining the text recognition result of the text image according to the character recognition result, the coordinate information, and the bounding box cluster set to which each bounding box belongs. The invention can thus improve the recognition accuracy of text in a text image and ensure that characters on the same line of the target text are output as one line when the text recognition result is output.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a text image recognition method and apparatus.
Background
OCR (Optical Character Recognition) is a common technology in the field of image processing. It analyzes a text image containing text (e.g., an image of a printed matter such as a bill, a newspaper, or a book) through various analysis algorithms, so as to convert the text contained in the image into text information that is more convenient for a computer to store and process. OCR technology is currently applied in many scenarios where a large amount of printed material (e.g., enterprise qualification certificates, employee qualification certificates, tickets, and archive files) needs to be processed.
In practice, OCR technology usually identifies the ordinate of each character in the text image through an analysis algorithm. Characters on the same line of text usually have the same ordinate, so characters with equal ordinates can be output as one line of text. However, some characters in the text (for example, certain characters on some graduation certificates) exhibit an abnormal line-break phenomenon: their ordinate is larger than that of the characters on their own line but smaller than that of the characters on the next line, so when the text information is output they are easily output as a new line, producing incorrect text information. Improving the recognition accuracy of text in text images is therefore very important.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method and an apparatus for recognizing a text image, in which all characters are clustered according to coordinate information corresponding to each character, the characters in the same line in a target text are divided into a cluster set, and then a text recognition result of the text image is determined according to a character recognition result, coordinate information and the cluster set to which each character belongs, so that accuracy of recognizing the text in the text image can be improved, and it is ensured that the characters in the same line in the target text are output as the characters in the same line when the text recognition result is output.
In order to solve the above technical problem, a first aspect of the present invention discloses a method for recognizing a text image, including:
preprocessing a text image bearing a target text to obtain a model input image;
inputting the model input image into a predetermined character detection model for analysis to obtain a character detection result of the text image, wherein the character detection result comprises at least one boundary box used for selecting a target image area in the text image, each boundary box at least comprises coordinate information used for representing the position of the target image area in the text image, and the target image area refers to an image area where a single character in the target text is located in the text image;
clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box clustering set, wherein characters corresponding to all the bounding boxes in each bounding box clustering set are characters in the same line in the target text;
inputting the image in the target image area selected by each bounding box into a predetermined character recognition model for analysis to obtain a character recognition result of the bounding box;
and determining a text recognition result of the text image according to the character recognition result, the coordinate information and the boundary box cluster set of each boundary box.
The second aspect of the present invention discloses a text image recognition apparatus, comprising:
the preprocessing module is used for preprocessing the text image bearing the target text to obtain a model input image;
a first analysis module, configured to input the model input image to a predetermined character detection model for analysis, so as to obtain a character detection result of the text image, where the character detection result includes at least one bounding box used for framing a target image area in the text image, each bounding box at least includes coordinate information used for indicating a position of the target image area in the text image, and the target image area is an image area where a single character in the target text is located in the text image;
the clustering module is used for clustering all the boundary boxes according to the coordinate information of each boundary box to obtain at least one boundary box clustering set, wherein the characters corresponding to all the boundary boxes in each boundary box clustering set are the characters in the same line in the target text;
the second analysis module is used for inputting the image in the target image area selected by each bounding box into a predetermined character recognition model for analysis to obtain a character recognition result of the bounding box;
and the determining module is used for determining the text recognition result of the text image according to the character recognition result, the coordinate information and the boundary box clustering set of each boundary box.
The third aspect of the present invention discloses another text image recognition apparatus, including:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute part or all of the steps in the text image recognition method disclosed by the first aspect of the invention.
In a fourth aspect of the present invention, a computer storage medium is disclosed, which stores computer instructions for performing some or all of the steps of the method for recognizing text images disclosed in the first aspect of the present invention when the computer instructions are called.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
In the embodiment of the invention, the coordinate information corresponding to each character in the text image bearing the target text is analyzed; all characters are then clustered according to their coordinate information, so that characters on the same line of the target text are divided into one cluster set; the character recognition result corresponding to each character is then analyzed; and finally the text recognition result of the text image is determined according to each character's recognition result, coordinate information, and cluster set. Therefore, the method and device can improve the recognition accuracy of text in a text image and ensure that characters on the same line of the target text are output as one line when the text recognition result is output.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a flow chart illustrating a method for recognizing a text image according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating another text image recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for recognizing text images according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of another text image recognition apparatus disclosed in the embodiment of the present invention;
fig. 5 is a schematic structural diagram of another text image recognition apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," and the like in the description, claims, and drawings of the present invention are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, or article that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed or inherent to such a process, method, apparatus, or article.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The invention discloses a text image recognition method and device. The coordinate information corresponding to each character in a text image bearing a target text is analyzed; all characters are then clustered according to their coordinate information, so that characters on the same line of the target text are divided into one cluster set; the character recognition result corresponding to each character is then analyzed; and finally the text recognition result of the text image is determined according to each character's recognition result, coordinate information, and cluster set.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a text image recognition method according to an embodiment of the present invention. As shown in fig. 1, the method for recognizing a text image may include the following operations:
101. and preprocessing the text image carrying the target text to obtain a model input image.
In step 101, the text image carrying the target text may include a scanned image or a photograph of any one of the documents such as a graduation certificate, a professional certificate, an enterprise business license, and an enterprise seniority certificate. The pre-processing of the text image may include mean filtering, graying, binarization, alignment transformation, etc. Specifically, the preprocessing process for the text image in the embodiment of the present invention may be referred to the description in the subsequent embodiments.
102. And inputting the model input image into a predetermined character detection model for analysis to obtain a character detection result of the text image.
In step 102, the character detection result includes at least one bounding box for bounding out a target image area in the text image, where each bounding box at least includes coordinate information for indicating a position of the target image area in the text image, and the target image area refers to an image area where a single character in the target text is located in the text image.
Alternatively, the character detection model may be the deep learning model PixelLink, which abandons the bounding-box-regression approach to detecting text bounding boxes and instead uses instance segmentation to obtain the bounding boxes directly from the segmented text regions. The algorithm proceeds as follows:
(1) The deep learning model VGG16 is adopted as the feature extraction network, and its output is divided into two parts:
Pixel segmentation: judging whether each pixel point of the model input image is a text pixel or a non-text pixel;
Link prediction: performing link prediction on the eight neighborhoods of each pixel point of the model input image; if a link is positive, the neighboring pixel is merged into the same text region, otherwise it is discarded.
(2) By calculating the minimum bounding rectangle, a bounding rectangle frame with direction information (i.e. a boundary frame of the target image region) corresponding to each character in the model input image is extracted, and the bounding rectangle frame is represented by ((x, y), (w, h), theta) (i.e. coordinate information of the boundary frame), wherein (x, y) represents the center point coordinate of the bounding rectangle frame, (w, h) represents the width and height of the bounding rectangle frame, and theta represents the rotation angle of the bounding rectangle frame.
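As a sketch, the minimum-bounding-rectangle extraction in step (2) can be written with OpenCV as follows, assuming the pixel-segmentation and link-prediction outputs have already been merged into a binary text mask (this simplifies PixelLink's actual post-processing; the function name is illustrative):

```python
import cv2
import numpy as np

def extract_bounding_boxes(text_mask: np.ndarray):
    """Return one oriented rectangle ((x, y), (w, h), theta) per text region."""
    # Find the connected text regions in the merged pixel/link mask.
    contours, _ = cv2.findContours(
        text_mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    # cv2.minAreaRect gives the minimum circumscribed rectangle with direction
    # information: center point (x, y), size (w, h), and rotation angle theta.
    return [cv2.minAreaRect(contour) for contour in contours]
```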
103. And clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box clustering set.
In step 103, the words corresponding to all the bounding boxes in each bounding box cluster set are the words in the same line in the target text.
In an alternative embodiment, the coordinate information of each bounding box includes abscissa information and ordinate information of the bounding box, and the ordinate information of each bounding box includes maximum ordinate and minimum ordinate of the bounding box;
and clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box clustering set, wherein the clustering set comprises the following steps:
determining a vertical coordinate interval of each boundary frame according to the maximum vertical coordinate and the minimum vertical coordinate of each boundary frame;
judging whether an intersection exists between the vertical coordinate intervals of every two bounding boxes;
when the intersection exists between the vertical coordinate intervals of the two boundary frames, the two boundary frames are divided into the same boundary frame clustering set;
and when judging that no intersection exists between the vertical coordinate intervals of the two bounding boxes, dividing the two bounding boxes into different bounding box clustering sets.
In this alternative embodiment, after the center point coordinates, width, and height of each bounding box output by the deep learning model PixelLink are obtained, the maximum and minimum ordinates of the bounding box, and hence its ordinate interval, can be determined. For example, suppose there are three bounding boxes with center coordinates (10,10), (15,12), (20,20), each of width and height (2,2); the ordinate intervals of the three bounding boxes are calculated as [8,12], [10,14], [18,22]. The two bounding boxes with ordinate intervals [8,12] and [10,14] are then divided into the same bounding box cluster set, and the bounding box with ordinate interval [18,22] is divided into another bounding box cluster set.
Therefore, by implementing the optional embodiment, whether the two bounding boxes are located in the bounding boxes of the characters in the same row can be judged according to whether the intersection exists between the vertical coordinate intervals of the two bounding boxes, so that the bounding boxes of the characters in the same row can be divided into the clustering set of the same bounding box, and the clustering of the bounding boxes is realized.
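A minimal sketch of this interval-intersection clustering, assuming boxes in the ((x, y), (w, h), theta) form produced above and taking the ordinate interval as [y - h/2, y + h/2] (a simplified single-pass variant):

```python
def cluster_bounding_boxes(boxes):
    """Group bounding boxes whose ordinate intervals intersect into line clusters."""
    clusters = []  # each entry: (list of box indices, merged ordinate interval)
    for i, ((_, y), (_, h), _) in enumerate(boxes):
        lo, hi = y - h / 2, y + h / 2
        for idx, (members, (clo, chi)) in enumerate(clusters):
            if lo <= chi and clo <= hi:  # intervals intersect: same line
                members.append(i)
                clusters[idx] = (members, (min(clo, lo), max(chi, hi)))
                break
        else:  # no intersection with any existing cluster: start a new line
            clusters.append(([i], (lo, hi)))
    return [members for members, _ in clusters]
```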
104. And inputting the image in the target image area selected by each bounding box into a predetermined character recognition model for analysis to obtain a character recognition result of the bounding box.
In step 104, the character recognition model may be the deep learning model CRNN, which mainly comprises three types of layers, as follows:
(1) Convolutional layers: feature extraction is performed on the image in the target image region using convolutional layers, for example converting an image of size (32,100,3) into a convolutional feature matrix of size (1,25,512).
(2) Recurrent layers: a deep bidirectional LSTM network continues to extract character-sequence features on the basis of the convolutional feature matrix.
(3) Transcription layer: the RNN output is passed through the softmax activation function, and the character with the maximum probability is selected and output as the character recognition result.
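A heavily compressed PyTorch sketch of these three layer types is shown below; a real CRNN uses a deeper VGG-style convolutional stack and CTC decoding, so this only illustrates the structure, with sizes loosely following the (32,100,3) example:

```python
import torch.nn as nn

class MiniCRNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # (1) Convolutional layers: turn the image into a feature sequence.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        # (2) Recurrent layer: bidirectional LSTM over the width dimension.
        self.lstm = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
        # (3) Transcription layer: per-timestep class scores.
        self.fc = nn.Linear(2 * 256, num_classes)

    def forward(self, x):           # x: (batch, 3, 32, 100)
        f = self.conv(x)            # (batch, 128, 8, 25)
        f = f.permute(0, 3, 1, 2)   # (batch, 25, 128, 8): width as time axis
        f = f.flatten(2)            # sequence of 25 feature vectors
        out, _ = self.lstm(f)       # (batch, 25, 512)
        # Softmax over classes; the highest-probability character per step
        # is taken as the recognition result.
        return self.fc(out).softmax(dim=-1)
```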
105. And determining a text recognition result of the text image according to the character recognition result, the coordinate information, and the bounding box cluster set to which each bounding box belongs.
In another optional embodiment, determining the text recognition result of the text image according to the word recognition result, the coordinate information and the belonging bounding box cluster set of each bounding box comprises:
determining the text coordinates of each bounding box according to the bounding box cluster set to which the bounding box belongs and the coordinate information of the bounding box;
and determining a text recognition result of the text image according to the character recognition result and the text coordinates of each boundary box.
In this alternative embodiment, the text coordinates are used to indicate the position of the word corresponding to the bounding box in the target text, where the text coordinates include at least a text vertical coordinate and a text horizontal coordinate, and the text vertical coordinates of the bounding boxes belonging to the same bounding box cluster set are the same. For example, the text coordinate of the bounding box is (3,3), which indicates that the word corresponding to the bounding box is the third word in the third line of the target text. After determining the characters and the text coordinates in each boundary box, arranging and combining the recognized characters according to the corresponding text coordinates to obtain the text corresponding to the whole text image.
Therefore, by implementing the alternative embodiment, the text coordinates of the bounding box can be determined according to the coordinate information of the bounding box, and then the characters in the bounding box are sorted and combined according to the text coordinates, so that the recognition text of the whole text image is obtained.
In this alternative embodiment, it is further optional that the coordinate information of each bounding box includes abscissa information and ordinate information of the bounding box;
and determining the text coordinates of the bounding box according to the bounding box cluster set to which each bounding box belongs and the coordinate information of the bounding box, wherein the text coordinates comprise:
determining the text longitudinal coordinates of all the bounding boxes in the bounding box cluster set to which each bounding box belongs according to the longitudinal coordinate information of all the bounding boxes in the bounding box cluster set to which each bounding box belongs;
and determining the text horizontal coordinate of each boundary box in each boundary box cluster set according to the horizontal coordinate information of each boundary box in each boundary box cluster set.
In this further alternative embodiment, the abscissa and ordinate of the center-point coordinates of the circumscribed rectangular frame output by the deep learning model PixelLink may be taken as the abscissa information and ordinate information of the bounding box, respectively. Because the ordinates of characters on the same line of the target text usually differ little, while the ordinates of characters on different lines differ greatly, the text vertical coordinates of all bounding boxes in a cluster set can be determined from the ordinate information of those boxes. For example, suppose there are three bounding box cluster sets: in the first, the (abscissa, ordinate) information of the three bounding boxes is (10,10), (15,12), (20,8); in the second, (11,21), (14,18), (20,20); and in the third, (9,30), (16,28), (19,32). The ordinates of the boxes in the first cluster set are distributed around 10, those in the second around 20, and those in the third around 30, so, ordering the cluster sets by their approximate ordinates, the text vertical coordinates of the boxes in the first, second, and third cluster sets are 1, 2, and 3, respectively. Then, sorting the boxes within each cluster set by their abscissa information gives the text horizontal coordinates, so the text coordinates of the boxes (10,10), (15,12), (20,8) are sequentially (1,1), (2,1), (3,1); those of (11,21), (14,18), (20,20) are sequentially (1,2), (2,2), (3,2); and those of (9,30), (16,28), (19,32) are sequentially (1,3), (2,3), (3,3).
Therefore, by implementing the further optional embodiment, the bounding boxes can be sequenced according to the abscissa information and the ordinate information of the bounding boxes, so that the text horizontal coordinate and the text vertical coordinate of the bounding boxes are obtained.
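A sketch of this two-level sorting, assuming the box format and cluster index lists from the previous steps (names illustrative):

```python
def assign_text_coordinates(boxes, clusters):
    """Map each box index to (text_x, text_y): column and line in the target text."""
    # Order lines top to bottom by the mean ordinate of each cluster's boxes.
    rows = sorted(clusters,
                  key=lambda ms: sum(boxes[i][0][1] for i in ms) / len(ms))
    coords = {}
    for text_y, members in enumerate(rows, start=1):
        # Order characters left to right by abscissa within the line.
        ordered = sorted(members, key=lambda i: boxes[i][0][0])
        for text_x, i in enumerate(ordered, start=1):
            coords[i] = (text_x, text_y)
    return coords
```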
In this alternative embodiment, still further optionally, determining the text recognition result of the text image according to the text recognition result and the text coordinates of each bounding box includes:
determining an original text recognition result of the text image according to the character recognition result and the text coordinates of each boundary box;
determining a regular expression and a text template corresponding to the text image;
extracting key text information from an original text recognition result based on a regular expression;
and filling the key text information into the text template to obtain a text recognition result of the text image.
In this yet further alternative embodiment, a regular expression is a technical means used in the computer field to extract specific portions of a target text. When the text image is a scan or photo of a certificate such as a graduation certificate, professional qualification certificate, enterprise business license, or enterprise seniority certificate, the text on the certificate usually has a fixed format, and apart from certain key information the wording differs little between certificates. For a graduation certificate, for example, the key text information is the institution, the major, the name, and so on, while the remaining words are much the same on different graduation certificates. The fixed portion is therefore used as a text template: the key text information is extracted from the original text recognition result with a regular expression and filled into the template, yielding the text recognition result of the whole text image. This avoids the situation where a recognition error in the non-key portion makes the whole text recognition result wrong. Note that different types of text image have different regular expressions and text templates; for example, the regular expression for an enterprise business license extracts key text information such as the enterprise name and address, and its text template differs from that of a graduation certificate.
Therefore, by implementing the further optional embodiment, the key text information in the text image is extracted, and then the key text information is filled into the text template corresponding to the text image, so as to obtain the text recognition result of the whole text image, and the situation that the obtained text recognition result of the whole text image is also wrong when the characters of other parts are recognized in a wrong manner can be avoided, so that the recognition accuracy of the text image is improved.
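An illustrative sketch of the regular-expression extraction and template filling, also folding in the character-count check described in the next subsection; the pattern, template, and valid character counts below are hypothetical stand-ins, not values from the patent:

```python
import re

TEMPLATE = "Student {name}, majoring in {major}, graduated from {school}."
VALID_NAME_LENGTHS = {2, 3, 4}  # hypothetical "correct number of characters"

def fill_template(original_text: str) -> str:
    match = re.search(
        r"student\s+(?P<name>\S+).*?major\s+(?P<major>\S+).*?"
        r"school\s+(?P<school>\S+)",
        original_text, flags=re.S | re.I)
    if match is None:
        raise ValueError("key text information not found")
    info = match.groupdict()
    # Character-count check: if the field length is not a known-correct
    # count, fall back to user-supplied corrected key text information.
    if len(info["name"]) not in VALID_NAME_LENGTHS:
        info["name"] = input("corrected name: ")
    return TEMPLATE.format(**info)
```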
In this still further optional embodiment, still further optional, after extracting the key text information from the original text recognition result based on the regular expression, filling the key text information into the text template, and before obtaining the text recognition result of the text image, the text image recognition method further includes:
judging whether the number of characters contained in the key text information is equal to the predetermined correct number of characters;
and triggering and executing the step of filling the key text information into the text template to obtain a text recognition result of the text image when the number of the characters contained in the key text information is judged to be equal to the predetermined correct number of the characters.
In this still further alternative embodiment, for each text image, the corresponding number of correct characters is preset, for example, the number of correct characters corresponding to the text image of the graduation certificate may be 9, 10, 11, and the number of correct characters corresponding to the text image of the enterprise license may be 19, 20, 21, 22, etc. When the number of characters of the key text information is equal to the number of correct characters, it can be determined that the extracted key text information is correct.
Therefore, the further optional embodiment is implemented, whether the extracted key text information is correct or not is judged according to the number of the contained characters, and after the extracted key text information is determined to be correct, the extracted key text information is filled into the text template, so that the accuracy of the text recognition result can be improved.
In this still further optional embodiment, yet further optional, the method for recognizing a text image further includes:
when the number of characters contained in the key text information is judged to be not equal to the number of correct characters determined in advance, receiving corrected key text information input by a user;
and filling the corrected key text information into the text template to obtain a text recognition result of the text image.
In this still further optional embodiment, the corrected key text information may be input directly by the user according to the target text; that is, once the key text information is determined to have been extracted incorrectly, the user inputs the corrected key text information, which is then filled into the text template to obtain the text recognition result of the text image.
Therefore, by implementing this further optional embodiment, having the user input corrected key text information after an extraction error is detected, and filling that corrected information into the text template, can improve the accuracy of the text recognition result.
Therefore, implementing the text image recognition method described in fig. 1 improves the recognition accuracy of text in a text image and ensures that characters on the same line of the target text are output as one line when the text recognition result is output. Whether two bounding boxes frame characters on the same line is judged by whether their ordinate intervals intersect, so that the boxes of same-line characters are divided into the same cluster set. The text coordinates of each bounding box are determined from its coordinate information, and the characters in the boxes are then sorted and combined according to those coordinates to obtain the recognized text of the whole image; the boxes are ordered by their abscissa and ordinate information to obtain the text horizontal and vertical coordinates. The method also avoids the situation where misrecognized characters in the non-key portion make the whole text recognition result wrong, thereby improving the recognition accuracy of the text image.
Example two
Referring to fig. 2, fig. 2 is a schematic flow chart illustrating another text image recognition method according to an embodiment of the present invention. As shown in fig. 2, the method for recognizing a text image may include the following operations:
201. and carrying out mean filtering on the text image carrying the target text to obtain a filtered image.
In step 201, the process of mean filtering may be as follows:
(1) Set the size of the filter, which is generally odd. For example, the filter may be a (3,3) matrix containing 9 elements, each equal to 1; in an actual application scenario, the size and element values of the filter can be set according to the situation.
(2) Move the filter over the text image so that its center coincides with each pixel of the text image in turn, multiply the filter elements by the coinciding pixels, sum the products, and divide by the number of filter elements, which can be expressed as:
g(i, j) = (1/n) * Σ(k,l) f(i + k, j + l) * h(k, l)
In the formula, f(i + k, j + l) represents the pixel value at coordinate (i + k, j + l) in the pixel matrix of the image before denoising, g(i, j) represents the pixel value at coordinate (i, j) in the pixel matrix of the image after denoising, and h(k, l) is the filter matrix, which contains n elements (the sum runs over the n filter positions).
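A sketch of this step with OpenCV, using the (3, 3) all-ones filter normalized by its element count n = 9:

```python
import cv2
import numpy as np

def mean_filter(image: np.ndarray, ksize: int = 3) -> np.ndarray:
    # All-ones kernel divided by the number of elements n = ksize * ksize,
    # slid over the image; equivalent to cv2.blur(image, (ksize, ksize)).
    kernel = np.ones((ksize, ksize), np.float32) / (ksize * ksize)
    return cv2.filter2D(image, -1, kernel)
```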
202. And carrying out graying processing on the filtered image to obtain a grayscale image.
In step 202, image graying means the following: in an image of the RGB color model, if the values of the R, G, and B channels of each pixel point are the same, the whole image appears gray. The common value m of the R, G, and B channels is called the gray value. Two methods are commonly used for graying:
The first method: m = (R + G + B)/3
The second method: m = 0.3R + 0.59G + 0.11B
In the embodiment of the present invention, the second method is preferred for the graying processing.
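A sketch of the preferred weighted method, assuming an OpenCV-style array with channels in B, G, R order:

```python
import numpy as np

def to_gray(image: np.ndarray) -> np.ndarray:
    # m = 0.3R + 0.59G + 0.11B (the second, preferred graying method).
    b, g, r = image[..., 0], image[..., 1], image[..., 2]
    return (0.3 * r + 0.59 * g + 0.11 * b).astype(np.uint8)
```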
203. And carrying out binarization processing on the gray level image to obtain a binarized image.
In an optional embodiment, the binarizing processing on the grayscale image to obtain a binarized image includes:
dividing the grayscale image into a plurality of grayscale image regions;
determining a binarization threshold corresponding to each gray image area according to the gray values of all pixel points in each gray image area;
and carrying out binarization processing on the image in each gray level image area according to the binarization threshold corresponding to each gray level image area to obtain a binarization image of the whole gray level image.
In this alternative embodiment, the process of binarizing the image may be understood as: and resetting the gray value of each pixel point according to whether the gray value of each pixel point in the image is greater than the binarization threshold, setting the gray value of each pixel point to be 255 (namely, the pixel point is white) when the gray value of each pixel point is greater than the binarization threshold, and setting the gray value of each pixel point to be 0 (namely, the pixel point is black) when the gray value of each pixel point is less than the binarization threshold, so that the whole image is changed into a black-and-white image. In addition, since different parts in the same image may have different brightness, when a uniform binarization threshold is applied to the same image, the effect of the obtained binarized image is generally not ideal. In this case, an adaptive binarization method is used, that is, different binarization threshold values are used for different regions of the same image, so that a binarized image with a better effect can be obtained.
Therefore, by implementing the optional embodiment, different binarization threshold values can be used for different areas of the same image, so that a binarization image with a better effect can be obtained.
In this alternative embodiment, it is further optional that each grayscale image region is a rectangular region;
and determining a binarization threshold corresponding to each gray image area according to the gray values of all pixel points in each gray image area, wherein the binarization threshold comprises the following steps:
calculating the binarization threshold corresponding to each grayscale image region by the following formula:
thr = (1/n) * Σ(i = a .. a + k - 1) Σ(j = b .. b + l - 1) f(i, j)
In the formula, a and b represent the abscissa and ordinate of the corner pixel point at which each grayscale image region starts in the grayscale image, k and l represent the numbers of pixel points the region occupies in the horizontal and vertical directions of the grayscale image, f(i, j) represents the gray value of the pixel point with coordinate (i, j) in the grayscale image, n = k * l represents the total number of pixel points in the region, and thr represents the binarization threshold corresponding to the region.
Therefore, by implementing the further optional embodiment, the average value of the gray values of all the pixel points in each gray image area is used as the binarization threshold, so that a binarization image with a better effect can be obtained.
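A sketch of this region-wise binarization; the region size is an illustrative choice:

```python
import numpy as np

def adaptive_binarize(gray: np.ndarray, region: int = 32) -> np.ndarray:
    out = np.zeros_like(gray)
    height, width = gray.shape
    for top in range(0, height, region):
        for left in range(0, width, region):
            block = gray[top:top + region, left:left + region]
            thr = block.mean()  # the per-region threshold thr defined above
            # Pixels above the threshold become white (255), the rest black (0).
            out[top:top + region, left:left + region] = np.where(block > thr, 255, 0)
    return out
```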
204. And calculating the first vertex coordinates of the binary image based on a predetermined image edge detection algorithm.
In step 204, the image edge detection algorithm may be the Canny edge detection algorithm, which is used to extract three vertex coordinates of the target text in the text image (the first vertex coordinates):
A(x1, y1), B(x2, y2), C(x3, y3)
According to the Canny edge detection algorithm, the coordinate points in the picture pixel matrix of the text image are convolved with the Sobel (or another) operator to obtain the gradient values g_x(m, n) and g_y(m, n) in the two directions, which are then combined into a gradient magnitude and gradient direction:
G(m, n) = sqrt(g_x(m, n)^2 + g_y(m, n)^2), θ(m, n) = arctan(g_y(m, n) / g_x(m, n))
Edge pixel points are then screened using an upper threshold and a lower threshold. The screening rule is: set two thresholds, an upper threshold maxVal and a lower threshold minVal. Pixel points whose gradient magnitude is greater than maxVal are all detected as edges, and those below minVal are all detected as non-edges. A pixel point in between is judged to be an edge if it is adjacent to a pixel point already determined to be an edge; otherwise it is judged to be a non-edge.
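OpenCV's Canny implements the Sobel gradients and the maxVal/minVal hysteresis screening described above; a minimal sketch with illustrative threshold values:

```python
import cv2
import numpy as np

def detect_edges(binary_image: np.ndarray,
                 min_val: float = 50, max_val: float = 150) -> np.ndarray:
    # min_val / max_val are the lower (minVal) and upper (maxVal) thresholds.
    return cv2.Canny(binary_image, min_val, max_val)
```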
205. Second vertex coordinates are determined from the predetermined aligned image.
In step 205, the aligned image is a predetermined reference image with a specified geometric shape; the binarized image is rectified by aligning it to this image, and the new binarized image obtained after alignment has the same geometric shape. Three vertex coordinates of the aligned image (the second vertex coordinates) can be determined from its long and wide sides, and are expressed as:
A′(x′1,y′1),B′(x′2,y′2),C′(x′3,y′3)
206. and determining an affine transformation matrix according to the first vertex coordinates and the second vertex coordinates.
In the above step 206, the affine transformation matrix M is the 2 x 3 matrix
M = [m11 m12 m13; m21 m22 m23]
whose six elements are solved from the three vertex correspondences A to A', B to B', and C to C'.
207. and carrying out affine transformation on the binary image according to the affine transformation matrix to obtain a model input image.
In step 207 above, the affine transformation can be represented as:
G′=M*G
g' represents a picture pixel matrix after affine transformation, and G represents a picture pixel matrix before affine transformation.
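A sketch of steps 205 to 207 with OpenCV: the 2 x 3 affine matrix M is solved from the three vertex correspondences and applied to the binarized image (vertex lists and output size come from the preceding steps):

```python
import cv2
import numpy as np

def align_image(binary_image, first_vertices, second_vertices, out_size):
    src = np.float32(first_vertices)      # A, B, C from edge detection
    dst = np.float32(second_vertices)     # A', B', C' of the aligned image
    M = cv2.getAffineTransform(src, dst)  # the 2 x 3 affine transformation matrix
    return cv2.warpAffine(binary_image, M, out_size)  # G' = M * G
```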
208. And inputting the model input image into a predetermined character detection model for analysis to obtain a character detection result of the text image.
209. And clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box clustering set.
210. And inputting the image in the target image area selected by each bounding box into a predetermined character recognition model for analysis to obtain a character recognition result of the bounding box.
211. And determining a text recognition result of the text image according to the character recognition result, the coordinate information, and the bounding box cluster set to which each bounding box belongs.
For the detailed description of the above steps 208 to 211, reference may be made to the detailed description of the above steps 102 to 105, which is not described in detail here.
In another optional embodiment, the bounding box further comprises geometric information of the target image region, the geometric information comprising a pixel width and/or a pixel length and/or a pixel area of the target image region;
and after the model input image is input into a predetermined character detection model for analysis to obtain a character detection result of the text image, and before all the bounding boxes are clustered according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, the method for identifying the text image further comprises the following steps:
when the geometric information comprises the pixel width of the target image area, selecting the target pixel width from all the pixel widths;
removing the corresponding boundary frames with the pixel width less than or equal to the target pixel width from the character detection result, and triggering and executing the step of clustering all the boundary frames according to the coordinate information of each boundary frame to obtain at least one boundary frame clustering set;
when the geometric information comprises the pixel length of the target image area, selecting the target pixel length from all the pixel lengths;
removing the corresponding boundary frames with the pixel length less than or equal to the target pixel length from the character detection result, and triggering and executing the step of clustering all the boundary frames according to the coordinate information of each boundary frame to obtain at least one boundary frame clustering set;
when the geometric information comprises the pixel area of the target image area, selecting the target pixel area from all the pixel areas;
and removing the corresponding boundary boxes with the pixel areas smaller than or equal to the target pixel area from the character detection result, and triggering and executing the step of clustering all the boundary boxes according to the coordinate information of each boundary box to obtain at least one boundary box clustering set.
In this alternative embodiment, the bounding rectangle (i.e. the bounding box) output by the deep learning model PixelLink is sometimes wrong, so that the bounding box output by the deep learning model PixelLink needs to be filtered, so as to ensure the accuracy of the final text recognition result. In particular, filtering the bounding box using the geometric features of the target image region (pixel width, pixel length, pixel area) is a simple and efficient method. For example, the pixel widths of all target image regions are sorted from large to small, then the pixel width sorted at 99% is selected as the target pixel width, and assuming that the target pixel width is 10 pixels, the bounding box with the pixel width smaller than 10 pixels is deleted, so as to implement filtering of the bounding box. At this time, of all the bounding boxes output by the deep learning model PixelLink, the bounding box with the pixel width smaller than 10 pixels is generally an error bounding box, so that it needs to be deleted, and the bounding box sorted in the first 99% is generally a valid bounding box.
It can be seen that, by implementing this alternative embodiment, the erroneous bounding box can be removed from the text detection result according to the geometric features of the bounding box, thereby improving the accuracy of the obtained text recognition result.
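A sketch of the width case (length and area are filtered the same way); the 99% retention fraction follows the example above:

```python
import numpy as np

def filter_boxes_by_width(boxes, keep_fraction: float = 0.99):
    """Drop bounding boxes whose pixel width is at or below the target width."""
    widths = np.array([w for (_, _), (w, _), _ in boxes])
    # The target pixel width: the value at the bottom of the kept fraction,
    # i.e. the 1st percentile when 99% of the (descending) widths are kept.
    target_width = np.percentile(widths, (1 - keep_fraction) * 100)
    return [box for box, w in zip(boxes, widths) if w > target_width]
```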
It can be seen that, by implementing the text image recognition method described in fig. 2, when the text image is preprocessed, different binarization thresholds can be used for different regions of the same image, and the average value of the grayscale values of all the pixel points in each grayscale image region is used as the binarization threshold, so that a binarized image with a better effect can be obtained. And the wrong boundary box can be removed from the character detection result according to the geometric characteristics of the boundary box, so that the accuracy of the obtained text recognition result is improved.
Example three
Referring to fig. 3, fig. 3 is a schematic structural diagram of a text image recognition apparatus according to an embodiment of the present invention. As shown in fig. 3, the text image recognition apparatus may include:
the preprocessing module 301 is configured to preprocess a text image carrying a target text to obtain a model input image;
a first analysis module 302, configured to input the model input image into a predetermined character detection model for analysis, so as to obtain a character detection result of the text image, where the character detection result includes at least one bounding box used for framing a target image area in the text image, each bounding box at least includes coordinate information used for indicating a position of the target image area in the text image, and the target image area is an image area where a single character in the target text is located in the text image;
the clustering module 303 is configured to cluster all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, where words corresponding to all the bounding boxes in each bounding box cluster set are words in the same line in the target text;
a second analysis module 304, configured to input an image in the target image region selected by each bounding box into a predetermined character recognition model for analysis, so as to obtain a character recognition result of the bounding box;
the determining module 305 is configured to determine a text recognition result of the text image according to the character recognition result, the coordinate information, and the belonging bounding box cluster set of each bounding box.
It can be seen that, with the text image recognition apparatus described in fig. 3, coordinate information corresponding to each character in a text image carrying a target text is analyzed, then all characters are clustered according to the coordinate information corresponding to each character, the characters in the same line in the target text are divided into a cluster set, then a character recognition result corresponding to each character is analyzed, and finally a text recognition result of the text image is determined according to the character recognition result, the coordinate information, and the cluster set to which the characters belong of each character, so that the recognition accuracy of the text in the text image can be improved, and it is ensured that the characters in the same line in the target text are output as the same line of characters when the text recognition result is output.
In an alternative embodiment, the determining module 305 determines the text recognition result of the text image according to the word recognition result, the coordinate information, and the belonging bounding box cluster set of each bounding box in the following specific manner:
determining text coordinates of each boundary box according to the boundary box cluster set to which each boundary box belongs and coordinate information of the boundary box, wherein the text coordinates are used for indicating the position of characters corresponding to the boundary box in a target text, the text coordinates at least comprise text longitudinal coordinates and text transverse coordinates, and the text longitudinal coordinates of the boundary boxes belonging to the same boundary box cluster set are the same;
and determining a text recognition result of the text image according to the character recognition result and the text coordinates of each boundary box.
It can be seen that, by implementing the text image recognition apparatus described in fig. 4, the text coordinates of each bounding box can be determined from its coordinate information, and the characters in the bounding boxes can then be sorted and combined according to those text coordinates to obtain the recognized text of the entire text image.
In this optional embodiment, it is further optional that the coordinate information of each bounding box includes abscissa information and ordinate information of the bounding box;
and the specific way that the determining module 305 determines the text coordinate of each bounding box according to the bounding box cluster set to which the bounding box belongs and the coordinate information of the bounding box is as follows:
determining the text longitudinal coordinates of all the bounding boxes in the bounding box cluster set to which each bounding box belongs according to the longitudinal coordinate information of all the bounding boxes in the bounding box cluster set to which each bounding box belongs;
and determining the text horizontal coordinate of each boundary box in each boundary box cluster set according to the horizontal coordinate information of each boundary box in each boundary box cluster set.
Therefore, by implementing the text image recognition device described in fig. 4, the bounding boxes can be sorted according to the abscissa information and the ordinate information of the bounding boxes, so as to obtain the text horizontal coordinates and the text vertical coordinates of the bounding boxes.
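For illustration only, a minimal Python sketch of this sorting step follows; the cluster structure and the box format (x_min, y_min, x_max, y_max, char) are assumptions made for the example, not details fixed by the patent:

```python
def boxes_to_text(line_clusters):
    """Turn clustered character bounding boxes into recognized text.

    line_clusters: list of clusters; each cluster is a list of
    (x_min, y_min, x_max, y_max, char) tuples belonging to one text line.
    """
    # Text ordinate: order clusters top-to-bottom, so every box in a
    # cluster shares the same line (row) position.
    ordered = sorted(line_clusters, key=lambda cl: min(b[1] for b in cl))
    lines = []
    for cluster in ordered:
        # Text abscissa: order boxes within a line left-to-right by x_min,
        # then concatenate their recognized characters.
        lines.append("".join(b[4] for b in sorted(cluster, key=lambda b: b[0])))
    return "\n".join(lines)
```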
In this further optional embodiment, yet further optional, the ordinate information of each bounding box includes the maximum ordinate and the minimum ordinate of the bounding box;
the specific way of clustering all the bounding boxes by the clustering module 303 according to the coordinate information of each bounding box to obtain at least one bounding box cluster set is as follows:
determining a vertical coordinate interval of each boundary frame according to the maximum vertical coordinate and the minimum vertical coordinate of each boundary frame;
judging whether an intersection exists between the vertical coordinate intervals of every two bounding boxes;
when the intersection exists between the vertical coordinate intervals of the two boundary frames, the two boundary frames are divided into the same boundary frame clustering set;
and when judging that no intersection exists between the vertical coordinate intervals of the two bounding boxes, dividing the two bounding boxes into different bounding box clustering sets.
Therefore, by implementing the text image recognition apparatus described in fig. 4, whether two bounding boxes frame characters in the same line can be determined from whether their ordinate intervals intersect, so that the bounding boxes of characters in the same line are divided into the same bounding box cluster set, realizing the clustering of the bounding boxes.
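A possible single-pass sketch of this clustering in Python (a greedy simplification of the pairwise intersection test described above; the box format is assumed):

```python
def cluster_boxes_by_line(boxes):
    """Group character bounding boxes into text lines by ordinate overlap.

    boxes: list of (x_min, y_min, x_max, y_max) tuples. Two boxes join
    the same cluster when their [y_min, y_max] intervals intersect.
    """
    clusters = []  # each entry keeps its boxes and the merged y-interval
    for box in sorted(boxes, key=lambda b: b[1]):
        x0, y0, x1, y1 = box
        for cluster in clusters:
            # Interval intersection test against the cluster's y-interval.
            if y0 <= cluster["y_max"] and y1 >= cluster["y_min"]:
                cluster["boxes"].append(box)
                cluster["y_min"] = min(cluster["y_min"], y0)
                cluster["y_max"] = max(cluster["y_max"], y1)
                break
        else:  # no intersection with any existing cluster: start a new line
            clusters.append({"boxes": [box], "y_min": y0, "y_max": y1})
    return [c["boxes"] for c in clusters]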
In another alternative embodiment, the determining module 305 determines the text recognition result of the text image according to the text recognition result and the text coordinates of each bounding box in a specific manner as follows:
determining an original text recognition result of the text image according to the character recognition result and the text coordinates of each boundary box;
determining a regular expression and a text template corresponding to the text image;
extracting key text information from an original text recognition result based on a regular expression;
and filling the key text information into the text template to obtain a text recognition result of the text image.
It can be seen that, with the text image recognition apparatus described in fig. 4, the text recognition result of the whole text image is obtained by extracting the key text information from the image and filling it into the corresponding text template. This avoids the situation in which a recognition error in some other, irrelevant part of the image makes the text recognition result of the whole image wrong, thereby improving the recognition accuracy of the text image.
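As a sketch of this extract-and-fill step, assuming a hypothetical invoice-style pattern and template (neither is specified by the patent):

```python
import re

# Hypothetical regular expression and text template for illustration.
PATTERN = re.compile(
    r"Invoice No[.:]?\s*(?P<invoice_no>\w+).*?Amount[.:]?\s*(?P<amount>[\d.]+)",
    re.S,
)
TEMPLATE = "Invoice No: {invoice_no}\nAmount: {amount}"

def extract_and_fill(raw_text):
    """Extract key text information and fill it into the text template."""
    match = PATTERN.search(raw_text)
    if match is None:
        return None  # key information not found in the original result
    return TEMPLATE.format(**match.groupdict())

print(extract_and_fill("Invoice No: AB123 ... Amount: 42.50"))
```

Because only the extracted fields reach the template, noise elsewhere in the raw recognition result cannot corrupt the final output.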
In yet another alternative embodiment, the preprocessing module 301 preprocesses the text image carrying the target text to obtain the model input image in a specific manner:
carrying out mean filtering on the text image bearing the target text to obtain a filtered image;
carrying out graying processing on the filtered image to obtain a grayscale image;
carrying out binarization processing on the gray level image to obtain a binarized image;
calculating a first vertex coordinate of the binary image based on a predetermined image edge detection algorithm;
determining a second vertex coordinate according to a predetermined alignment image;
determining an affine transformation matrix according to the first vertex coordinates and the second vertex coordinates;
and carrying out affine transformation on the binary image according to the affine transformation matrix to obtain a model input image.
It can be seen that the text image recognition apparatus described in fig. 4 can preprocess text images in a variety of ways.
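A rough OpenCV sketch of this preprocessing chain, under stated assumptions: Otsu thresholding stands in for the patent's per-region mean threshold, the document corners are estimated from the largest contour, and the alignment-image size and corner correspondence are illustrative choices:

```python
import cv2
import numpy as np

def preprocess(image_bgr, aligned_w=800, aligned_h=600):
    # Mean filtering to suppress noise.
    filtered = cv2.blur(image_bgr, (5, 5))
    # Graying.
    gray = cv2.cvtColor(filtered, cv2.COLOR_BGR2GRAY)
    # Binarization (global Otsu here for brevity; the patent's per-region
    # mean threshold is described below).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Edge detection to estimate the first vertex coordinates.
    edges = cv2.Canny(binary, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    rect = cv2.minAreaRect(max(contours, key=cv2.contourArea))
    # Three corners suffice to define an affine transformation; matching
    # their order to the alignment corners is simplified here.
    first_vertices = cv2.boxPoints(rect)[:3].astype(np.float32)
    second_vertices = np.float32([[0, aligned_h], [0, 0], [aligned_w, 0]])
    matrix = cv2.getAffineTransform(first_vertices, second_vertices)
    # Affine transformation yields the model input image.
    return cv2.warpAffine(binary, matrix, (aligned_w, aligned_h))
```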
In this further optional embodiment, further optionally, the preprocessing module 301 performs binarization processing on the grayscale image, and a specific manner of obtaining the binarized image is as follows:
dividing the grayscale image into a plurality of grayscale image regions;
determining a binarization threshold corresponding to each gray image area according to the gray values of all pixel points in each gray image area;
and carrying out binarization processing on the image in each gray level image area according to the binarization threshold corresponding to each gray level image area to obtain a binarization image of the whole gray level image.
It can be seen that, by implementing the text image recognition apparatus described in fig. 4, different binarization threshold values can be used for different areas of the same image, so that a binarized image with a better effect can be obtained.
In this further alternative embodiment, yet further alternatively, each of the grayscale image regions is a rectangular region;
and the specific way for the preprocessing module 301 to determine the binarization threshold corresponding to each gray image region according to the gray values of all the pixel points in each gray image region is as follows:
calculating a binarization threshold value corresponding to each gray level image area by the following formula:
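(The formula image is not reproduced in this text; reconstructed from the variable definitions that follow, it is presumably the regional mean:)

$$thr = \frac{1}{n}\sum_{i=a}^{a+k-1}\;\sum_{j=b}^{b+l-1} f(i,j), \qquad n = k \times l$$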
in the formula, a represents the abscissa of a lower left corner pixel point of each gray image area in a gray image, b represents the ordinate of a lower right corner pixel point of each gray image area in the gray image, k represents the number of pixel points occupied by each gray image area in the transverse direction of the gray image, l represents the number of pixel points occupied by each gray image area in the longitudinal direction of the gray image, f (i, j) represents the gray value of a pixel point with the coordinate (i, j) in the gray image, n represents the total number of pixel points occupied by each gray image area, and thr represents the binarization threshold corresponding to each gray image area.
Therefore, by implementing the text image recognition device described in fig. 4, the average value of the gray values of all the pixel points in each gray image region is used as the binarization threshold, so that a binarization image with a better effect can be obtained.
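A minimal NumPy sketch of this per-region thresholding (the 32x32 tile size is an assumed choice, not from the patent):

```python
import numpy as np

def binarize_by_region(gray, region_h=32, region_w=32):
    """Binarize a grayscale image tile by tile, each tile thresholded
    at the mean gray value of its own pixels."""
    binary = np.zeros_like(gray)
    h, w = gray.shape
    for top in range(0, h, region_h):
        for left in range(0, w, region_w):
            tile = gray[top:top + region_h, left:left + region_w]
            thr = tile.mean()  # per-region binarization threshold
            binary[top:top + region_h,
                   left:left + region_w] = np.where(tile > thr, 255, 0)
    return binary
```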
In yet another alternative embodiment, the bounding box further comprises geometric information of the target image region, the geometric information comprising a pixel width and/or a pixel length and/or a pixel area of the target image region;
and, the recognition device of the text image further comprises:
a selecting module 306, configured to, when the geometric information includes the pixel width of the target image region, select a target pixel width from all the pixel widths after the first analysis module 302 inputs the model input image into the predetermined character detection model for analysis to obtain the character detection result of the text image, and before the clustering module 303 clusters all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set;
a removing module 307, configured to remove, from the character detection result, the bounding boxes whose pixel width is less than or equal to the target pixel width, and to trigger the clustering module 303 to perform the above operation of clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set;
the selecting module 306 is further configured to, when the geometric information includes the pixel length of the target image region, select a target pixel length from all the pixel lengths at the same stage, that is, after the character detection result is obtained and before the clustering is performed;
the removing module 307 is further configured to remove, from the character detection result, the bounding boxes whose pixel length is less than or equal to the target pixel length, and to trigger the clustering module 303 to perform the above clustering operation;
the selecting module 306 is further configured to, when the geometric information includes the pixel area of the target image region, select a target pixel area from all the pixel areas at the same stage;
and the removing module 307 is further configured to remove, from the character detection result, the bounding boxes whose pixel area is less than or equal to the target pixel area, and to trigger the clustering module 303 to perform the above clustering operation.
It can be seen that the implementation of the text image recognition apparatus described in fig. 4 can remove the wrong bounding box from the character detection result according to the geometric features of the bounding box, thereby improving the accuracy of the obtained text recognition result.
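For illustration, a sketch of the area-based variant of this filtering; how the target area is "selected from all the pixel areas" is left open by the patent, so the low-quantile choice below is an assumption:

```python
def filter_boxes_by_area(boxes, quantile=0.1):
    """Remove bounding boxes whose pixel area is at or below a target area.

    boxes: list of (x_min, y_min, x_max, y_max) tuples. The target area is
    picked as a low quantile of the observed areas (assumed heuristic).
    """
    if not boxes:
        return []
    areas = sorted((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
    target = areas[int(len(areas) * quantile)]
    return [(x0, y0, x1, y1) for x0, y0, x1, y1 in boxes
            if (x1 - x0) * (y1 - y0) > target]
```

The width- and length-based variants follow the same pattern with the box width or height in place of the area.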
For the specific description of the text image recognition apparatus, reference may be made to the specific description of the text image recognition method, which is not repeated herein.
Example Four
Referring to fig. 5, fig. 5 is a schematic structural diagram of a device for recognizing a text image according to another embodiment of the present invention. As shown in fig. 5, the apparatus may include:
a memory 501 in which executable program code is stored;
a processor 502 coupled to a memory 501;
the processor 502 calls the executable program code stored in the memory 501 to execute the steps of the text image recognition method disclosed in the first embodiment or the second embodiment of the present invention.
Example Five
The embodiment of the invention discloses a computer storage medium storing computer instructions which, when called, perform the steps of the text image recognition method disclosed in the first or second embodiment of the present invention.
The above-described embodiments of the apparatus are merely illustrative, and the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above detailed description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a necessary general hardware platform, or by hardware. Based on such understanding, the above technical solutions may be embodied in the form of a software product stored in a computer-readable storage medium, where the storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, a magnetic disk memory, a tape memory, or any other computer-readable medium that can be used to carry or store data.
Finally, it should be noted that: the text image recognition method and apparatus disclosed in the embodiments of the present invention are only preferred embodiments of the present invention and are used only to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (11)
1. A method for recognizing a text image, the method comprising:
preprocessing a text image bearing a target text to obtain a model input image;
inputting the model input image into a predetermined character detection model for analysis to obtain a character detection result of the text image, wherein the character detection result comprises at least one boundary box used for selecting a target image area in the text image, each boundary box at least comprises coordinate information used for representing the position of the target image area in the text image, and the target image area refers to an image area where a single character in the target text is located in the text image;
clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box clustering set, wherein characters corresponding to all the bounding boxes in each bounding box clustering set are characters in the same line in the target text;
inputting the image in the target image area selected by each bounding box into a predetermined character recognition model for analysis to obtain a character recognition result of the bounding box;
and determining a text recognition result of the text image according to the character recognition result, the coordinate information and the boundary box cluster set of each boundary box.
2. The method for recognizing the text image according to claim 1, wherein the determining the text recognition result of the text image according to the character recognition result, the coordinate information and the belonging bounding box cluster set of each bounding box comprises:
determining text coordinates of each boundary box according to the boundary box cluster set to which the boundary box belongs and coordinate information of the boundary box, wherein the text coordinates are used for representing the position of characters corresponding to the boundary box in the target text, the text coordinates at least comprise text longitudinal coordinates and text transverse coordinates, and the text longitudinal coordinates of the boundary boxes belonging to the same boundary box cluster set are the same;
and determining a text recognition result of the text image according to the character recognition result and the text coordinate of each boundary box.
3. The method according to claim 2, wherein the coordinate information of each of the bounding boxes includes abscissa information and ordinate information of the bounding box;
and determining the text coordinates of the bounding box according to the bounding box cluster set to which each bounding box belongs and the coordinate information of the bounding box, including:
determining the text longitudinal coordinates of all the bounding boxes in the bounding box cluster set to which the bounding box belongs according to the longitudinal coordinate information of all the bounding boxes in the bounding box cluster set to which the bounding box belongs;
and determining the text horizontal coordinate of each boundary box in the boundary box cluster set according to the horizontal coordinate information of each boundary box in each boundary box cluster set.
4. The method according to claim 3, wherein the ordinate information of each of the bounding boxes includes a maximum ordinate and a minimum ordinate of the bounding box;
and clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box clustering set, wherein the clustering set comprises the following steps:
determining a vertical coordinate interval of each boundary frame according to the maximum vertical coordinate and the minimum vertical coordinate of each boundary frame;
judging whether an intersection exists between the vertical coordinate intervals of every two bounding boxes;
when the intersection exists between the vertical coordinate intervals of the two boundary frames, the two boundary frames are divided into the same boundary frame clustering set;
and when judging that no intersection exists between the vertical coordinate intervals of the two bounding boxes, dividing the two bounding boxes into different bounding box clustering sets.
5. The method for recognizing text images according to any one of claims 2 to 4, wherein the determining the text recognition result of the text image according to the character recognition result and the text coordinates of each of the bounding boxes comprises:
determining an original text recognition result of the text image according to the character recognition result and the text coordinates of each boundary box;
determining a regular expression and a text template corresponding to the text image;
extracting key text information from the original text recognition result based on the regular expression;
and filling the key text information into the text template to obtain a text recognition result of the text image.
6. The method for recognizing the text image according to claim 1, wherein the preprocessing the text image carrying the target text to obtain a model input image comprises:
carrying out mean filtering on the text image bearing the target text to obtain a filtered image;
carrying out graying processing on the filtered image to obtain a grayscale image;
carrying out binarization processing on the gray level image to obtain a binarized image;
calculating a first vertex coordinate of the binary image based on a predetermined image edge detection algorithm;
determining a second vertex coordinate according to a predetermined alignment image;
determining an affine transformation matrix according to the first vertex coordinates and the second vertex coordinates;
and carrying out affine transformation on the binary image according to the affine transformation matrix to obtain a model input image.
7. The method for recognizing the text image according to claim 6, wherein the binarizing the grayscale image to obtain a binarized image includes:
dividing the grayscale image into a plurality of grayscale image regions;
determining a binarization threshold corresponding to each gray image area according to the gray values of all pixel points in each gray image area;
and carrying out binarization processing on the image in each gray level image area according to a binarization threshold corresponding to each gray level image area to obtain a binarization image of the whole gray level image.
8. The method according to claim 7, wherein each of the grayscale image regions is a rectangular region;
and determining a binarization threshold corresponding to each gray image area according to the gray values of all pixel points in each gray image area, wherein the determining comprises the following steps:
calculating a binarization threshold value corresponding to each gray level image area through the following formula:
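(As in the description above, the claimed formula is not reproduced in this text; a reconstruction consistent with the definitions that follow is $thr = \frac{1}{n}\sum_{i=a}^{a+k-1}\sum_{j=b}^{b+l-1} f(i,j)$ with $n = k \times l$.)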
in the formula, a represents the abscissa of a lower left corner pixel point of each gray image area in the gray image, b represents the ordinate of a lower right corner pixel point of each gray image area in the gray image, k represents the number of pixel points occupied by each gray image area in the transverse direction of the gray image, l represents the number of pixel points occupied by each gray image area in the longitudinal direction of the gray image, f (i, j) represents the gray value of a pixel point with a coordinate of (i, j) in the gray image, n represents the total number of pixel points occupied by each gray image area, and thr represents the binarization threshold corresponding to each gray image area.
9. The method according to claim 8, wherein the bounding box further comprises geometric information of the target image region, the geometric information comprising a pixel width and/or a pixel length and/or a pixel area of the target image region;
and after the model input image is input into a predetermined character detection model for analysis to obtain a character detection result of the text image, before all the bounding boxes are clustered according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, the method further comprises:
when the geometric information comprises the pixel width of the target image area, selecting a target pixel width from all the pixel widths;
removing the corresponding boundary boxes with the pixel width less than or equal to the target pixel width from the character detection result, and triggering and executing the step of clustering all the boundary boxes according to the coordinate information of each boundary box to obtain at least one boundary box clustering set;
when the geometric information comprises the pixel length of the target image area, selecting a target pixel length from all the pixel lengths;
removing the corresponding bounding boxes with the pixel length less than or equal to the target pixel length from the character detection result, and triggering and executing the step of clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box clustering set;
when the geometric information comprises the pixel area of the target image area, selecting a target pixel area from all the pixel areas;
and removing the corresponding boundary box with the pixel area smaller than or equal to the target pixel area from the character detection result, and triggering and executing the step of clustering all the boundary boxes according to the coordinate information of each boundary box to obtain at least one boundary box clustering set.
10. An apparatus for recognizing a text image, the apparatus comprising:
the preprocessing module is used for preprocessing the text image bearing the target text to obtain a model input image;
a first analysis module, configured to input the model input image to a predetermined character detection model for analysis, so as to obtain a character detection result of the text image, where the character detection result includes at least one bounding box used for framing a target image area in the text image, each bounding box at least includes coordinate information used for indicating a position of the target image area in the text image, and the target image area is an image area where a single character in the target text is located in the text image;
the clustering module is used for clustering all the boundary boxes according to the coordinate information of each boundary box to obtain at least one boundary box clustering set, wherein the characters corresponding to all the boundary boxes in each boundary box clustering set are the characters in the same line in the target text;
the second analysis module is used for inputting the image in the target image area selected by each bounding box into a predetermined character recognition model for analysis to obtain a character recognition result of the bounding box;
and the determining module is used for determining the text recognition result of the text image according to the character recognition result, the coordinate information and the boundary box clustering set of each boundary box.
11. An apparatus for recognizing a text image, the apparatus comprising:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute the method of recognizing a text image according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011138322.4A CN112507782B (en) | 2020-10-22 | Text image recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112507782A true CN112507782A (en) | 2021-03-16 |
CN112507782B CN112507782B (en) | 2025-04-01 |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107748888A (en) * | 2017-10-13 | 2018-03-02 | 众安信息技术服务有限公司 | A kind of image text row detection method and device |
CN109993040A (en) * | 2018-01-03 | 2019-07-09 | 北京世纪好未来教育科技有限公司 | Text recognition method and device |
US20190325211A1 (en) * | 2018-04-18 | 2019-10-24 | Google Llc | Systems and methods for assigning word fragments to text lines in optical character recognition-extracted data |
CN109284750A (en) * | 2018-08-14 | 2019-01-29 | 北京市商汤科技开发有限公司 | Bank slip recognition method and device, electronic equipment and storage medium |
CN111539330A (en) * | 2020-04-17 | 2020-08-14 | 西安英诺视通信息技术有限公司 | Transformer substation digital display instrument identification method based on double-SVM multi-classifier |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113221632A (en) * | 2021-03-23 | 2021-08-06 | 奇安信科技集团股份有限公司 | Document picture identification method and device and computer equipment |
CN113537199A (en) * | 2021-08-13 | 2021-10-22 | 上海淇玥信息技术有限公司 | Image bounding box screening method, system, electronic device and medium |
WO2023029116A1 (en) * | 2021-08-30 | 2023-03-09 | 广东艾檬电子科技有限公司 | Text image typesetting method and apparatus, electronic device, and storage medium |
CN113850208A (en) * | 2021-09-29 | 2021-12-28 | 平安科技(深圳)有限公司 | Picture information structuring method, device, equipment and medium |
CN114119349A (en) * | 2021-10-13 | 2022-03-01 | 广东金赋科技股份有限公司 | Image information extraction method, device and medium |
CN113903040A (en) * | 2021-10-27 | 2022-01-07 | 上海商米科技集团股份有限公司 | Text recognition method, equipment, system and computer readable medium for shopping receipt |
CN114140800A (en) * | 2021-10-29 | 2022-03-04 | 广东省电信规划设计院有限公司 | Method and device for intelligent positioning of pages |
CN114155535A (en) * | 2021-12-07 | 2022-03-08 | 创优数字科技(广东)有限公司 | Text detection method, device, storage medium and computer equipment |
CN114937270A (en) * | 2022-05-05 | 2022-08-23 | 上海迥灵信息技术有限公司 | Ancient book word processing method, ancient book word processing device and computer readable storage medium |
CN115471919A (en) * | 2022-09-19 | 2022-12-13 | 江苏至真健康科技有限公司 | Filing method and system based on portable mydriasis-free fundus camera |
CN115471919B (en) * | 2022-09-19 | 2023-09-12 | 江苏至真健康科技有限公司 | Filing method and system based on portable mydriasis-free fundus camera |
CN115546810A (en) * | 2022-11-29 | 2022-12-30 | 支付宝(杭州)信息技术有限公司 | Image element category identification method and device |
CN115546810B (en) * | 2022-11-29 | 2023-04-11 | 支付宝(杭州)信息技术有限公司 | Image element category identification method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9542752B2 (en) | Document image compression method and its application in document authentication | |
CN111680690B (en) | Character recognition method and device | |
JP5492205B2 (en) | Segment print pages into articles | |
CN111340023B (en) | Text recognition method and device, electronic equipment and storage medium | |
CN107247950A (en) | A kind of ID Card Image text recognition method based on machine learning | |
CN110598566A (en) | Image processing method, device, terminal and computer readable storage medium | |
CN108108734B (en) | License plate recognition method and device | |
US11151402B2 (en) | Method of character recognition in written document | |
CN111461133B (en) | Express delivery surface single item name identification method, device, equipment and storage medium | |
CN112926564B (en) | Picture analysis method, system, computer device and computer readable storage medium | |
CN112052845A (en) | Image recognition method, device, equipment and storage medium | |
EP1583023A1 (en) | Model of documents and method for automatically classifying a document | |
CN110210297A (en) | The method declaring at customs the positioning of single image Chinese word and extracting | |
CN113139535A (en) | OCR document recognition method | |
JPH05225378A (en) | Area dividing system for document image | |
CN113963353A (en) | Character image processing and identifying method and device, computer equipment and storage medium | |
CN114241463A (en) | Signature verification method and device, computer equipment and storage medium | |
CN111738979A (en) | Automatic certificate image quality inspection method and system | |
CN109948440B (en) | Table image analysis method, device, computer equipment and storage medium | |
CN111583156B (en) | Document image shading removing method and system | |
CN115797942B (en) | Propaganda information interaction method and system | |
CN115410191B (en) | Text image recognition method, device, equipment and storage medium | |
CN112507782B (en) | Text image recognition method and device | |
CN117218673A (en) | Bill identification method and device, computer readable storage medium and electronic equipment | |
CN112507782A (en) | Text image recognition method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant |