
CN112560847B - Image text area positioning method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN112560847B
Authority
CN
China
Prior art keywords
image
text
single word
word frame
line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011561668.5A
Other languages
Chinese (zh)
Other versions
CN112560847A (en)
Inventor
何龚敏
杨俊�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN202011561668.5A
Publication of CN112560847A
Application granted
Publication of CN112560847B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/30 - Noise filtering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 - Character recognition
    • G06V 30/14 - Image acquisition
    • G06V 30/148 - Segmentation of character regions
    • G06V 30/153 - Segmentation of character regions using recognition of characters or words
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 - Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Input (AREA)

Abstract


The present application provides an image text region positioning method and device, a storage medium and an electronic device. For a text image of the plain text type, the text image is dilated so that adjacent characters are connected into text line connected regions, and the image text region is determined from the circumscribed rectangles of the text line connected regions. For a text image of the text straight line staggered type, the image text region is determined by detecting the straight-line frame in the text image. For a text image of the complex background layout type, the single-word frames in the text image are identified and merged into text line connected regions, and the image text region is determined from the text line connected regions together with the detected straight-line frame. Thus, by identifying the straight-line frame and/or the circumscribed rectangles of connected regions in a text image, the upper, lower, left and right edge positions of each text line are accurately located, and the positioning is universal across the various types of text images.

Description

Image text region positioning method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and apparatus for positioning an image text region, a storage medium, and an electronic device.
Background
Image text recognition is widely required in many fields; related applications include identity card recognition, license plate number recognition, express waybill recognition, bank card number recognition, and the like. Image text recognition refers to recognizing the text present in a text image. Text images are generally classified into plain text images, text straight line staggered images (e.g., form images), and complex background layout images (e.g., ticket images). Image text region positioning is a precondition for image text recognition.
At present, the mainstream scheme for image text region positioning is pixel projection analysis. The image is binarized so that the characters are black and the background is white, the pixel points are projected horizontally, and the number of black pixels on each row is counted to obtain a pixel distribution map. A threshold is then set, and the starting point and ending point of each peak on the distribution map are taken as the upper and lower boundaries of a text line.
This existing scheme can only locate the upper and lower edge positions of each text line in a plain text image; the left and right edge positions are difficult to locate. In addition, for text lines containing few characters, the peak after horizontal projection is small, which makes the subsequent threshold hard to set: if the threshold is too small, noise is recognized as a text line, and if it is too large, text lines with few single words are ignored. Moreover, the scheme cannot locate the image text region of a text image containing straight lines.
Disclosure of Invention
The application provides a method and a device for positioning an image text region, a storage medium and electronic equipment, and aims to improve accuracy and universality of positioning the image text region.
In order to achieve the above object, the present application provides the following technical solutions:
an image text region locating method, comprising:
Acquiring a text image to be positioned, and determining the image type of the text image to be positioned, wherein the image type comprises a plain text type, a text straight line staggered type or a complex background layout type;
if the image type of the text image to be positioned is the plain text type, performing image preprocessing on the text image to be positioned, performing dilation on the preprocessed image to obtain a target text image, identifying each text line connected region in the target text image, determining coordinate values of the circumscribed rectangle of each text line connected region, and determining the text line regions in the text image to be positioned based on those coordinate values, wherein adjacent pixels within each text line connected region have the same pixel value;
if the image type of the text image to be positioned is the text straight line staggered type, performing image preprocessing on the text image to be positioned, performing horizontal line detection and vertical line detection on the preprocessed image, determining a plurality of rectangles based on the detected horizontal and vertical lines, and determining the text line regions in the text image to be positioned according to the coordinate values of the rectangles;
if the image type of the text image to be positioned is the complex background layout type, inputting the text image to be positioned into a pre-built single-word recognition model to obtain the predicted coordinates and confidence of the single-word frame corresponding to each single word in the text image, determining a plurality of target single-word frames from the single-word frames based on their confidences, merging target single-word frames that are adjacent in the horizontal direction to obtain a plurality of text line connected regions, performing horizontal line and vertical line detection on the text image, and determining the text line regions in the text image to be positioned according to the text line connected regions and the detected horizontal and vertical lines.
In the above method, optionally, performing image preprocessing on the text image to be positioned includes:
performing grayscale conversion on the text image to be positioned to obtain a grayscale image;
filtering the grayscale image to obtain a filtered image;
performing adaptive binarization on the filtered image to obtain a binarized image;
and inverting the pixel value of each pixel point in the binarized image.
In the above method, optionally, filtering the grayscale image to obtain a filtered image includes:
sliding the center of a preset filtering window over each pixel point in the grayscale image;
when the center of the filtering window reaches a pixel point, selecting, based on the noise type of the text image to be positioned, the preset filtering calculation formula corresponding to that noise type, calculating the filtered gray value within the current filtering window based on the selected formula, and taking the calculated value as the pixel value of that pixel point.
In the above method, optionally, performing dilation on the preprocessed text image to obtain the target text image includes:
dilating the preprocessed text image based on a first sliding window, wherein the width of the first sliding window is determined according to the spacing between adjacent characters in the text image to be positioned, and its height is determined according to the line spacing of the text lines in the text image to be positioned.
In the above method, optionally, performing horizontal line detection on the preprocessed text image includes:
performing erosion on the preprocessed text image based on a preset second sliding window to obtain a first eroded image;
performing dilation on the first eroded image based on a preset third sliding window to obtain a first dilated image;
identifying each horizontal connected region in the first dilated image, and determining the circumscribed rectangle of each horizontal connected region;
and, for each horizontal connected region, calculating the coordinates of the two endpoints of the corresponding horizontal line from the coordinates of its circumscribed rectangle.
In the above method, optionally, performing vertical line detection on the preprocessed text image includes:
performing erosion on the preprocessed text image based on a preset fourth sliding window to obtain a second eroded image;
performing dilation on the second eroded image based on a preset fifth sliding window to obtain a second dilated image;
identifying each vertical connected region in the second dilated image, and determining the circumscribed rectangle of each vertical connected region;
and, for each vertical connected region, calculating the coordinates of the two endpoints of the corresponding vertical line from the coordinates of its circumscribed rectangle.
In the above method, optionally, determining a plurality of target single-word frames from the single-word frames based on the confidence of each single-word frame includes:
for each single-word frame, determining it as an initial single-word frame if its confidence is not smaller than a preset confidence threshold;
forming a single-word frame set from the initial single-word frames;
selecting a first single-word frame from the current single-word frame set, the first single-word frame being the initial single-word frame with the highest confidence among all initial single-word frames contained in the current set;
for each initial single-word frame remaining in the set, calculating its area overlap rate with the first single-word frame, and deleting it from the set if the overlap rate is larger than a preset overlap threshold;
determining the first single-word frame to be a target single-word frame, and judging whether the current set is empty;
and, if the current set is not empty, returning to the step of selecting a first single-word frame from the current set, until the current set is empty.
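The target-frame selection procedure above is essentially non-maximum suppression over the predicted single-word frames. A minimal Python sketch, under the assumption that the "area overlap rate" is measured relative to the smaller of the two frames (the patent does not specify the denominator); the boxes and thresholds are illustrative:

```python
def select_target_boxes(boxes, conf_thresh=0.5, overlap_thresh=0.5):
    """Keep boxes above the confidence threshold, then repeatedly take the
    highest-confidence box and drop any remaining box that overlaps it too
    much (non-maximum suppression). Each box is (x1, y1, x2, y2, confidence)."""
    def overlap(a, b):
        # Intersection area over the smaller box's area (assumed definition).
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        smaller = min(area_a, area_b)
        return (ix * iy) / smaller if smaller else 0.0

    pool = [b for b in boxes if b[4] >= conf_thresh]   # initial single-word frames
    targets = []
    while pool:
        best = max(pool, key=lambda b: b[4])           # the "first single-word frame"
        pool.remove(best)
        pool = [b for b in pool if overlap(best, b) <= overlap_thresh]
        targets.append(best)
    return targets

# Two heavily overlapping detections of one character plus a distinct one;
# the 0.3-confidence box is filtered out by the confidence threshold.
boxes = [(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8),
         (20, 0, 30, 10, 0.7), (0, 0, 5, 5, 0.3)]
kept = select_target_boxes(boxes)
```

The surviving frames would then be merged horizontally into text line connected regions, as the claim describes.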
An image text region locating device comprising:
an acquisition unit, configured to acquire a text image to be positioned and determine the image type of the text image to be positioned, wherein the image type comprises a plain text type, a text straight line staggered type or a complex background layout type;
a first positioning unit, configured to, if the image type is the plain text type, perform image preprocessing on the text image to be positioned, perform dilation on the preprocessed image to obtain a target text image, identify each text line connected region in the target text image, determine the coordinate values of the circumscribed rectangle of each text line connected region, and determine the text line regions in the text image to be positioned based on those coordinate values, wherein adjacent pixel points within each text line connected region have the same pixel value;
a second positioning unit, configured to, if the image type is the text straight line staggered type, perform image preprocessing on the text image to be positioned, perform horizontal line detection and vertical line detection on the preprocessed image, determine a plurality of rectangles based on the detected horizontal and vertical lines, and determine the text line regions according to the coordinate values of the rectangles;
and a third positioning unit, configured to, if the image type is the complex background layout type, input the text image to be positioned into a pre-built single-word recognition model to obtain the predicted coordinates and confidence of the single-word frame corresponding to each single word, determine a plurality of target single-word frames from the single-word frames based on their confidences, merge horizontally adjacent target single-word frames to obtain a plurality of text line connected regions, perform horizontal line and vertical line detection on the text image, and determine the text line regions according to the text line connected regions and the detected horizontal and vertical lines.
A storage medium comprising stored instructions, wherein the instructions, when executed, control a device in which the storage medium is located to perform the image text region locating method described above.
An electronic device comprising a memory and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors to perform the image text region positioning method described above.
Compared with the prior art, the application has the following advantages:
The application provides an image text region positioning method and device. The method adopts different positioning strategies for different types of text images. For a plain text image, dilation connects adjacent characters into text line connected regions, and the circumscribed rectangles of those regions locate the regions containing text. For a text straight line staggered image, the straight-line frame detected in the image locates the text region. For a complex background layout image, a single-word recognition model identifies the single-word frame corresponding to each single word, the single-word frames are merged into text line connected regions, and the text region is located from the detected straight-line frame together with the connected regions. Therefore, the technical scheme accurately locates the upper, lower, left and right edge positions of each text line by identifying the straight-line frame and/or the circumscribed rectangles of connected regions in the text image, and, by adopting different positioning strategies for different image types, achieves universality of image text region positioning across all types of text images.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for locating an image text region according to the present application;
FIG. 2 is a flowchart of another method for locating text regions of an image according to the present application;
FIG. 3 is a flowchart of another method for locating text regions of an image according to the present application;
FIG. 4 is a flowchart of another method for locating text regions of an image according to the present application;
FIG. 5 is a flowchart of another method for locating text regions of an image according to the present application;
FIG. 6 is a schematic structural diagram of an image text region positioning device provided by the application;
FIG. 7 is a schematic structural diagram of an electronic device according to the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The application is operational with numerous general purpose or special purpose computing device environments or configurations. Such as a personal computer, a server computer, a hand-held or portable device, a tablet device, a multiprocessor apparatus, a distributed computing environment including any of the above, and the like.
The embodiment of the application provides a method for positioning an image text region, which can be applied to various system platforms. The execution subject of the method may run on a computer terminal or on the processor of various mobile devices. A flowchart of the method is shown in fig. 1, and it specifically comprises the following steps:
S101, acquiring a text image to be positioned, and determining the image type of the text image to be positioned.
The text image to be positioned is acquired, and its image type is determined, wherein the image type comprises a plain text type, a text straight line staggered type or a complex background layout type.
Optionally, the specific process of determining the image type of the text image to be positioned comprises receiving the image type uploaded by a user, or inputting the text image to be positioned into a pre-constructed image recognition model and taking the image type output by the model. Alternatively, the image recognition model may be a classification model; for its specific construction, refer to the construction process of an existing convolutional neural network classification model.
S102, if the image type of the text image to be positioned is the plain text type, performing image preprocessing on the text image to be positioned.
If the image type of the text image to be positioned is the plain text type, image preprocessing is performed on it. The preprocessing comprises grayscale conversion, filtering, adaptive binarization and pixel inversion, which enhance the image quality of the text image to be positioned.
Referring to fig. 2, the process of image preprocessing on the text image to be positioned specifically includes:
S201, performing grayscale conversion on the text image to be positioned to obtain a grayscale image.
Grayscale conversion is performed on the text image to be positioned to obtain its grayscale image. Specifically, each pixel point in the text image is converted according to a preset grayscale formula; in the resulting grayscale image, each pixel point represents a depth of gray with a value between 0 and 255.
Optionally, the preset grayscale formula is as follows:
GRAY = R×0.299 + G×0.587 + B×0.114
wherein R, G and B respectively represent the red, green and blue values, and GRAY is the resulting gray value.
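The grayscale formula can be applied per pixel with NumPy; a minimal sketch, in which the 1x2 test image is illustrative and rounding to the nearest integer is an implementation choice:

```python
import numpy as np

def to_grayscale(rgb: np.ndarray) -> np.ndarray:
    """Weighted grayscale conversion: GRAY = R*0.299 + G*0.587 + B*0.114."""
    weights = np.array([0.299, 0.587, 0.114])
    # Matrix product over the last (channel) axis, rounded to nearest integer.
    return np.rint(rgb[..., :3] @ weights).astype(np.uint8)

# A 1x2 "image": a pure red pixel and a pure white pixel.
img = np.array([[[255, 0, 0], [255, 255, 255]]], dtype=np.uint8)
gray = to_grayscale(img)
```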
S202, filtering the grayscale image to obtain a filtered image.
Filtering is performed on the grayscale image, i.e., on the text image after grayscale conversion, to obtain the filtered image of the text image to be positioned. The specific filtering process may include:
sliding the center of a preset filtering window over each pixel point in the grayscale image;
when the center of the filtering window reaches a pixel point, selecting, based on the noise type of the text image to be positioned, the preset filtering calculation formula corresponding to that noise type, calculating the filtered gray value within the current filtering window based on the selected formula, and taking the calculated value as the pixel value of that pixel point.
In the method provided by the embodiment of the application, the preset filtering window slides over the grayscale image in a preset sliding manner, so that its center passes over every pixel point. The preset sliding manner is a configured sliding pattern and is not limited here.
Each time the window slides to a pixel point, the filtering formula corresponding to the noise type is selected and applied. If the noise type of the text image to be positioned is white noise, the gray value within each filtering window is calculated by a Gaussian filtering formula; if the noise type is salt-and-pepper noise, the gray value within each filtering window is calculated by a median filtering formula.
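As a concrete illustration of the window-based filtering described above, here is a minimal NumPy sketch of the median filter used for salt-and-pepper noise; the window size and toy image are illustrative:

```python
import numpy as np

def median_filter(gray: np.ndarray, ksize: int = 3) -> np.ndarray:
    """Slide a ksize x ksize window over every pixel and replace the pixel
    with the median of the window (edge pixels use replicated padding)."""
    pad = ksize // 2
    padded = np.pad(gray, pad, mode='edge')
    out = np.empty_like(gray)
    h, w = gray.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = np.median(padded[y:y + ksize, x:x + ksize])
    return out

# A flat region with one salt-noise pixel; the median removes it.
img = np.full((5, 5), 10, dtype=np.uint8)
img[2, 2] = 255
filtered = median_filter(img)
```

For white noise, the same sliding-window loop would instead apply a Gaussian-weighted average within the window.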
S203, performing adaptive binarization on the filtered image to obtain a binarized image.
Adaptive binarization is performed on each pixel point in the filtered image so that the gray value of every pixel point becomes either 0 or 255, yielding the binarized image of the text image to be positioned, where 0 represents black and 255 represents white.
Optionally, the specific process of adaptive binarization on the filtered image includes:
sliding the center of a preset binarization window over each pixel point in the filtered image; when the window reaches a pixel point, calculating the current binarization threshold from the pixel values of all pixel points within the current window, and comparing the pixel value of the center pixel point with that threshold; if the center pixel value is larger than the threshold, a preset first value is taken as its pixel value, otherwise a preset second value is taken, wherein the first value is 255 and the second value is 0.
In the method provided by the embodiment of the application, adaptive binarization makes the whole image present only black and white gray values, thereby highlighting the target contours, i.e., the contour of each text line. Because the threshold for each pixel point is not fixed but determined by the pixel values within the binarization window, the threshold is higher in brighter image areas and correspondingly lower in darker areas, so this adaptive approach is applicable to images of different brightness, contrast and texture.
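The adaptive thresholding described above can be sketched directly. Here the per-window threshold is the window mean, which is one common choice; the patent does not fix the threshold formula, and the toy image is illustrative:

```python
import numpy as np

def adaptive_binarize(gray: np.ndarray, ksize: int = 3) -> np.ndarray:
    """Per-pixel thresholding: each pixel is compared against the mean of
    its local ksize x ksize window and set to 255 (above) or 0 (not above)."""
    pad = ksize // 2
    padded = np.pad(gray.astype(float), pad, mode='edge')
    out = np.zeros_like(gray, dtype=np.uint8)
    h, w = gray.shape
    for y in range(h):
        for x in range(w):
            thresh = padded[y:y + ksize, x:x + ksize].mean()
            out[y, x] = 255 if gray[y, x] > thresh else 0
    return out

# A bright background with one dark "stroke" pixel in the middle:
# the dark pixel falls below its local mean and becomes 0.
img = np.full((3, 3), 200, dtype=np.uint8)
img[1, 1] = 50
out = adaptive_binarize(img)
```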
S204, inverting the pixel value of each pixel point in the binarized image.
The pixel value of each pixel point in the binarized image is inverted: a pixel with the first value is set to the second value, and a pixel with the second value is set to the first value. That is, a pixel value of 255 is inverted to 0, and a pixel value of 0 is inverted to 255, realizing the black-and-white color inversion of the binarized image.
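The inversion step is a single arithmetic operation on the binarized image; a minimal illustration:

```python
import numpy as np

# Pixel-value inversion of a binarized image: 255 -> 0 and 0 -> 255,
# so the text strokes become white on a black background.
binary = np.array([[0, 255], [255, 0]], dtype=np.uint8)
inverted = 255 - binary
```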
S103, performing dilation on the preprocessed text image to obtain a target text image.
Dilation is performed on the preprocessed text image so that the adjacent characters of each line are connected into a whole, obtaining the target text image. The specific dilation process is as follows:
the preprocessed text image is dilated based on a first sliding window, wherein the width of the first sliding window is determined according to the spacing between adjacent characters in the text image to be positioned, and its height is determined according to the line spacing of the text lines in the text image to be positioned.
In the method provided by the embodiment of the application, the center of the first sliding window slides over each pixel point of the preprocessed text image in a preset sliding manner; when the center reaches a pixel point, the current maximum pixel value within the coverage of the first sliding window is taken as the pixel value of that pixel point, realizing the dilation.
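A minimal sketch of this dilation step, taking the maximum pixel value under a sliding window; the 1x3 window and toy row are illustrative, standing in for a window sized from the actual character spacing and line spacing:

```python
import numpy as np

def dilate(binary: np.ndarray, win_h: int, win_w: int) -> np.ndarray:
    """Dilation: each pixel takes the maximum value under the win_h x win_w
    window centred on it (zero padding outside the image)."""
    ph, pw = win_h // 2, win_w // 2
    padded = np.pad(binary, ((ph, ph), (pw, pw)), mode='constant')
    out = np.empty_like(binary)
    h, w = binary.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = padded[y:y + win_h, x:x + win_w].max()
    return out

# Two "characters" one column apart merge into a single run after a 1x3
# dilation, forming one text line connected region.
row = np.array([[0, 255, 0, 255, 0]], dtype=np.uint8)
merged = dilate(row, 1, 3)
```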
S104, identifying each text line connected region in the target text image, and determining coordinate values of circumscribed rectangles of each text line connected region.
And identifying each text line connected region in the target text image, wherein the pixel values of adjacent pixel points in each text line connected region are the same.
And determining coordinate values of circumscribed rectangles of each text line connected region based on the identified text line connected regions, namely determining circumscribed contours of the text line connected regions as circumscribed rectangles of the text line connected regions for each text line connected region, wherein after the circumscribed rectangles are determined, the coordinate values of the circumscribed rectangles can be determined, and when the description is needed, the coordinate values of the circumscribed rectangles are coordinate values of four corners of the circumscribed rectangles.
Optionally, after determining the circumscribed rectangle of each text line connected region, circumscribed rectangles with an obviously excessive or insufficient height and/or width can be further deleted, that is, circumscribed rectangles whose height is not within a preset height range and/or whose width is not within a preset width range are deleted, and circumscribed rectangles whose height is within the preset height range and whose width is within the preset width range are reserved.
S105, determining text line areas in the text image to be positioned based on coordinate values of circumscribed rectangles of the text line connected areas.
And determining the text line areas in the text image to be positioned based on the coordinate values of the circumscribed rectangles of the text line connected areas, namely, determining a rectangle corresponding to one text line area in the text image to be positioned by the coordinate value of each circumscribed rectangle, and determining all text line areas in the text image to be positioned by the coordinate values of all circumscribed rectangles.
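Steps S104-S105 amount to labeling connected foreground regions and taking each region's bounding rectangle. A self-contained sketch (the BFS labeling and the toy image are illustrative, not from the patent; a library call such as OpenCV's `cv2.connectedComponents` plus `cv2.boundingRect` would normally be used):

```python
import numpy as np
from collections import deque

def bounding_rects(binary):
    """Label 4-connected foreground regions in a binary image and return
    each region's circumscribed rectangle as (x_left, y_top, x_right, y_bottom)."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    rects = []
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not seen[sy, sx]:
                q = deque([(sy, sx)]); seen[sy, sx] = True
                x0 = x1 = sx; y0 = y1 = sy
                while q:                       # flood-fill one region
                    y, x = q.popleft()
                    x0, x1 = min(x0, x), max(x1, x)
                    y0, y1 = min(y0, y), max(y1, y)
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                rects.append((x0, y0, x1, y1))
    return rects

img = np.array([[1, 1, 0, 0],
                [0, 0, 0, 1]], dtype=np.uint8)
print(bounding_rects(img))   # two regions, two rectangles
```

Each returned rectangle corresponds to one text line area, as described in S105.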
S106, if the image type of the text image to be positioned is the text straight line staggered type, performing image preprocessing on the text image to be positioned.
If the image type of the text image to be positioned is the text straight line staggered type, then, as with a text image of the plain text type, the text image to be positioned needs to be subjected to image preprocessing; the specific process of the image preprocessing is shown in the embodiment of fig. 2 and is not repeated here.
And S107, performing horizontal line detection and vertical line detection on the text image to be positioned after the image preprocessing, and determining a plurality of rectangles based on each detected horizontal line and each detected vertical line.
In the method provided by the embodiment of the application, horizontal line detection and vertical line detection are carried out on the text image to be positioned after the image pretreatment, namely, the linear frame in the text image to be positioned after the image pretreatment is detected.
In the method provided by the embodiment of the application, the horizontal line in the text image to be positioned after the image pretreatment is detected by performing the horizontal corrosion treatment on the text image to be positioned after the image pretreatment and performing the horizontal expansion treatment after the horizontal corrosion treatment.
Referring to fig. 3, a process of performing horizontal line detection on a text image to be positioned after image preprocessing specifically includes:
S301, performing corrosion treatment on the text image to be positioned after the image pretreatment based on a preset second sliding window to obtain a first corrosion image.
In the method provided by the embodiment of the application, the text image to be positioned after the image preprocessing is subjected to corrosion treatment based on the preset second sliding window, namely, horizontal corrosion is performed on the text image to be positioned after the image preprocessing to obtain the first corrosion image; the specific process of the corrosion treatment refers to the existing image corrosion process and is not repeated here.
It should be noted that the aspect ratio of the second sliding window is greater than a first threshold. Optionally, the first threshold may be 30, that is, the aspect ratio of the second sliding window is greater than 30. Optionally, the height of the second sliding window satisfies a preset first height range, and the first height range is 1-2 pixels.
In the method provided by the embodiment of the application, the second sliding window is used for carrying out corrosion treatment on the text image to be positioned after the image pretreatment, so that the image elements of the vertical line and other non-horizontal lines can be restrained.
S302, based on a preset third sliding window, performing expansion processing on the first corrosion image to obtain a first expansion image.
In the method provided by the embodiment of the application, the expansion processing is performed on the first corrosion image based on the preset third sliding window, that is, the first corrosion image is horizontally expanded to obtain the first expansion image, and it is noted that the specific process of the expansion processing refers to the existing image expansion process and is not repeated here.
It should be noted that the aspect ratio of the third sliding window is greater than a second threshold. Optionally, the second threshold may be 20, that is, the aspect ratio of the third sliding window is greater than 20. Optionally, the height of the third sliding window satisfies a preset second height range, and the second height range is 1-5 pixels.
S303, identifying each horizontal connected region in the first expansion image, and determining the circumscribed rectangle of each horizontal connected region.
And identifying each horizontal connected region in the first expansion image, wherein each horizontal connected region is a horizontally expanded region, determining the circumscribed contour of the horizontal connected region, and further determining the circumscribed rectangle of each horizontal connected region.
S304, for each horizontal connected region, calculating coordinates of the two endpoints of the horizontal line corresponding to the circumscribed rectangle according to the coordinates of the circumscribed rectangle of the horizontal connected region.
For each horizontal connected region, the coordinates of the two endpoints of the horizontal line corresponding to the circumscribed rectangle are calculated according to the coordinates of the circumscribed rectangle of the horizontal connected region.
The specific process of calculating, for each horizontal connected region, the coordinates of the two endpoints of the horizontal line corresponding to the circumscribed rectangle according to the coordinates of the circumscribed rectangle comprises the following steps:
The average value between the ordinate of the upper edge and the ordinate of the lower edge of the circumscribed rectangle of the horizontal connected region is calculated and taken as the ordinate of the two endpoints of the horizontal line corresponding to the circumscribed rectangle; the abscissa of the left edge of the circumscribed rectangle is taken as the abscissa of the left endpoint of the horizontal line, and the abscissa of the right edge of the circumscribed rectangle is taken as the abscissa of the right endpoint of the horizontal line. For example, if the upper edge of the circumscribed rectangle has an ordinate y_up, the lower edge has an ordinate y_down, the left edge has an abscissa x_left, and the right edge has an abscissa x_right, then the left endpoint of the horizontal line corresponding to the circumscribed rectangle is (x_left, (y_up + y_down)/2) and the right endpoint is (x_right, (y_up + y_down)/2).
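Using the rectangle coordinates named in the example above, the endpoint computation of S304 can be sketched as follows (an illustrative helper, not from the patent; the "corrosion" and "expansion" steps preceding it correspond to standard morphological erosion and dilation):

```python
def horizontal_endpoints(x_left, y_up, x_right, y_down):
    """Endpoints of the horizontal line for a bounding rectangle: the
    ordinate is the mid-line of the rectangle, and the line spans the
    rectangle's full width."""
    y_mid = (y_up + y_down) / 2
    return (x_left, y_mid), (x_right, y_mid)

# A thin rectangle spanning x in [10, 90] with y in [4, 6] collapses to
# a horizontal line at y = 5.
print(horizontal_endpoints(10, 4, 90, 6))
```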
In the method provided by the embodiment of the application, the vertical line in the text image to be positioned after the image pretreatment is detected by performing vertical corrosion treatment on the text image to be positioned after the image pretreatment and performing vertical expansion treatment after the vertical corrosion treatment.
Referring to fig. 4, a process of performing vertical line detection on a text image to be positioned after image preprocessing specifically includes:
s401, performing corrosion treatment on the text image to be positioned after the image pretreatment based on a preset fourth sliding window to obtain a second corrosion image.
In the method provided by the embodiment of the application, the text image to be positioned after the image preprocessing is subjected to corrosion treatment based on the preset fourth sliding window, namely, vertical corrosion is performed on the text image to be positioned after the image preprocessing to obtain the second corrosion image; the specific process of the corrosion treatment refers to the existing image corrosion process and is not repeated here.
It should be noted that, the aspect ratio of the fourth sliding window is smaller than the third threshold, alternatively, the third threshold may be 1/30, that is, the aspect ratio of the fourth sliding window is smaller than 1/30, alternatively, the width of the fourth sliding window may be 1 pixel point.
In the method provided by the embodiment of the application, the corrosion treatment is carried out on the text image to be positioned after the image pretreatment by using the fourth sliding window, so that the image elements of horizontal lines and other non-vertical lines can be restrained.
S402, expanding the second corrosion image based on a preset fifth sliding window to obtain a second expanded image.
In the method provided by the embodiment of the application, the second corrosion image is subjected to expansion processing based on the preset fifth sliding window, that is, the second corrosion image is vertically expanded to obtain the second expansion image. It should be noted that the specific process of the expansion processing refers to the existing image expansion process and is not repeated here.
It should be noted that the aspect ratio of the fifth sliding window is smaller than a fourth threshold. Optionally, the fourth threshold may be 1/20, that is, the aspect ratio of the fifth sliding window is smaller than 1/20. Optionally, the width of the fifth sliding window may be smaller than a fifth threshold, and the fifth threshold may be 5 pixels, that is, the width of the fifth sliding window may be smaller than 5 pixels.
S403, identifying each vertical connected region in the second expansion image, and determining the circumscribed rectangle of each vertical connected region.
And identifying each vertical connected region in the second expansion image, wherein each vertical connected region is a vertically expanded region, determining the circumscribed contour of the vertical connected region, and further determining the circumscribed rectangle of each vertical connected region.
S404, for each vertical connected region, calculating coordinates of the two endpoints of the vertical line corresponding to the circumscribed rectangle according to the coordinates of the circumscribed rectangle of the vertical connected region.
For each vertical connected region, the coordinates of the two endpoints of the vertical line corresponding to the circumscribed rectangle are calculated according to the coordinates of the circumscribed rectangle of the vertical connected region.
The specific process of calculating, for each vertical connected region, the coordinates of the two endpoints of the vertical line corresponding to the circumscribed rectangle according to the coordinates of the circumscribed rectangle comprises the following steps:
The average value between the abscissa of the left edge and the abscissa of the right edge of the circumscribed rectangle of the vertical connected region is calculated and taken as the abscissa of the two endpoints of the vertical line corresponding to the circumscribed rectangle; the ordinate of the upper edge of the circumscribed rectangle is taken as the ordinate of the upper endpoint of the vertical line, and the ordinate of the lower edge of the circumscribed rectangle is taken as the ordinate of the lower endpoint of the vertical line. For example, if the upper edge of the circumscribed rectangle has an ordinate y_up, the lower edge has an ordinate y_down, the left edge has an abscissa x_left, and the right edge has an abscissa x_right, then the upper endpoint of the vertical line corresponding to the circumscribed rectangle is ((x_left + x_right)/2, y_up) and the lower endpoint is ((x_left + x_right)/2, y_down).
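Mirroring the horizontal case, the endpoint computation of S404 can be sketched as follows (an illustrative helper, not from the patent):

```python
def vertical_endpoints(x_left, y_up, x_right, y_down):
    """Endpoints of the vertical line for a bounding rectangle: the
    abscissa is the rectangle's horizontal mid-line, and the line spans
    the rectangle's full height."""
    x_mid = (x_left + x_right) / 2
    return (x_mid, y_up), (x_mid, y_down)

# A thin rectangle spanning y in [10, 90] with x in [4, 6] collapses to
# a vertical line at x = 5.
print(vertical_endpoints(4, 10, 6, 90))
```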
In the method provided by the embodiment of the application, after the horizontal lines and vertical lines of the text image to be positioned after the image preprocessing are detected, a certain blank gap needs to be reserved for text region segmentation. Therefore, the horizontal line above and/or below a text region is copied and the copy is moved up or down, and the vertical line on the left and/or right side of a text region is copied and the copy is moved left or right.
In the method provided by the embodiment of the application, each horizontal line and each vertical line form a linear frame of the text image to be positioned, and the formed linear frame divides the text image to be positioned into a plurality of rectangles.
S108, determining a text line area in the text image to be positioned according to the coordinate value of the rectangle.
And determining text line areas in the text image to be positioned based on the coordinate values of the rectangles, namely, determining a rectangle by the coordinate value of each rectangle, wherein the determined rectangle corresponds to one text line area in the text image to be positioned, and determining all text line areas in the text image to be positioned by the coordinate values of all rectangles.
S109, if the image type of the text image to be positioned is a complex background layout type, inputting the text image to be positioned into a pre-constructed single word recognition model to obtain a coordinate prediction value and a confidence coefficient of each single word corresponding to a single word frame in the text image to be positioned.
In the method provided by the embodiment of the application, the single word recognition model is constructed in advance, and the construction process of the single word recognition model is referred to the prior art, and is not repeated here.
If the image type of the text image to be positioned is a complex background layout type, the text image to be positioned is input into the pre-constructed single word recognition model to obtain the coordinate predicted value and confidence coefficient, output by the single word recognition model, of the single word frame corresponding to each single word in the text image to be positioned. The coordinate predicted value and the confidence coefficient of each single word correspond to one single word frame respectively; the confidence coefficient ranges between 0 and 1, and a larger value indicates higher confidence.
S110, determining a plurality of target single word frames from the single word frames based on the confidence coefficient of each single word frame.
A plurality of target single word frames are determined from the single word frames according to the confidence coefficient of each single word frame. Specifically, referring to fig. 5, the process of determining a plurality of target single word frames from the single word frames based on the confidence coefficient of each single word frame includes:
S501, for each single word frame, if the confidence coefficient of the single word frame is not smaller than a preset confidence coefficient threshold value, determining the single word frame as an initial single word frame.
Each single word frame whose confidence coefficient is not smaller than the preset confidence coefficient threshold value is determined as an initial single word frame.
S502, forming each initial single word frame into a single word frame set.
S503, selecting a first single word frame from the current single word frame set, wherein the first single word frame is the initial single word frame with the highest confidence level in all the initial single word frames contained in the current single word frame set.
S504, calculating the area overlapping rate of the initial single character frame and the first single character frame according to each initial single character frame remained in the single character frame set.
For each initial single word frame remaining in the single word frame set, the intersection area and the union area of the initial single word frame and the first single word frame are calculated, and the area overlapping rate of the initial single word frame and the first single word frame is calculated according to the intersection area and the union area, namely, the intersection area is divided by the union area to obtain the area overlapping rate.
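The area overlapping rate described here is the standard intersection-over-union (IoU) measure. A minimal sketch (box format and toy values are illustrative, not from the patent):

```python
def area_overlap_rate(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2): intersection area
    divided by union area, as in step S504."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two 2x2 boxes shifted by one unit overlap in a 1x2 strip:
# intersection 2, union 6, rate 1/3.
print(area_overlap_rate((0, 0, 2, 2), (1, 0, 3, 2)))
```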
S505, for each initial single frame remaining in the single frame set, judging whether the area overlapping rate of the initial single frame and the first single frame is larger than a preset overlapping threshold value.
For each initial single frame remaining in the single frame set, based on the calculated area overlapping rate of the initial single frame and the first single frame, judging whether the area overlapping rate is greater than a preset overlapping threshold, if so, executing step S506, and if not, executing step S507.
S506, deleting the initial single word frame from the single word frame set.
For each initial single word frame remaining in the single word frame set, if the area overlapping rate of the initial single word frame and the first single word frame is larger than the preset overlapping threshold value, the initial single word frame is deleted from the single word frame set, and step S507 is executed.
S507, judging whether an initial single word frame for which the area overlapping rate with the first single word frame has not been calculated exists in the single word frame set.
Whether an initial single word frame for which the area overlapping rate with the first single word frame has not been calculated exists in the single word frame set is judged; if so, the process returns to step S505, and if not, step S508 is executed.
S508, determining the first single word frame as a target single word frame.
S509, judging whether the current single-frame set is an empty set or not.
And judging whether the current single-frame set is an empty set, if so, directly ending, otherwise, returning to the step S503.
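Steps S501-S509 describe greedy non-maximum suppression. A compact sketch of the whole loop (thresholds, box format, and toy values are illustrative, not from the patent):

```python
def nms(boxes, scores, conf_thresh=0.5, overlap_thresh=0.5):
    """Greedy NMS following S501-S509: keep only boxes above the confidence
    threshold, repeatedly take the highest-confidence remaining box as the
    first frame, and drop boxes whose overlap with it is too large."""
    def iou(a, b):
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    # S501-S502: build the initial single word frame set
    pool = [(b, s) for b, s in zip(boxes, scores) if s >= conf_thresh]
    targets = []
    while pool:                                    # S509: until the set is empty
        pool.sort(key=lambda bs: bs[1], reverse=True)
        first, _ = pool.pop(0)                     # S503: highest confidence
        pool = [(b, s) for b, s in pool
                if iou(first, b) <= overlap_thresh]  # S504-S506: suppress
        targets.append(first)                      # S508: target single word frame
    return targets

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 0, 60, 10)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # the heavily overlapping second box is suppressed
```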
Optionally, in the method provided by the embodiment of the application, the target single word frame with the height not greater than the preset threshold value is deleted.
And S111, merging the target single character frames adjacent in the horizontal direction to obtain a plurality of text line communication areas.
Combining the target single character frames adjacent in the horizontal direction to obtain a plurality of text line communication areas, wherein the specific process comprises the following steps:
According to the abscissa of the upper left corner of each target single character frame, sequencing each target single character frame according to a preset sequence to obtain a single character frame sequence;
Judging whether the distance between the upper-left-corner abscissas of two adjacent target single word frames in the single word frame sequence is larger than a preset first threshold value; if so, splitting the sequence between the two corresponding target single word frames to obtain a plurality of single word frame sequences;
The left boundary of the first target single word frame in each single word frame sequence is used as the left boundary of the text line connected region corresponding to the single word frame sequence, the right boundary of the last target single word frame in each single word frame sequence is used as the right boundary of the corresponding text line connected region, the minimum upper boundary among the target single word frames in each single word frame sequence is used as the upper boundary of the corresponding text line connected region, and the maximum lower boundary among the target single word frames in each single word frame sequence is used as the lower boundary of the corresponding text line connected region.
In the method provided by the embodiment of the application, the target single word frames are arranged in a preset order according to the upper-left-corner abscissa of each target single word frame to obtain a single word frame sequence; optionally, the preset order may be from small abscissa to large abscissa. Whether the distance between the upper-left-corner abscissas of two adjacent target single word frames in the sequence is larger than the preset first threshold value is judged; if so, the sequence is split between the two target single word frames to obtain a plurality of single word frame sequences. That is, if the distance between the upper-left-corner abscissas of several groups of adjacent single word frames in the sequence is larger than the preset first threshold value, the single word frame sequence is divided into a plurality of single word frame sequences; for example, if the distance between the upper-left-corner abscissas of 5 groups of adjacent single word frames is larger than the preset first threshold value, the single word frame sequence is finally divided into 6 groups. Each group of single word frame sequences corresponds to one text line connected region, and the boundaries of the text line connected region are determined according to the boundary values of the target single word frames in the corresponding single word frame sequence.
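The sort-split-merge procedure of S111 can be sketched as follows (box format, gap threshold, and toy values are illustrative, not from the patent):

```python
def merge_line(boxes, gap_thresh):
    """Merge horizontally adjacent word boxes (x1, y1, x2, y2) into text-line
    regions: sort by upper-left abscissa, split where the abscissa gap between
    neighbours exceeds gap_thresh, then take the union rectangle of each group."""
    boxes = sorted(boxes, key=lambda b: b[0])
    groups, cur = [], [boxes[0]]
    for prev, b in zip(boxes, boxes[1:]):
        if b[0] - prev[0] > gap_thresh:    # split between distant neighbours
            groups.append(cur); cur = []
        cur.append(b)
    groups.append(cur)
    return [(g[0][0],                      # left edge of first box
             min(b[1] for b in g),         # minimum upper boundary
             g[-1][2],                     # right edge of last box
             max(b[3] for b in g))         # maximum lower boundary
            for g in groups]

# Two close word boxes form one line region; the distant third box starts
# a new region.
words = [(0, 0, 8, 10), (10, 1, 18, 11), (60, 0, 68, 10)]
print(merge_line(words, gap_thresh=30))
```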
S112, horizontal line detection and vertical line detection are carried out on the text image to be positioned.
The specific implementation process of step S112 is described in step S107, and will not be described here again.
S113, determining text line areas in the text image to be positioned according to the text line connected areas and the detected horizontal lines and vertical lines.
After the straight line frame and each text line connected region in the text image to be positioned are determined, the text line connected regions are divided by the straight line frame. That is, the detected horizontal lines and vertical lines form the straight line frame of the text image to be positioned and divide the text image to be positioned into a plurality of regions, and the characters belonging to different regions within a text line connected region are separated, thereby obtaining a plurality of final text line connected regions. Each final text line connected region determines a rectangle, the determined rectangle corresponds to one text line area in the text image to be positioned, and all text line areas in the text image to be positioned can be determined by the coordinate values of all the rectangles.
According to the image text region positioning method provided by the embodiment of the application, different image text region positioning strategies are adopted for different types of text images. A text image of the plain text type is subjected to expansion processing so that adjacent characters are connected into text line connected regions, the circumscribed rectangles of the text line connected regions are then determined, and the regions containing text in the text image are thereby positioned. For a text image of the text straight line staggered type, the regions containing text are positioned by detecting the straight line frame in the text image. For a text image of the complex background layout type, the single word frame corresponding to each single word in the text image is identified based on the single word recognition model, the single word frames are combined into text line connected regions, and the regions containing text are positioned according to the detected straight line frame and the text line connected regions. By adopting the image text region positioning method provided by the embodiment of the application, the positions of the upper, lower, left and right edges of each text line in the text image are accurately positioned by identifying the straight line frame and/or the circumscribed rectangles of the connected regions in the text image, and by adopting different positioning strategies for different types of text images, the image text region positioning is universal for each type of text image.
Corresponding to the method shown in fig. 1, the embodiment of the present application further provides an image text region positioning device, which is used for implementing the method in fig. 1, and the structural schematic diagram of the device is shown in fig. 6, and specifically includes:
An obtaining unit 601, configured to obtain a text image to be positioned, and determine an image type of the text image to be positioned, where the image type includes a plain text type, a text straight line interleaving type, or a complex background layout type;
A first positioning unit 602, configured to perform image preprocessing on the text image to be positioned if the image type of the text image to be positioned is a plain text type, perform expansion processing on the text image to be positioned after the image preprocessing to obtain a target text image, identify each text line connected region in the target text image, determine coordinate values of circumscribed rectangles of each text line connected region, and determine text line regions in the text image to be positioned based on the coordinate values of circumscribed rectangles of each text line connected region, where pixel values of adjacent pixels in each text line connected region are the same;
The second positioning unit 603 is configured to perform image preprocessing on the text image to be positioned if the image type of the text image to be positioned is a text straight line staggered type, perform horizontal line detection and vertical line detection on the text image to be positioned after the image preprocessing, determine a plurality of rectangles based on each horizontal line and each vertical line obtained by the detection, and determine a text line area in the text image to be positioned according to coordinate values of the rectangles;
and a third positioning unit 604, configured to, if the image type of the text image to be positioned is a complex background layout type, input the text image to be positioned into a pre-constructed single word recognition model to obtain a coordinate predicted value and a confidence coefficient of the single word frame corresponding to each single word in the text image to be positioned, determine a plurality of target single word frames from the single word frames based on the confidence coefficient of each single word frame, combine the target single word frames adjacent in the horizontal direction to obtain a plurality of text line connected regions, perform horizontal line detection and vertical line detection on the text image to be positioned, and determine the text line areas in the text image to be positioned according to each text line connected region and the detected horizontal lines and vertical lines.
According to the image text region positioning device provided by the embodiment of the application, different image text region positioning strategies are adopted for different types of text images. A text image of the plain text type is subjected to expansion processing so that adjacent characters are connected into text line connected regions, the circumscribed rectangles of the text line connected regions are then determined, and the regions containing text in the text image are thereby positioned. For a text image of the text straight line staggered type, the regions containing text are positioned by detecting the straight line frame in the text image. For a text image of the complex background layout type, the single word frame corresponding to each single word in the text image is identified based on the single word recognition model, the single word frames are combined into text line connected regions, and the regions containing text are positioned according to the detected straight line frame and the text line connected regions. By adopting the image text region positioning device provided by the embodiment of the application, the positions of the upper, lower, left and right edges of each text line in the text image are accurately positioned by identifying the straight line frame and/or the circumscribed rectangles of the connected regions in the text image, and by adopting different positioning strategies for different types of text images, the image text region positioning is universal for each type of text image.
In one embodiment of the present application, based on the foregoing scheme, the first positioning unit 602 and the second positioning unit 603 each comprise:
the graying subunit is used for graying the text image to be positioned to obtain a graying image;
the filtering subunit is used for carrying out filtering processing on the gray-scale image to obtain a filtered image;
the binarization subunit is used for carrying out self-adaptive binarization processing on the filtered image to obtain a binarized image;
And the inverting subunit is used for inverting the pixel value of each pixel point in the binarized image.
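The four subunits above form a preprocessing pipeline: graying, filtering, binarization, inversion. A minimal sketch under stated assumptions (a 3x3 median filter stands in for the noise-type-dependent filter, and a global mean threshold stands in for the adaptive binarization; none of these choices are specified by the patent):

```python
import numpy as np

def preprocess(rgb):
    """Sketch of the preprocessing units: grayscale, 3x3 median filter,
    mean-threshold binarization (stand-in for adaptive thresholding),
    then pixel-value inversion."""
    gray = rgb.mean(axis=2)                        # graying subunit
    pad = np.pad(gray, 1, mode='edge')             # filtering subunit (median)
    h, w = gray.shape
    filt = np.empty_like(gray)
    for y in range(h):
        for x in range(w):
            filt[y, x] = np.median(pad[y:y + 3, x:x + 3])
    binary = (filt > filt.mean()).astype(np.uint8) * 255   # binarization subunit
    return 255 - binary                            # inverting subunit

# A bright-left / dark-right image becomes dark-left / bright-right
# after binarization and inversion.
gray = np.array([[255., 255, 0, 0], [255, 255, 0, 0]])
print(preprocess(np.stack([gray] * 3, axis=2)))
```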
In one embodiment of the present application, based on the foregoing scheme, the filtering subunit performs filtering processing on the graying image to obtain a filtered image, for:
sliding through each pixel point in the gray level image by the center of a preset filtering sliding window;
When the center of the filtering sliding window slides to one pixel point in the graying image, a preset filtering calculation formula corresponding to the noise type of the text image to be positioned is selected, a filtered gray value within the current filtering sliding window is calculated based on the selected filtering calculation formula, and the calculated filtered gray value is used as the pixel value of the pixel point.
In one embodiment of the present application, based on the foregoing scheme, the first positioning unit 602 performs expansion processing on the text image to be positioned after the image preprocessing, to obtain a target text image, which is used for:
Expanding the text image to be positioned after the image preprocessing based on a first sliding window, wherein the width of the first sliding window is determined according to the interval between adjacent characters in the text image to be positioned, and the height of the first sliding window is determined according to the line spacing between text lines in the text image to be positioned.
In one embodiment of the present application, based on the foregoing scheme, the second positioning unit 603 performs horizontal line detection on the text image to be positioned after the image preprocessing, for:
Based on a preset second sliding window, performing corrosion treatment on the text image to be positioned after the image pretreatment to obtain a first corrosion image;
Performing expansion processing on the first corrosion image based on a preset third sliding window to obtain a first expansion image;
Identifying each horizontal connected region in the first expansion image, and determining the circumscribed rectangle of each horizontal connected region;
And for each horizontal connected region, calculating the coordinates of the two endpoints of the horizontal line corresponding to the circumscribed rectangle according to the coordinates of the circumscribed rectangle of the horizontal connected region.
In one embodiment of the present application, based on the foregoing scheme, the second positioning unit 603, when performing vertical line detection on the image-preprocessed text image to be positioned, is configured to:
perform erosion processing on the image-preprocessed text image to be positioned based on a preset fourth sliding window to obtain a second eroded image;
perform dilation processing on the second eroded image based on a preset fifth sliding window to obtain a second dilated image;
identify each vertical connected region in the second dilated image, and determine a circumscribed rectangle of each vertical connected region; and
for each vertical connected region, calculate the coordinates of the two endpoints of the vertical line corresponding to the circumscribed rectangle according to the coordinates of the circumscribed rectangle of that vertical connected region.
In one embodiment of the present application, based on the foregoing scheme, the third positioning unit 604, when determining a plurality of target single word frames from the single word frames based on the confidence of each single word frame, is configured to:
for each single word frame, if the confidence of the single word frame is not less than a preset confidence threshold, determine the single word frame as an initial single word frame;
form a single word frame set from the initial single word frames;
select a first single word frame from the current single word frame set, wherein the first single word frame is the initial single word frame with the highest confidence among all initial single word frames contained in the current single word frame set;
for each initial single word frame remaining in the single word frame set, calculate the area overlap ratio between that initial single word frame and the first single word frame, and delete that initial single word frame from the single word frame set if the area overlap ratio is greater than a preset overlap threshold;
determine the first single word frame as a target single word frame, and judge whether the current single word frame set is an empty set; and
if the current single word frame set is not an empty set, return to the step of selecting a first single word frame from the current single word frame set, until the current single word frame set is an empty set.
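The selection procedure above is standard non-maximum suppression. A minimal sketch, assuming each single word frame is an (x0, y0, x1, y1, confidence) tuple and taking intersection-over-union as the "area overlap ratio" (the description does not fix the exact overlap measure):

```python
def select_target_boxes(boxes, conf_thresh=0.5, iou_thresh=0.5):
    """Threshold by confidence, then repeatedly keep the
    highest-confidence box and drop boxes that overlap it too much."""

    def overlap_ratio(a, b):
        # intersection-over-union of two (x0, y0, x1, y1, conf) boxes
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union else 0.0

    pool = [b for b in boxes if b[4] >= conf_thresh]   # initial single word frames
    targets = []
    while pool:                                        # until the set is empty
        best = max(pool, key=lambda b: b[4])           # highest confidence
        pool = [b for b in pool
                if b is not best and overlap_ratio(b, best) <= iou_thresh]
        targets.append(best)                           # target single word frame
    return targets
```

Two heavily overlapping detections of the same character thus collapse to the single higher-confidence box, while boxes for distinct characters survive.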
An embodiment of the present application further provides a storage medium comprising stored instructions, wherein when the instructions run, the device where the storage medium is located is controlled to perform the above image text region positioning method.
An embodiment of the present application further provides an electronic device, whose structural schematic diagram is shown in fig. 7. The electronic device specifically includes a memory 701 and one or more instructions 702, where the one or more instructions 702 are stored in the memory 701 and configured to be executed by one or more processors 703 to perform the following operations:
Acquiring a text image to be positioned, and determining an image type of the text image to be positioned, wherein the image type comprises a plain text type, a text straight line staggered type, or a complex background layout type;
if the image type of the text image to be positioned is the plain text type, performing image preprocessing on the text image to be positioned, performing dilation processing on the image-preprocessed text image to be positioned to obtain a target text image, identifying each text line connected region in the target text image, determining coordinate values of a circumscribed rectangle of each text line connected region, and determining text line regions in the text image to be positioned based on the coordinate values of the circumscribed rectangles of the text line connected regions, wherein adjacent pixel points in each text line connected region have the same pixel value;
if the image type of the text image to be positioned is the text straight line staggered type, performing image preprocessing on the text image to be positioned, performing horizontal line detection and vertical line detection on the image-preprocessed text image to be positioned, determining a plurality of rectangles based on the detected horizontal lines and vertical lines, and determining text line regions in the text image to be positioned according to coordinate values of the rectangles;
if the image type of the text image to be positioned is the complex background layout type, inputting the text image to be positioned into a pre-built single word recognition model to obtain a coordinate prediction value and a confidence of the single word frame corresponding to each single word in the text image to be positioned, determining a plurality of target single word frames from the single word frames based on the confidence of each single word frame, merging target single word frames adjacent in the horizontal direction to obtain a plurality of text line connected regions, performing horizontal line detection and vertical line detection on the text image to be positioned, and determining text line regions in the text image to be positioned according to the text line connected regions and the detected horizontal lines and vertical lines.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may refer to one another. Since the apparatus embodiments are substantially similar to the method embodiments, their description is relatively brief, and relevant points can be found in the description of the method embodiments.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between such entities or operations. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present application.
The image text region positioning method and apparatus, the storage medium, and the electronic device provided by the present application are described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and core concept of the present application. Meanwhile, for those skilled in the art, changes may be made to the specific implementations and application scope according to the concept of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (9)

1. An image text region positioning method, comprising:
acquiring a text image to be positioned, and determining an image type of the text image to be positioned, wherein the image type comprises a plain text type, a text straight line staggered type, or a complex background layout type;
if the image type of the text image to be positioned is the plain text type, performing image preprocessing on the text image to be positioned, performing dilation processing on the image-preprocessed text image to be positioned to obtain a target text image, identifying each text line connected region in the target text image, determining coordinate values of a circumscribed rectangle of each text line connected region, and determining text line regions in the text image to be positioned based on the coordinate values of the circumscribed rectangles of the text line connected regions, wherein adjacent pixel points in each text line connected region have the same pixel value;
if the image type of the text image to be positioned is the text straight line staggered type, performing image preprocessing on the text image to be positioned, performing horizontal line detection and vertical line detection on the image-preprocessed text image to be positioned, determining a plurality of rectangles based on the detected horizontal lines and vertical lines, and determining text line regions in the text image to be positioned according to coordinate values of the rectangles;
if the image type of the text image to be positioned is the complex background layout type, inputting the text image to be positioned into a pre-built single word recognition model to obtain a coordinate prediction value and a confidence of the single word frame corresponding to each single word in the text image to be positioned, determining a plurality of target single word frames from the single word frames based on the confidence of each single word frame, merging target single word frames adjacent in the horizontal direction to obtain a plurality of text line connected regions, performing horizontal line detection and vertical line detection on the text image to be positioned, and determining text line regions in the text image to be positioned according to the text line connected regions and the detected horizontal lines and vertical lines;
wherein the determining a plurality of target single word frames from the single word frames based on the confidence of each single word frame comprises:
for each single word frame, if the confidence of the single word frame is not less than a preset confidence threshold, determining the single word frame as an initial single word frame;
forming a single word frame set from the initial single word frames;
selecting a first single word frame from the current single word frame set, wherein the first single word frame is the initial single word frame with the highest confidence among all initial single word frames contained in the current single word frame set;
for each initial single word frame remaining in the single word frame set, calculating an area overlap ratio between that initial single word frame and the first single word frame, and deleting that initial single word frame from the single word frame set if the area overlap ratio is greater than a preset overlap threshold;
determining the first single word frame as a target single word frame, and judging whether the current single word frame set is an empty set; and
if the current single word frame set is not an empty set, returning to the step of selecting a first single word frame from the current single word frame set, until the current single word frame set is an empty set.
2. The method according to claim 1, wherein the performing image preprocessing on the text image to be positioned comprises:
performing grayscale processing on the text image to be positioned to obtain a grayscale image;
filtering the grayscale image to obtain a filtered image;
performing adaptive binarization processing on the filtered image to obtain a binarized image; and
inverting the pixel value of each pixel point in the binarized image.
3. The method according to claim 2, wherein the filtering the grayscale image to obtain a filtered image comprises:
sliding the center of a preset filtering sliding window through each pixel point in the grayscale image; and
when the center of the filtering sliding window slides to a pixel point in the grayscale image, selecting a preset filtering formula corresponding to the noise type of the text image to be positioned, calculating a filtered gray value within the current filtering sliding window using the selected filtering formula, and taking the calculated filtered gray value as the pixel value of that pixel point.
4. The method according to claim 3, wherein the performing dilation processing on the image-preprocessed text image to be positioned to obtain a target text image comprises:
dilating the image-preprocessed text image to be positioned based on a first sliding window, wherein the width of the first sliding window is determined according to the spacing between adjacent characters in the text image to be positioned, and the height of the first sliding window is determined according to the line spacing between text lines in the text image to be positioned.
5. The method according to claim 3, wherein the performing horizontal line detection on the image-preprocessed text image to be positioned comprises:
performing erosion processing on the image-preprocessed text image to be positioned based on a preset second sliding window to obtain a first eroded image;
performing dilation processing on the first eroded image based on a preset third sliding window to obtain a first dilated image;
identifying each horizontal connected region in the first dilated image, and determining a circumscribed rectangle of each horizontal connected region; and
for each horizontal connected region, calculating the coordinates of the two endpoints of the horizontal line corresponding to the circumscribed rectangle according to the coordinates of the circumscribed rectangle of that horizontal connected region.
6. The method according to claim 3, wherein the performing vertical line detection on the image-preprocessed text image to be positioned comprises:
performing erosion processing on the image-preprocessed text image to be positioned based on a preset fourth sliding window to obtain a second eroded image;
performing dilation processing on the second eroded image based on a preset fifth sliding window to obtain a second dilated image;
identifying each vertical connected region in the second dilated image, and determining a circumscribed rectangle of each vertical connected region; and
for each vertical connected region, calculating the coordinates of the two endpoints of the vertical line corresponding to the circumscribed rectangle according to the coordinates of the circumscribed rectangle of that vertical connected region.
7. An image text region positioning apparatus, comprising:
an acquisition unit, configured to acquire a text image to be positioned and determine an image type of the text image to be positioned, wherein the image type comprises a plain text type, a text straight line staggered type, or a complex background layout type;
a first positioning unit, configured to: if the image type of the text image to be positioned is the plain text type, perform image preprocessing on the text image to be positioned, perform dilation processing on the image-preprocessed text image to be positioned to obtain a target text image, identify each text line connected region in the target text image, determine coordinate values of a circumscribed rectangle of each text line connected region, and determine text line regions in the text image to be positioned based on the coordinate values of the circumscribed rectangles of the text line connected regions, wherein adjacent pixel points in each text line connected region have the same pixel value;
a second positioning unit, configured to: if the image type of the text image to be positioned is the text straight line staggered type, perform image preprocessing on the text image to be positioned, perform horizontal line detection and vertical line detection on the image-preprocessed text image to be positioned, determine a plurality of rectangles based on the detected horizontal lines and vertical lines, and determine text line regions in the text image to be positioned according to coordinate values of the rectangles;
a third positioning unit, configured to: if the image type of the text image to be positioned is the complex background layout type, input the text image to be positioned into a pre-built single word recognition model to obtain a coordinate prediction value and a confidence of the single word frame corresponding to each single word in the text image to be positioned, determine a plurality of target single word frames from the single word frames based on the confidence of each single word frame, merge target single word frames adjacent in the horizontal direction to obtain a plurality of text line connected regions, perform horizontal line detection and vertical line detection on the text image to be positioned, and determine text line regions in the text image to be positioned according to the text line connected regions and the detected horizontal lines and vertical lines;
wherein, in determining a plurality of target single word frames from the single word frames based on the confidence of each single word frame, the third positioning unit is specifically configured to:
for each single word frame, determine the single word frame as an initial single word frame if the confidence of the single word frame is not less than a preset confidence threshold; form a single word frame set from the initial single word frames; select a first single word frame from the current single word frame set, wherein the first single word frame is the initial single word frame with the highest confidence among the initial single word frames contained in the current single word frame set; for each initial single word frame remaining in the single word frame set, calculate an area overlap ratio between that initial single word frame and the first single word frame, and delete that initial single word frame from the single word frame set if the area overlap ratio is greater than a preset overlap threshold; determine the first single word frame as a target single word frame, and judge whether the current single word frame set is an empty set; and if the current single word frame set is not an empty set, return to the step of selecting a first single word frame from the current single word frame set until the current single word frame set is an empty set.
8. A storage medium comprising stored instructions, wherein when the instructions run, a device where the storage medium is located is controlled to perform the image text region positioning method according to any one of claims 1 to 6.
9. An electronic device, comprising a memory, one or more processors, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the image text region positioning method according to any one of claims 1 to 6.
CN202011561668.5A 2020-12-25 2020-12-25 Image text area positioning method and device, storage medium and electronic device Active CN112560847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011561668.5A CN112560847B (en) 2020-12-25 2020-12-25 Image text area positioning method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011561668.5A CN112560847B (en) 2020-12-25 2020-12-25 Image text area positioning method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN112560847A CN112560847A (en) 2021-03-26
CN112560847B true CN112560847B (en) 2024-12-06

Family

ID=75032624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011561668.5A Active CN112560847B (en) 2020-12-25 2020-12-25 Image text area positioning method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN112560847B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486828B (en) * 2021-07-13 2024-04-30 杭州睿胜软件有限公司 Image processing method, device, equipment and storage medium
CN113888012B (en) * 2021-10-21 2025-04-29 郑州大学 An adaptive logistics scheduling method and system
CN114495103B (en) * 2022-01-28 2023-04-04 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and medium
CN115880704B (en) * 2023-02-16 2023-06-16 中国人民解放军总医院第一医学中心 Automatic cataloging method, system, equipment and storage medium for cases

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460927A (en) * 2020-03-17 2020-07-28 北京交通大学 Method for extracting structured information of house property certificate image

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073862B (en) * 2011-02-18 2013-04-17 山东山大鸥玛软件有限公司 Method for quickly calculating layout structure of document image
CN103034848B (en) * 2012-12-19 2016-07-06 方正国际软件有限公司 A form type recognition method
CN107633239B (en) * 2017-10-18 2020-11-03 中电鸿信信息科技有限公司 Bill classification and bill field extraction method based on deep learning and OCR
CN107862303B (en) * 2017-11-30 2019-04-26 平安科技(深圳)有限公司 Information recognition method for form images, electronic device and readable storage medium
CN109948135B (en) * 2019-03-26 2022-11-08 厦门商集网络科技有限责任公司 A method and device for normalizing images based on table features

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460927A (en) * 2020-03-17 2020-07-28 北京交通大学 Method for extracting structured information of house property certificate image

Also Published As

Publication number Publication date
CN112560847A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN112560847B (en) Image text area positioning method and device, storage medium and electronic device
CN110738125B (en) Method, device and storage medium for selecting detection frame by Mask R-CNN
EP3309703B1 (en) Method and system for decoding qr code based on weighted average grey method
CN110378297B (en) Remote sensing image target detection method and device based on deep learning and storage medium
CN104463795B (en) A dot-matrix DM two-dimensional code image processing method and device
CN110390306B (en) Method for detecting right-angle parking space, vehicle and computer readable storage medium
CN109509200A (en) Checkerboard angle point detection process, device and computer readable storage medium based on contours extract
CN110647795A (en) Form recognition method
CN109784328B (en) Method for positioning bar code, terminal and computer readable storage medium
CN107577979B (en) Method and device for quickly identifying DataMatrix type two-dimensional code and electronic equipment
CN109389110B (en) Region determination method and device
CN104966047A (en) Method and device for identifying vehicle license
CN112052782A (en) Around-looking-based parking space identification method, device, equipment and storage medium
CN109977959B (en) A kind of train ticket character area segmentation method and device
CN110674811A (en) Image recognition method and device
CN116862910B (en) Visual detection method based on automatic cutting production
CN111222507A (en) Automatic identification method of digital meter reading, computer readable storage medium
CN110796130A (en) Method, device and computer storage medium for character recognition
CN103440785A (en) Method for rapid lane departure warning
CN109472257B (en) Character layout determining method and device
Bodnár et al. A novel method for barcode localization in image domain
CN115984211A (en) Visual positioning method, robot and storage medium
CN114022856B (en) A method for identifying a drivable area on an unstructured road, an electronic device and a medium
CN113435219B (en) Anti-counterfeiting detection method and device, electronic equipment and storage medium
CN113870223B (en) Equipment screen leakage detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant