Disclosure of Invention
In order to solve the above problems in the prior art, namely that existing binarization methods give unstable accuracy and poor robustness when the quality of document pictures is uneven, a first aspect of the present invention provides a document image binarization method based on a generative adversarial network, the method comprising:
step S10, acquiring a plurality of image blocks with a preset first size from an input original document image according to a set step length to serve as a first image block set;
step S20, for the first image block set, obtaining a binary image of each image block through a first convolutional neural network to obtain a second image block set; normalizing the original document image to the first size, and acquiring a binary image of the original document image through the first convolution neural network to serve as a first binary image;
step S30, splicing the image blocks in the second image block set to obtain a second binary image; scaling the first binary image to the size of the original document image as a third binary image; acquiring a gray scale image of the original document image; combining the second binary image, the third binary image and the gray image of the original document image to obtain a three-channel image;
step S40, the three-channel image is segmented by the method of step S10 to obtain a third image block set, and a binary image of each image block is obtained through a second convolutional neural network and is used as a fourth image block set;
step S50, splicing the image blocks in the fourth image block set to obtain a final binary image of the original document image;
wherein the first convolutional neural network and the second convolutional neural network are cascaded to form the generator of a generative adversarial network, and their parameters are optimized through training.
In some preferred embodiments, the discriminator of the generative adversarial network is a patch-based fully convolutional neural network;
the first convolutional neural network and the second convolutional neural network are two semantic segmentation networks with the same structure; the first convolutional neural network is used for generating a binarized image according to the context information of a local area; and the second convolutional neural network is used for correcting the output of the first convolutional neural network according to the difference between the context information of the text and that of the background.
In some preferred embodiments, the loss function $L_{loss}$ used in training the generative adversarial network is composed of the following two terms, weighted by a coefficient $\gamma$:
$L_{cGAN}(G,D)=\mathbb{E}_{x,y}[\log D(x,y)]+\mathbb{E}_{x}[\log(1-D(x,G(x,z)))]$
$L_{L1}(G)=\mathbb{E}_{x,y}[\|y-G(x,z)\|_{1}]$
where G and D denote the generator and the discriminator of the generative adversarial network; $L_{cGAN}(G,D)$ is the adversarial loss for training the generator and the discriminator; $L_{L1}(G)$ is the L1 loss between the image generated by the generator and the real binary image; x is the input picture; z is the random noise in the generator; G(x, z) is the binarization result image generated by the generator from the input image x and the random noise z; y is the real binary image; γ is the weight coefficient balancing the two losses; and D(x, y) is the discriminator output for the input image and the real binarized sample.
In some preferred embodiments, the first convolutional neural network and the second convolutional neural network each comprise five convolutional layers and five deconvolution layers.
In some preferred embodiments, each image block in the first image block set has, at its center, an area of a preset second size that does not overlap with the other image blocks in the first image block set.
In some preferred embodiments, the first size is A × A, and the second size is B × B;
based on the upper-left point [a, b] of an image block, the upper-left points of its four adjacent image blocks are determined as follows:
the coordinates of the upper-left point of the left adjacent image block are [a - A + (B/2), b];
the coordinates of the upper-left point of the right adjacent image block are [a + A - (B/2), b];
the coordinates of the upper-left point of the upper adjacent image block are [a, b - A + (B/2)];
and the coordinates of the upper-left point of the lower adjacent image block are [a, b + A - (B/2)].
In some preferred embodiments, the first size is 256 × 256 and the second size is 128 × 128.
In a second aspect of the present invention, a document image binarization system based on a generative adversarial network is provided, which comprises a segmentation module, a first convolutional neural network processing module, a three-channel image acquisition module, a second convolutional neural network processing module and a final binary image acquisition module;
the segmentation module is configured to acquire a plurality of image blocks with a preset first size from an input document image according to a set step length and construct an image block set;
the first convolution neural network processing module is configured to acquire a first image block set from an original document image through the segmentation module, and acquire a binary image of each image block through a first convolution neural network to obtain a second image block set; normalizing the original document image to the first size, and acquiring a binary image of the original document image through the first convolution neural network to serve as a first binary image;
the three-channel image acquisition module is configured to splice all image blocks in the second image block set to obtain a second binary image; scaling the first binary image to the size of the original document image as a third binary image; acquiring a gray scale image of the original document image; combining the second binary image, the third binary image and the gray image of the original document image to obtain a three-channel image;
the second convolutional neural network processing module is configured to acquire a third image block set from the three-channel image through the segmentation module, and acquire a binary image of each image block through a second convolutional neural network to serve as a fourth image block set;
the final binary image acquisition module is configured to splice the image blocks in the fourth image block set to obtain a final binary image of the original document image;
wherein the first convolutional neural network and the second convolutional neural network are cascaded to form the generator of a generative adversarial network, and their parameters are optimized through training.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned document image binarization method based on a generative adversarial network.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above-mentioned document image binarization method based on a generative adversarial network.
The invention has the beneficial effects that:
the invention obtains binary images of higher accuracy for photographed images of various documents, with higher stability and strong robustness; in addition, by using two cascaded convolutional neural networks, it adapts well to text extraction from document images and can overcome interference from non-text noise.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
For a clearer explanation of the present invention, the following description will discuss parts of an embodiment of the present invention with reference to fig. 1 to 7.
In the invention, two convolutional neural networks are cascaded to carry out the binarization processing. To explain the invention more clearly, the construction and training of the two convolutional neural networks are described first, and the document image binarization method based on a generative adversarial network is then described on the basis of the two trained convolutional neural networks.
1. Construction and training of two convolutional neural networks
The first convolutional neural network and the second convolutional neural network are cascaded to form the generator, and the generative adversarial network is constructed around this generator.
(1) Generator
In the designed generative adversarial network, the generator is formed by cascading the first convolutional neural network and the second convolutional neural network, which are two semantic segmentation networks (U-Nets) with the same structure. Each U-Net comprises five convolutional layers and five deconvolution layers, so that the input and output pictures have the same size. The two U-Nets have the following functions: the first U-Net mainly generates a binary image according to the context information of a local area and keeps as much text detail as possible. The second U-Net corrects the result image generated by the first U-Net, based on the difference between the context information of the text and that of the background at different scales, so as to further eliminate background noise. The generator structure is shown in the middle block of fig. 3, in which G1 is the first convolutional neural network and G2 is the second convolutional neural network.
(2) Discriminator
The discriminator is a patch-based fully convolutional neural network. Its purpose is to distinguish the binarized image generated by the generator from the standard (ground-truth) binarized image. The specific network structure is shown in fig. 4. The discriminator judges the binarized picture generated by the generator and the standard binarized picture sample, each paired with the input image: the pair formed by the generated binary image and the input image is judged false, and the pair formed by the standard binary image of the original image and the input image is judged true.
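The statement that five convolutional and five deconvolution layers keep the output the same size as the input can be checked with a little arithmetic. The sketch below assumes that every convolutional layer halves the spatial size and every deconvolution layer doubles it (stride 2); the stride is an assumption, as the text does not state it, and `feature_sizes` is a hypothetical helper name.

```python
def feature_sizes(input_size=256, depth=5, stride=2):
    """List the spatial size after each layer of an encoder-decoder.

    Assumption (not stated in the text): each convolutional layer
    divides the size by `stride`, each deconvolution layer multiplies it.
    """
    sizes = [input_size]
    for _ in range(depth):          # five convolutional layers
        sizes.append(sizes[-1] // stride)
    for _ in range(depth):          # five deconvolution layers
        sizes.append(sizes[-1] * stride)
    return sizes


sizes = feature_sizes()
print(sizes)
```

Under this assumption a 256 × 256 block shrinks to 8 × 8 at the bottleneck and returns to 256 × 256 at the output.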
(3) Loss function
The loss function $L_{loss}$ used in training the generative adversarial network is composed of the following two terms, weighted by a coefficient $\gamma$:
$L_{cGAN}(G,D)=\mathbb{E}_{x,y}[\log D(x,y)]+\mathbb{E}_{x}[\log(1-D(x,G(x,z)))]$
$L_{L1}(G)=\mathbb{E}_{x,y}[\|y-G(x,z)\|_{1}]$
where G and D denote the generator and the discriminator of the generative adversarial network; $L_{cGAN}(G,D)$ is the adversarial loss for training the generator and the discriminator; $L_{L1}(G)$ is the L1 loss between the image generated by the generator and the real binary image; x is the input picture; z is the random noise in the generator; G(x, z) is the binarization result image generated by the generator from the input image x and the random noise z; y is the real binary image; γ is the weight coefficient balancing the two losses (γ is 1 in some embodiments); and D(x, y) is the discriminator output for the input image and the real binarized sample.
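As a numerical illustration of the two loss terms, the following minimal sketch evaluates them with NumPy on a small batch. The function names are hypothetical. The text names the two terms and the weight γ but does not show how they are combined; the sum `L_cGAN + γ · L_L1` used in `total_loss` is an assumption borrowed from common conditional-GAN practice, not a statement of the method itself.

```python
import numpy as np


def cgan_loss(d_real, d_fake, eps=1e-12):
    """E[log D(x, y)] + E[log(1 - D(x, G(x, z)))] over a batch.

    d_real, d_fake: discriminator outputs in (0, 1) for real and
    generated pairs; eps only guards against log(0).
    """
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))


def l1_loss(y, g):
    """E[||y - G(x, z)||_1]: L1 norm per sample, averaged over the batch."""
    n = y.shape[0]
    return np.abs(y - g).reshape(n, -1).sum(axis=1).mean()


def total_loss(d_real, d_fake, y, g, gamma=1.0):
    # ASSUMPTION: the two terms are added, with gamma weighting the L1 term.
    return cgan_loss(d_real, d_fake) + gamma * l1_loss(y, g)


d_real = np.array([0.9, 0.9])        # discriminator scores on real pairs
d_fake = np.array([0.1, 0.1])        # discriminator scores on generated pairs
y = np.zeros((2, 2, 2))              # two tiny "real" binary images
g = np.full((2, 2, 2), 0.5)          # two tiny generated images
adv = cgan_loss(d_real, d_fake)
rec = l1_loss(y, g)
print(adv, rec, total_loss(d_real, d_fake, y, g))
```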
2. The method of the invention
The document image binarization method based on a generative adversarial network according to one embodiment of the invention is shown in FIG. 1 and comprises the following steps:
Step S10, acquiring a plurality of image blocks of a preset first size from the input original document image according to a set step length, to serve as a first image block set.
In this embodiment, each image block in the first image block set has, at its center, an area of a preset second size that does not overlap with the other image blocks in the first image block set.
For example, with a first size of A × A (for example, 256 × 256) and a second size of B × B (for example, 128 × 128), the original photographed document image is cut into image blocks of size A × A according to a certain step length such that the B × B areas at the centers of the image blocks do not overlap. To achieve this, the positions of adjacent image blocks may be determined as follows:
based on the upper-left point [a, b] of an image block, the upper-left points of its four adjacent image blocks are determined:
the coordinates of the upper-left point of the left adjacent image block are [a - A + (B/2), b];
the coordinates of the upper-left point of the right adjacent image block are [a + A - (B/2), b];
the coordinates of the upper-left point of the upper adjacent image block are [a, b - A + (B/2)];
and the coordinates of the upper-left point of the lower adjacent image block are [a, b + A - (B/2)].
For example, in one embodiment, the first size is 256 × 256, the second size is 128 × 128, and the upper-left point of an image block is [a, b]; the coordinates of the upper-left point of the left adjacent image block are then [a - 256 + 64, b]; those of the right adjacent image block are [a + 256 - 64, b]; those of the upper adjacent image block are [a, b - 256 + 64]; and those of the lower adjacent image block are [a, b + 256 - 64].
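The neighbour rule amounts to moving the upper-left point by A - B/2 along one axis. A minimal sketch follows; `neighbor_top_lefts` is a hypothetical helper name, and the dictionary keys are chosen only for illustration.

```python
def neighbor_top_lefts(a, b, A=256, B=128):
    """Upper-left points of the four blocks adjacent to the block at [a, b]."""
    step = A - B // 2  # distance between adjacent blocks, e.g. 256 - 64 = 192
    return {
        "left": (a - step, b),    # [a - A + B/2, b]
        "right": (a + step, b),   # [a + A - B/2, b]
        "upper": (a, b - step),   # [a, b - A + B/2]
        "lower": (a, b + step),   # [a, b + A - B/2]
    }


neighbors = neighbor_top_lefts(192, 192)
print(neighbors)
```

With A = 256 and B = 128 the step is 192, so the block at [192, 192] has its left neighbour at [0, 192] and its lower neighbour at [192, 384].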
FIG. 2 is a schematic diagram illustrating an example of segmenting an original document image, in which bars indicate the correspondence between the positions of the original document image and the segmented image.
Step S20, for the first image block set, obtaining a binary image of each image block through a first convolutional neural network to obtain a second image block set; normalizing the original document image to the first size, and acquiring a binary image of the original document image through the first convolution neural network to serve as a first binary image.
In this embodiment, each image block in the first image block set is input into the trained first convolutional neural network to obtain an initial binarization result image for that block, yielding the second image block set; meanwhile, the whole original document image is normalized to A × A (for example, 256 × 256), and its binarization result is obtained through the first convolutional neural network as the first binary image.
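The two inputs prepared for the first convolutional neural network, the image blocks and the normalized whole image, can be sketched as follows. The text does not say how the image border is handled or which interpolation is used for normalization; edge padding and nearest-neighbour resizing are assumptions here, positions are stored as (row, column), and both function names are hypothetical.

```python
import math
import numpy as np


def split_into_blocks(image, A=256, B=128):
    """Cut `image` into A x A blocks whose upper-left points are A - B/2 apart."""
    step = A - B // 2
    h, w = image.shape[:2]
    n_rows = max(1, math.ceil((h - A) / step) + 1)
    n_cols = max(1, math.ceil((w - A) / step) + 1)
    # ASSUMPTION: pad by repeating edge pixels so the last blocks fit.
    pad_h = (n_rows - 1) * step + A - h
    pad_w = (n_cols - 1) * step + A - w
    pad = ((0, pad_h), (0, pad_w)) + ((0, 0),) * (image.ndim - 2)
    padded = np.pad(image, pad, mode="edge")
    blocks, positions = [], []
    for r in range(n_rows):
        for c in range(n_cols):
            y, x = r * step, c * step
            blocks.append(padded[y:y + A, x:x + A])
            positions.append((y, x))
    return blocks, positions


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize (ASSUMPTION: interpolation is not specified)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]


document = np.full((500, 700), 255, dtype=np.uint8)  # stand-in for a scanned page
blocks, positions = split_into_blocks(document)
normalized = resize_nearest(document, 256, 256)
print(len(blocks), positions[:3], normalized.shape)
```

Returning the positions alongside the blocks keeps the segmentation information that the later stitching steps need.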
Fig. 5 is an example of the result of stitching the binarized image blocks generated by the first convolutional neural network into the original image size in an embodiment of the present invention, where five examples (a), (b), (c), (d), and (e) are given.
Step S30, splicing the image blocks in the second image block set to obtain a second binary image; scaling the first binary image to the size of the original document image as a third binary image; acquiring a gray scale image of the original document image; and combining the second binary image, the third binary image and the gray image of the original document image to obtain a three-channel image.
In this embodiment, the input image of the second convolutional neural network consists of three channels, so the three-channel image must be prepared in this step, as follows:
stitching the image blocks in the second image block set obtained in step S20 together, using the segmentation information from step S10, to recover a preliminary binarization result of the original document image as the second binary image, which is the first channel of the input image of the second convolutional neural network;
scaling the first binary image obtained in step S20 to the size of the original document image to obtain the third binary image, which is the second channel of the input image of the second convolutional neural network;
acquiring a grayscale image of the original document image as the third channel of the input image of the second convolutional neural network;
and combining the second binary image, the third binary image and the grayscale image of the original document image to obtain the three-channel image.
Two examples of the resulting three-channel images are shown in fig. 6.
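The channel assembly described above can be sketched with NumPy. The grayscale conversion weights (0.299, 0.587, 0.114) and the nearest-neighbour scaling are assumptions, since the text specifies neither; the function names are hypothetical, and the inputs in the demo are constant arrays standing in for real network outputs.

```python
import numpy as np


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize (ASSUMPTION: interpolation is not specified)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]


def to_gray(rgb):
    """Luminance-weighted grayscale (ASSUMPTION: weights are not specified)."""
    weights = np.array([0.299, 0.587, 0.114])
    return (rgb[..., :3] @ weights).astype(np.uint8)


def build_three_channel(second_binary, first_binary, original_rgb):
    """Stack [second binary image, third binary image, grayscale image]."""
    h, w = original_rgb.shape[:2]
    third_binary = resize_nearest(first_binary, h, w)  # scale back to page size
    gray = to_gray(original_rgb)
    return np.stack([second_binary, third_binary, gray], axis=-1)


original = np.full((500, 700, 3), 200, dtype=np.uint8)     # stand-in page
second_binary = np.full((500, 700), 255, dtype=np.uint8)   # stitched block results
first_binary = np.zeros((256, 256), dtype=np.uint8)        # whole-image result
three_channel = build_three_channel(second_binary, first_binary, original)
print(three_channel.shape, three_channel.dtype)
```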
Step S40, the three-channel image is segmented by the method of step S10 to obtain a third image block set, and a binary image of each image block is obtained through the second convolutional neural network to form a fourth image block set.
Step S50, splicing the image blocks in the fourth image block set to obtain the final binary image of the original document image.
In this embodiment, each image block in the fourth image block set obtained in step S40 is combined and spliced with the information segmented in step S10, and is restored to a binarization result image corresponding to the original document image, and the image is used as a final binarization image of the original document image.
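A minimal sketch of the stitching step follows. The text does not state how pixels covered by more than one block are resolved; averaging the overlapping predictions is an assumption made here for illustration, as are the (row, column) position convention and the name `stitch_blocks`.

```python
import numpy as np


def stitch_blocks(blocks, positions, out_h, out_w):
    """Place A x A blocks at their (row, col) positions and crop to the page size.

    ASSUMPTION: where blocks overlap, their values are averaged.
    """
    A = blocks[0].shape[0]
    canvas_h = max(max(y for y, _ in positions) + A, out_h)
    canvas_w = max(max(x for _, x in positions) + A, out_w)
    acc = np.zeros((canvas_h, canvas_w), dtype=np.float64)
    cnt = np.zeros_like(acc)
    for block, (y, x) in zip(blocks, positions):
        acc[y:y + A, x:x + A] += block
        cnt[y:y + A, x:x + A] += 1
    merged = acc / np.maximum(cnt, 1)  # avoid division by zero on uncovered pixels
    return merged[:out_h, :out_w]


blocks = [np.full((256, 256), 255.0), np.full((256, 256), 255.0)]
positions = [(0, 0), (0, 192)]       # two horizontally adjacent blocks
final = stitch_blocks(blocks, positions, 256, 400)
print(final.shape, final.min(), final.max())
```

The same routine would serve for step S30, where the second image block set is stitched into the second binary image.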
The image binarization process of the present invention is also illustrated in fig. 3. The input image (original document image) is segmented to obtain a set of image blocks, scaled to obtain a normalized original image, and converted to grayscale to obtain the grayscale image of the original document image. The image blocks are binarized by G1 and the resulting binary blocks are merged to obtain image (1); the normalized original image is binarized by G1 to obtain image (2). Image (1), image (2) and the grayscale image of the original document image are combined and segmented again, the resulting blocks are binarized by G2, and the binarized blocks are merged to obtain the final binarized image.
FIG. 7 is an example of the final binary image of the original document image obtained by one embodiment of the present invention, which includes five result examples (a), (b), (c), (d), and (e), which correspond to the respective graphs in FIG. 5.
A document image binarization system based on a generative adversarial network according to an embodiment of the invention comprises a segmentation module, a first convolutional neural network processing module, a three-channel image acquisition module, a second convolutional neural network processing module and a final binary image acquisition module.
The segmentation module is configured to acquire a plurality of image blocks with a preset first size from an input document image according to a set step length, and construct an image block set.
The first convolutional neural network processing module is configured to acquire a first image block set from an original document image through the segmentation module, and acquire a binary image of each image block through a first convolutional neural network to obtain a second image block set; and to normalize the original document image to the first size and acquire a binary image of the original document image through the first convolutional neural network to serve as the first binary image.
The three-channel image acquisition module is configured to splice all image blocks in the second image block set to obtain a second binary image; scale the first binary image to the size of the original document image to obtain a third binary image; acquire a grayscale image of the original document image; and combine the second binary image, the third binary image and the grayscale image of the original document image to obtain a three-channel image.
The second convolutional neural network processing module is configured to acquire a third image block set from the three-channel image through the segmentation module, and acquire a binary image of each image block through the second convolutional neural network to form a fourth image block set.
The final binary image acquisition module is configured to splice the image blocks in the fourth image block set to obtain the final binary image of the original document image.
The first convolutional neural network and the second convolutional neural network are cascaded to form the generator of a generative adversarial network, and their parameters are optimized through training.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that the document image binarization system based on a generative adversarial network provided in the foregoing embodiment is illustrated only by the above division into functional modules. In practical applications, the functions may be allocated to different functional modules as needed; that is, the modules or steps in the embodiment of the present invention may be further decomposed or combined. For example, the modules in the foregoing embodiment may be combined into one module, or further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention serve only to distinguish the modules or steps, and are not to be construed as unduly limiting the present invention.
The storage device of an embodiment of the present invention stores a plurality of programs, and the programs are adapted to be loaded and executed by a processor to implement the above document image binarization method based on a generative adversarial network.
The processing device of one embodiment of the invention comprises a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above document image binarization method based on a generative adversarial network.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.