CN108133212B - A deep learning-based fixed invoice amount recognition system - Google Patents
- Publication number
- CN108133212B CN201810011763.4A
- Authority
- CN
- China
- Prior art keywords
- image
- module
- picture
- deep learning
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/24—Aligning, centring, orientation detection or correction of the image
- G06V10/243—Aligning, centring, orientation detection or correction of the image by compensating for image skew or non-uniform image deformations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/2163—Partitioning the feature space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Character Input (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a deep learning-based quota (fixed-amount) invoice amount recognition system comprising an image acquisition module, an image rotation module, an image recognition module and a result storage module. The image acquisition module acquires the image file; the image rotation module corrects (deskews) the image file; the image recognition module locates the region to be recognized in the image file using a deep learning model and performs image recognition; and the result storage module stores the final recognition result. The invention improves the OCR recognition rate when the image is contaminated.
Description
Technical Field
The invention relates to the technical field of image recognition, in particular to a quota invoice amount recognition system based on deep learning.
Background
The concept of OCR (Optical Character Recognition) was proposed as early as the 1920s and has long been an important research direction in the field of pattern recognition.
In recent years, with the rapid iteration of mobile devices and the rapid development of the mobile internet, OCR has gained a much wider range of application scenarios, from character recognition on scanned documents to recognition of text in natural-scene pictures, such as the characters on identification cards, bank cards, house numbers, bills and various pictures from the web.
Conventional OCR techniques proceed as follows: first text localization, then skew correction, then segmentation into single characters, recognition of each character, and finally semantic error correction based on a statistical model (such as a hidden Markov model (HMM)). This pipeline can be divided into three stages: a preprocessing stage, a recognition stage and a post-processing stage. The preprocessing stage is the key: its quality directly determines the final recognition result, so it is described in detail below.
The preprocessing stage comprises three steps:
(1) Locating the text region in the picture. Text detection is mainly based on connected-component analysis: the main idea is to quickly separate text regions from non-text regions by clustering on character color, brightness and edge information. Two popular algorithms are the Maximally Stable Extremal Regions (MSER) algorithm and the Stroke Width Transform (SWT) algorithm. In natural scenes, because of interference from illumination, picture quality and text-like backgrounds, the detection results contain a large number of non-text regions. Two main methods are currently used to pick the true text regions out of the candidates: rule-based judgment, or a lightweight neural network classifier;
(2) Correcting the text-region image, mainly by rotation and affine transformations;
(3) Extracting single characters by row and column segmentation. Exploiting the gaps between character rows and columns, the split points are found by binarization and projection. This works well when the characters are clearly distinguished from the background; but in photographed pictures, under the influence of illumination and imaging quality, when the text is hard to separate from the background, mis-segmentation often occurs.
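Step (3) above can be sketched with a simple binarize-and-project routine. A minimal NumPy sketch, assuming dark text on a light background and a fixed threshold (a real system would use an adaptive threshold such as Otsu's method); the function name is ours, not from the patent:

```python
import numpy as np

def segment_by_projection(gray, thresh=128):
    """Split a text-line image into single characters by binarization and
    vertical projection, exploiting the gaps between characters."""
    binary = gray < thresh              # dark text -> foreground
    projection = binary.sum(axis=0)     # foreground pixels per column
    boxes, in_char, start = [], False, 0
    for x, count in enumerate(projection):
        if count > 0 and not in_char:   # entering a character
            in_char, start = True, x
        elif count == 0 and in_char:    # leaving a character
            in_char = False
            boxes.append((start, x))
    if in_char:                         # character touches the right edge
        boxes.append((start, len(projection)))
    return boxes
```

This is exactly the step that fails when text and background are hard to separate: once the projection valleys between characters fill in, the split points disappear.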
Because the conventional OCR recognition framework involves so many steps, errors accumulate easily and degrade the final recognition result.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a deep learning-based quota invoice amount recognition system that improves the OCR recognition rate when the image is contaminated.
The technical scheme adopted by the invention to solve this problem is as follows: the system comprises an image acquisition module, an image rotation module, an image recognition module and a result storage module. The image acquisition module acquires the image file; the image rotation module corrects the image file; the image recognition module locates the region to be recognized in the image file using a deep learning model and performs image recognition; and the result storage module stores the final recognition result.
The image rotation module corrects the image file by combining Tesseract for coarse orientation with OpenCV rotation for fine angle adjustment.
The image rotation module extracts straight lines via the Hough transform: starting from the top pixel of each line, it computes, for several candidate angles, the distance from the corresponding origin to the line; it then traverses the pixels of the whole image, finds the most frequently repeated distance, obtains the corresponding line equation, and finally derives the rotation angle.
The image rotation module obtains the rotation angle of the image text using Tesseract.
The image recognition module comprises a sample processing unit, an image training unit and a test unit. The sample processing unit sorts the collected sample pictures and labels the picture categories, producing for each picture an xml file containing its category and position information. The image training unit adopts 24 convolutional layers and 2 fully connected layers; the convolutional layers extract features and the fully connected layers predict the results. The output of the last layer has k dimensions, where k = S × S × (B × 5 + C); k contains the category predictions and the bounding-box coordinate predictions, S is the number of grid divisions per side, B is the number of targets each grid cell is responsible for, and C is the number of categories. The test unit multiplies the category information predicted by each grid cell by the confidence information predicted for each bounding box to obtain a score for each bounding box, sets a threshold to filter out low-scoring results, and applies NMS (non-maximum suppression) to the retained results to obtain the final detection result.
Advantageous effects
Due to the adoption of the above technical scheme, compared with the prior art, the invention has the following advantages and positive effects: compared with the conventional OCR recognition framework, it has fewer steps, which reduces the influence of error accumulation on the final recognition result. The invention combines deep learning with OCR image recognition, greatly improves the OCR recognition rate when the image is contaminated, and is convenient to operate. Applied in the accounting field, the system can improve accountants' working efficiency and free them from tedious work.
Drawings
FIG. 1 is a block diagram of the system of the present invention;
FIG. 2 is an internal structural view of the present invention;
fig. 3A-3B are graphs of recognition results after an embodiment of the present invention is employed.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to a quota invoice amount recognition system based on deep learning, which comprises an image acquisition module, an image rotation module, an image recognition module and a result storage module, wherein the image acquisition module is used for acquiring an image file; the image rotation module is used for correcting the picture file; the image recognition module obtains the specific position of the image file to be recognized by using a deep learning model and performs image recognition; and the result storage module is used for storing the final recognition result.
As shown in fig. 2, the present embodiment identifies the amount of money, the invoice code, and the invoice number on a quota invoice scanned by the customer. Since the picture uploaded by the client may be tilted or inverted, a rotation step is added to facilitate later recognition, and the rotated picture is passed to OCR to obtain the above fields.
Image rotation is needed because some pictures uploaded from the user's scan are tilted or inverted. The image rotation module in this embodiment corrects the picture by combining Tesseract for coarse orientation with OpenCV rotation for small-angle adjustment.
OpenCV rotation adjustment
This embodiment mainly uses OpenCV line extraction and then obtains the inclination angle of the extracted line; the line extraction method is the Hough transform.
For any point O(x, y) in the rectangular coordinate system, any straight line through O satisfies y = kx + b, except lines perpendicular to the x axis, which have no finite slope. Because of this special case, it is convenient to convert to a polar coordinate system.
In the polar coordinate system, any straight line can be represented as ρ = x·cos θ + y·sin θ.
Suppose there is a straight line in a 10 × 10 image. Starting from the top pixel of the line, compute the distances from the corresponding origin to the line for angles of 180°, 135°, 90°, 45° and 0°. Repeat this for every pixel of the image, find the most frequently repeated distance, obtain the corresponding line equation, and thereby obtain the angle.
When several straight lines are found in one picture, the most frequent angle is taken as the rotation angle of the picture.
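The voting procedure described above can be sketched in a few lines. A minimal NumPy sketch restricted to the coarse angle set mentioned in the text (0°, 45°, 90°, 135°, 180°); a real implementation such as OpenCV's `cv2.HoughLines` sweeps a much finer angle grid:

```python
import numpy as np

def hough_dominant_line(binary):
    """For every foreground pixel, compute rho = x*cos(theta) + y*sin(theta)
    for each candidate angle, then pick the (angle, rho) pair that repeats
    most often -- that pair is the equation of the dominant straight line."""
    votes = {}
    ys, xs = np.nonzero(binary)
    for deg in (0, 45, 90, 135, 180):
        theta = np.deg2rad(deg)
        rhos = np.round(xs * np.cos(theta) + ys * np.sin(theta)).astype(int)
        for rho in rhos:
            votes[(deg, rho)] = votes.get((deg, rho), 0) + 1
    return max(votes, key=votes.get)   # (angle in degrees, distance rho)
```

With the winning (θ, ρ) pair known, the deviation of θ from the expected text direction gives the small rotation angle to undo.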
Tesseract rotation
Tesseract is an OCR engine developed by Ray Smith at Hewlett-Packard Laboratories between 1985 and 1995; it ranked among the leading engines in the 1995 UNLV accuracy test, but development essentially stopped after 1996. In 2006 Google invited Smith to join and restarted the project, which is now licensed under Apache 2.0. The project currently supports mainstream platforms such as Windows, Linux and macOS, but as an engine it only provides a command-line tool.
Tesseract can recognize most written languages (including Chinese) and can return both the text content of a picture and the rotation of its text (270°, 180°, 90° or 0°). Because its recognition accuracy is not high, this embodiment uses Tesseract only to obtain the rotation angle of the image text. Tesseract accepts only grayscale images, so a color input image must first be converted to grayscale.
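This orientation query can be sketched with the pytesseract wrapper. A sketch under the assumption that the tesseract binary and pytesseract are installed; the helper names are ours, not from the patent:

```python
def parse_osd_rotation(osd_output: str) -> int:
    """Extract the 'Rotate: N' value from Tesseract's OSD (orientation
    and script detection) output; N is 0, 90, 180 or 270."""
    for line in osd_output.splitlines():
        if line.startswith("Rotate:"):
            return int(line.split(":", 1)[1])
    return 0

def detect_text_rotation(gray_image):
    """Ask Tesseract for the coarse text orientation of a grayscale image."""
    import pytesseract  # deferred import: only needed when actually called
    return parse_osd_rotation(pytesseract.image_to_osd(gray_image))
```

The returned angle can then be undone with an OpenCV rotation before the fine Hough-based adjustment described earlier.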
In this embodiment, the image recognition module recognizes with a deep learning method, here the deep learning object detection method YOLO (You Only Look Once).
The idea of YOLO: the whole image is taken as the input of the network, and the position of each bounding box and the category it belongs to are regressed directly in the output layer, converting the object detection problem into a regression problem.
1. Sample processing:
The collected sample pictures are sorted, and the picture categories are labeled with the labelme software to obtain the corresponding xml files, which contain the category information and the positions of the objects in the pictures.
2. Image training:
First, the picture is normalized to 448 × 448 and divided into 7 × 7 grid cells; if the center of an object falls into a grid cell, that cell is responsible for predicting the object.
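The responsibility rule can be made concrete with a tiny helper. A sketch assuming the 448 × 448 input and 7 × 7 grid above; the function name is ours:

```python
def responsible_cell(cx, cy, img_size=448, S=7):
    """Return the (row, col) of the grid cell responsible for an object
    whose center is at pixel (cx, cy): the cell the center falls into."""
    cell = img_size / S          # 64 pixels per cell for 448 / 7
    return int(cy // cell), int(cx // cell)
```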
CNN feature extraction and prediction: the convolutional layers extract features; the fully connected part makes the predictions. The final layer output has k dimensions, where

k = S × S × (B × 5 + C)    (1)

k contains the class predictions and the bounding-box coordinate predictions. S is the number of grid cells per side, B is the number of targets each grid cell is responsible for, and C is the number of categories. The 5 values per box are the predicted center coordinates, the width and height, and the box confidence.
The box confidence is defined as Pr(Object) × IOU(pred, truth): the first factor is 1 if a ground-truth box (a manually labeled object) falls in the grid cell and 0 otherwise; the second factor is the IOU between the predicted bounding box and the actual ground-truth box.
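Plugging in YOLOv1's usual hyper-parameters (an assumption here: S = 7, B = 2, C = 20; the patent does not state its own B and C) shows the size of the output layer implied by equation (1):

```python
# Equation (1): each of the S*S cells predicts B boxes (x, y, w, h,
# confidence -- 5 numbers each) plus C class probabilities.
S, B, C = 7, 2, 20
k = S * S * (B * 5 + C)
print(k)   # 1470, the dimensionality of the final layer for these values
```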
The network structure follows GoogLeNet: 24 convolutional layers and 2 fully connected layers, except that GoogLeNet's inception modules are replaced with 1 × 1 reduction layers followed by 3 × 3 convolutional layers.
The design goal of the loss function is to balance the coordinate (x, y, w, h), confidence and classification terms.
For bbox predictions of different sizes, a given deviation is less tolerable for a small box than for a large box, yet the same offset contributes the same amount to the total weighted loss. To alleviate this problem, this embodiment replaces the raw box width and height with their square roots.
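A short numeric check illustrates why the square root helps (the numbers are illustrative, not from the patent):

```python
import math

def wh_loss_term(pred, truth):
    """Squared error on the square roots of width/height, as YOLO's loss
    uses, instead of on the raw values."""
    return (math.sqrt(pred) - math.sqrt(truth)) ** 2

# The same 2-pixel error is penalized more on a 10-px box than a 100-px box:
small = wh_loss_term(12, 10)     # ~0.091
large = wh_loss_term(102, 100)   # ~0.010
assert small > large
```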
Each grid cell predicts several bounding boxes, but during training we want only one bounding box predictor to be responsible for each object (one ground-truth box, one bbox). Specifically, the bounding box with the largest IOU with the ground-truth box is made responsible for predicting that object. This practice is called specialization of the bounding box predictors: each predictor becomes better and better at predicting objects of particular sizes or classes.
3. Test module:
At test time, the class probability Pr(Class_i | Object) predicted by each grid cell is multiplied by the confidence predicted for each bounding box to obtain a score for that box. After the best score of each bbox is obtained, a threshold is set to filter out low-scoring boxes, and NMS is applied to the retained boxes to obtain the final detection result. Figs. 3A-3B show recognition results after the invention is employed.
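The test-time filtering just described (score = class probability × box confidence, thresholding, then non-maximum suppression) can be sketched as follows; the threshold values are illustrative assumptions, not values from the patent:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_and_nms(boxes, class_probs, confidences,
                   score_thresh=0.2, iou_thresh=0.5):
    """Score = class probability * box confidence; drop low scores,
    then greedy NMS on the survivors. Returns indices of kept boxes."""
    scores = np.asarray(class_probs) * np.asarray(confidences)
    order = [i for i in np.argsort(-scores) if scores[i] >= score_thresh]
    keep = []
    while order:
        best = order.pop(0)                 # highest-scoring remaining box
        keep.append(best)
        order = [i for i in order            # suppress heavy overlaps
                 if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```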
Compared with the conventional OCR recognition framework, the invention has fewer steps, which reduces the influence of error accumulation on the final recognition result. The invention combines deep learning with OCR image recognition, greatly improves the OCR recognition rate when the image is contaminated, and is convenient to operate. Applied in the accounting field, the system can improve accountants' working efficiency and free them from tedious work.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810011763.4A CN108133212B (en) | 2018-01-05 | 2018-01-05 | A deep learning-based fixed invoice amount recognition system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810011763.4A CN108133212B (en) | 2018-01-05 | 2018-01-05 | A deep learning-based fixed invoice amount recognition system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108133212A CN108133212A (en) | 2018-06-08 |
CN108133212B true CN108133212B (en) | 2021-06-29 |
Family
ID=62399437
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810011763.4A Expired - Fee Related CN108133212B (en) | 2018-01-05 | 2018-01-05 | A deep learning-based fixed invoice amount recognition system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108133212B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109086756B (en) * | 2018-06-15 | 2021-08-03 | 众安信息技术服务有限公司 | Text detection analysis method, device and equipment based on deep neural network |
CN109002768A (en) * | 2018-06-22 | 2018-12-14 | 深源恒际科技有限公司 | Medical bill class text extraction method based on the identification of neural network text detection |
CN109816118B (en) * | 2019-01-25 | 2022-12-06 | 上海深杳智能科技有限公司 | A method and terminal for creating structured documents based on deep learning model |
CN109886257B (en) * | 2019-01-30 | 2022-10-18 | 四川长虹电器股份有限公司 | Method for correcting invoice image segmentation result by adopting deep learning in OCR system |
CN109993160B (en) * | 2019-02-18 | 2022-02-25 | 北京联合大学 | Image correction and text and position identification method and system |
CN109948617A (en) * | 2019-03-29 | 2019-06-28 | 南京邮电大学 | An Invoice Image Positioning Method |
WO2020223859A1 (en) * | 2019-05-05 | 2020-11-12 | 华为技术有限公司 | Slanted text detection method, apparatus and device |
CN110348346A (en) * | 2019-06-28 | 2019-10-18 | 苏宁云计算有限公司 | A kind of bill classification recognition methods and system |
CN110781726A (en) * | 2019-09-11 | 2020-02-11 | 深圳壹账通智能科技有限公司 | Image data identification method and device based on OCR (optical character recognition), and computer equipment |
CN111160395A (en) * | 2019-12-05 | 2020-05-15 | 北京三快在线科技有限公司 | Image recognition method and device, electronic equipment and storage medium |
CN111401371B (en) * | 2020-06-03 | 2020-09-08 | 中邮消费金融有限公司 | Text detection and identification method and system and computer equipment |
CN112464872A (en) * | 2020-12-11 | 2021-03-09 | 广东电网有限责任公司 | Automatic extraction method and device based on NLP (non-line segment) natural language |
CN112686319B (en) * | 2020-12-31 | 2025-02-11 | 南京太司德智能电气有限公司 | A method for merging power signal model training files |
CN113159086B (en) * | 2020-12-31 | 2024-04-30 | 南京太司德智能电气有限公司 | Efficient electric power signal description model training method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103617415A (en) * | 2013-11-19 | 2014-03-05 | 北京京东尚科信息技术有限公司 | Device and method for automatically identifying invoice |
CN104573688A (en) * | 2015-01-19 | 2015-04-29 | 电子科技大学 | Mobile platform tobacco laser code intelligent identification method and device based on deep learning |
CN106096607A (en) * | 2016-06-12 | 2016-11-09 | 湘潭大学 | A kind of licence plate recognition method |
CN107341523A (en) * | 2017-07-13 | 2017-11-10 | 浙江捷尚视觉科技股份有限公司 | Express delivery list information identifying method and system based on deep learning |
CN107358232A (en) * | 2017-06-28 | 2017-11-17 | 中山大学新华学院 | Invoice recognition methods and identification and management system based on plug-in unit |
-
2018
- 2018-01-05 CN CN201810011763.4A patent/CN108133212B/en not_active Expired - Fee Related
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103617415A (en) * | 2013-11-19 | 2014-03-05 | 北京京东尚科信息技术有限公司 | Device and method for automatically identifying invoice |
CN104573688A (en) * | 2015-01-19 | 2015-04-29 | 电子科技大学 | Mobile platform tobacco laser code intelligent identification method and device based on deep learning |
CN106096607A (en) * | 2016-06-12 | 2016-11-09 | 湘潭大学 | A kind of licence plate recognition method |
CN107358232A (en) * | 2017-06-28 | 2017-11-17 | 中山大学新华学院 | Invoice recognition methods and identification and management system based on plug-in unit |
CN107341523A (en) * | 2017-07-13 | 2017-11-10 | 浙江捷尚视觉科技股份有限公司 | Express delivery list information identifying method and system based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN108133212A (en) | 2018-06-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108133212B (en) | A deep learning-based fixed invoice amount recognition system | |
CN109086714B (en) | Form recognition method, recognition system and computer device | |
CN111611643B (en) | Household vectorization data acquisition method and device, electronic equipment and storage medium | |
Huang et al. | Robust scene text detection with convolution neural network induced mser trees | |
CN105046252B (en) | A kind of RMB prefix code recognition methods | |
WO2022121039A1 (en) | Bankcard tilt correction-based detection method and apparatus, readable storage medium, and terminal | |
CN111461134A (en) | A low-resolution license plate recognition method based on generative adversarial network | |
CN108647681A (en) | A kind of English text detection method with text orientation correction | |
CN104951940B (en) | A kind of mobile payment verification method based on personal recognition | |
CN107103317A (en) | Fuzzy license plate image recognition algorithm based on image co-registration and blind deconvolution | |
US20100189316A1 (en) | Systems and methods for graph-based pattern recognition technology applied to the automated identification of fingerprints | |
CN111783757A (en) | An ID card identification method based on OCR technology in complex scenarios | |
CN108446699A (en) | Identity card pictorial information identifying system under a kind of complex scene | |
US20040086153A1 (en) | Methods and systems for recognizing road signs in a digital image | |
CN106529532A (en) | License plate identification system based on integral feature channels and gray projection | |
CN111695373B (en) | Zebra stripes positioning method, system, medium and equipment | |
CN106169080A (en) | A kind of combustion gas index automatic identifying method based on image | |
CN116503622A (en) | Data acquisition and reading method based on computer vision image | |
CN108681735A (en) | Optical character recognition method based on convolutional neural networks deep learning model | |
CN108319958A (en) | A kind of matched driving license of feature based fusion detects and recognition methods | |
CN116740758A (en) | Bird image recognition method and system for preventing misjudgment | |
CN114359538A (en) | A method for locating and identifying water meter readings | |
Wang et al. | Scene text recognition via gated cascade attention | |
CN115908774B (en) | Quality detection method and device for deformed materials based on machine vision | |
CN118537600A (en) | Data acquisition and reading method based on computer vision image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210629 |