Disclosure of Invention
Embodiments of the invention provide a text recognition method and equipment based on a custom universal template, which are used to solve the problem of poor image recognition performance in the prior art.
The text recognition method based on the custom universal template comprises the following steps:
preparing a template picture, and acquiring contents and coordinates of a plurality of reference areas and a plurality of identification areas in the template picture;
acquiring the content and coordinates of all content areas in a target picture, wherein the type of the target picture is the same as that of the template picture;
correcting the target picture based on the reference region;
and determining a region to be identified of the corrected target picture based on the corrected target picture and in combination with the coordinates of the identification region so as to identify the content of the region to be identified.
According to some embodiments of the present invention, the obtaining contents and coordinates of a plurality of reference regions and a plurality of identification regions in the template picture includes:
performing frame selection on a plurality of reference areas in the template picture, and acquiring the content and coordinates of each reference area;
and selecting a plurality of identification areas in the template picture, and acquiring the content and the coordinates of each identification area.
According to some embodiments of the present invention, the framing the plurality of reference regions in the template picture includes:
performing frame selection on at least four reference areas in the template picture, wherein the contents of the reference areas are different from one another, and at least one of the reference areas is located in an edge area of the template picture.
According to some embodiments of the present invention, the obtaining the content and the coordinates of all the content areas in the target picture includes:
detecting and identifying the content areas in the target picture by using a DB algorithm and a CRNN-CTC algorithm.
According to some embodiments of the present invention, the correcting the target picture based on the reference region comprises:
performing fuzzy matching between the contents of the content areas and the contents of the reference areas to screen out first-class content areas, i.e. content areas that can be matched with the content of a reference area, and correcting the target picture by a perspective transformation method based on the first-class content areas and the reference areas matched with them.
According to some embodiments of the present invention, the correcting the target picture by using a perspective transformation method based on the first-class content areas and the reference areas matched with them includes:
acquiring a first polygon frame which is constructed from the first-class content areas and has the largest area, and determining a second polygon frame corresponding to the template picture according to the first polygon frame;
and performing perspective transformation on the second polygon frame and the first polygon frame to obtain a perspective matrix, and correcting the target picture based on the perspective matrix.
According to some embodiments of the present invention, the determining, according to the first polygon frame, a second polygon frame corresponding to the template picture includes:
obtaining the coordinates of each vertex of the first polygon frame, and determining the first-class content area corresponding to each vertex according to those coordinates;
acquiring, based on the matching relation between the contents of the content areas and the contents of the reference areas, the content of the first-class reference area matched with the content of the first-class content area corresponding to each vertex;
and determining the coordinates of the first-class reference areas based on their contents, and acquiring a second polygon frame which is constructed from the first-class reference areas and has the largest area.
According to some embodiments of the present invention, the determining, based on the corrected target picture and in combination with the coordinates of the identification area, an area to be identified of the corrected target picture includes:
identifying content areas in the corrected target picture, and acquiring coordinates of each content area in the corrected target picture;
calculating the intersection over union (IOU) between the coordinates of each content area of the corrected target picture and the coordinates of all the identification areas, and taking the identification area corresponding to the maximum IOU as the identification area matched with that content area;
and determining the area to be identified by combining the end coordinates of the content area of the corrected target picture and the start coordinates of the identification area matched with it.
According to some embodiments of the invention, the method further comprises:
before correcting the target picture, rotating the target picture to obtain a correct target picture.
The text recognition equipment based on the custom universal template comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the text recognition method based on the custom universal template as described in any one of the above.
By adopting the embodiments of the invention, a large amount of formatted data can be processed effectively; the method has good universality, a simple principle, and convenient operation, can accurately identify the required content, and improves identification efficiency.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
An embodiment of a first aspect of the present invention provides a text recognition method based on a custom generic template, as shown in fig. 1, including:
S1, preparing a template picture, and acquiring the contents and coordinates of a plurality of reference areas and a plurality of identification areas in the template picture. The identification areas may correspond to the areas to be identified in the target picture, that is, the areas in which the target picture differs from the template picture, and the reference areas may be areas in which the template picture and the target picture are identical. For example, in the template picture of an identity card, the areas where the words "name" and "identity card" are located can be used as reference areas, and the areas holding the content following "name" and the content following "identity card" can be used as identification areas.
S2, acquiring the content and coordinates of all content areas in a target picture, wherein the type of the target picture is the same as that of the template picture. It can be understood that if the target picture to be identified is an identity card, a template picture of an identity card is obtained; if the target picture to be identified is an invoice, a template picture of an invoice is obtained; if the target picture to be identified is a value-added tax invoice, a template picture of a value-added tax invoice is obtained; and if the target picture to be identified is a contract, a template picture of a contract is obtained.
S3, correcting the target picture based on the reference region; for example, the size of the target picture may be adjusted with reference to the reference region.
And S4, determining the area to be identified of the corrected target picture based on the corrected target picture and combining the coordinates of the identification area so as to identify the content of the area to be identified.
By adopting the embodiments of the invention, a template picture corresponding to the type of the target picture to be identified is constructed, the target picture is corrected based on the reference areas of the template picture, and recognition is then performed on the corrected target picture. The method can effectively process a large amount of formatted data, overcomes interference from factors such as illumination, photographing angle, occlusion, and information loss, has good universality, can accurately identify the required content, and improves identification efficiency.
On the basis of the above-described embodiment, various modified embodiments are further proposed, and it is to be noted herein that, in order to make the description brief, only the differences from the above-described embodiment are described in the various modified embodiments.
According to some embodiments of the invention, a plurality of types of template pictures, including an identity card template picture, an invoice template picture, a contract template picture, a form template picture, and a value-added tax invoice template picture, can be prepared in advance; after the target picture is obtained, the template picture of the corresponding type is selected according to the type of the target picture.
According to some embodiments of the present invention, the obtaining contents and coordinates of a plurality of reference regions and a plurality of identification regions in the template picture includes:
performing frame selection on a plurality of reference areas in the template picture, and acquiring the content and coordinates of each reference area;
and selecting a plurality of identification areas in the template picture, and acquiring the content and the coordinates of each identification area. For example, the above-described framing process may be performed manually by means of a marking tool.
In the embodiment of the present invention, a quadrilateral frame may be used for frame selection of all the regions.
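As an illustration, the labeled template could be stored in a structure such as the following sketch; the field names and coordinate values are hypothetical and are only meant to show what the labeling tool needs to record.

```python
# Hypothetical storage format for a labeled template picture. Each region
# records its text content and the four corner points of its quadrilateral
# frame in pixel coordinates, ordered top-left, top-right, bottom-right,
# bottom-left; all names and values below are illustrative only.
template = {
    "type": "id_card",
    "reference_areas": [
        # fixed printed labels: identical in template and target pictures
        {"content": "Name",      "box": [[60, 40],  [140, 40],  [140, 70],  [60, 70]]},
        {"content": "Sex",       "box": [[60, 100], [120, 100], [120, 130], [60, 130]]},
        {"content": "Address",   "box": [[60, 180], [160, 180], [160, 210], [60, 210]]},
        {"content": "ID number", "box": [[60, 300], [200, 300], [200, 330], [60, 330]]},
    ],
    "identification_areas": [
        # variable fields: where the target picture differs from the template
        {"name": "name_value", "box": [[150, 40], [420, 40], [420, 70], [150, 70]]},
    ],
}
```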
According to some embodiments of the present invention, the framing the plurality of reference regions in the template picture includes:
performing frame selection on at least four reference areas in the template picture, wherein the contents of the reference areas are different from one another, and at least one of the reference areas is located in an edge area of the template picture.
Therefore, when the target picture is corrected based on the reference region, the correction effectiveness can be improved.
According to some embodiments of the present invention, the obtaining the content and the coordinates of all the content areas in the target picture includes:
detecting and identifying the content areas in the target picture by using a DB algorithm and a CRNN-CTC algorithm.
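One possible realization of this detection and recognition step is sketched below with the open-source PaddleOCR toolkit, whose default pipeline pairs a DB detector with a CRNN-CTC recognizer; the invention does not mandate any particular library, and the exact result layout may differ between PaddleOCR versions.

```python
# A sketch of the detection/recognition step using the open-source PaddleOCR
# toolkit (DB detector + CRNN-CTC recognizer by default). One possible
# realization, not the only one; the result layout may vary by version.
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="ch")  # angle classifier aids rotation

def detect_and_recognize(image_path):
    """Return every content area of the picture with its text and frame."""
    result = ocr.ocr(image_path, cls=True)
    areas = []
    for box, (text, score) in result[0]:  # one entry per detected text line
        areas.append({"content": text, "box": box, "score": score})
    return areas
```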
According to some embodiments of the present invention, the correcting the target picture based on the reference region comprises:
performing fuzzy matching between the contents of the content areas and the contents of the reference areas to screen out first-class content areas, i.e. content areas that can be matched with the content of a reference area, and correcting the target picture by a perspective transformation method based on the first-class content areas and the reference areas matched with them.
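A minimal sketch of this fuzzy-matching screen, using only Python's standard difflib, is given below; the 0.8 similarity threshold is an assumption, not a value fixed by the invention.

```python
# A sketch of screening first-class content areas by fuzzy matching.
# The 0.8 threshold is an assumed cut-off.
from difflib import SequenceMatcher

def screen_first_class(content_areas, reference_areas, threshold=0.8):
    """Pair each content area with its best-matching reference area,
    keeping only pairs whose text similarity clears the threshold."""
    def similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()

    pairs = []
    for area in content_areas:
        best = max(reference_areas,
                   key=lambda ref: similarity(area["content"], ref["content"]))
        if similarity(area["content"], best["content"]) >= threshold:
            pairs.append((area, best))  # a first-class content area
    return pairs
```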
According to some embodiments of the present invention, the correcting the target picture by using a perspective transformation method based on the first-class content areas and the reference areas matched with them includes:
acquiring a first polygon frame which is constructed from the first-class content areas and has the largest area, as shown in fig. 5, and determining a second polygon frame corresponding to the template picture according to the first polygon frame;
and performing perspective transformation on the second polygon frame and the first polygon frame to obtain a perspective matrix, and correcting the target picture based on the perspective matrix, as shown in fig. 6.
According to some embodiments of the present invention, the determining, according to the first polygon frame, a second polygon frame corresponding to the template picture includes:
obtaining the coordinates of each vertex of the first polygon frame, and determining the first-class content area corresponding to each vertex according to those coordinates;
acquiring, based on the matching relation between the contents of the content areas and the contents of the reference areas, the content of the first-class reference area matched with the content of the first-class content area corresponding to each vertex;
and determining the coordinates of the first-class reference areas based on their contents, and acquiring a second polygon frame which is constructed from the first-class reference areas and has the largest area.
In some embodiments of the present invention, each of the first polygon frame and the second polygon frame may be a quadrangular frame.
According to some embodiments of the present invention, determining the region to be identified of the corrected target picture in combination with the coordinates of the identification region based on the corrected target picture includes:
identifying content areas in the corrected target picture, and acquiring coordinates of each content area in the corrected target picture;
calculating the intersection over union (IOU) between the coordinates of each content area of the corrected target picture and the coordinates of all the identification areas, and taking the identification area corresponding to the maximum IOU as the identification area matched with that content area;
and determining the area to be identified by combining the end coordinates of the content area of the corrected target picture and the start coordinates of the identification area matched with it, as shown in fig. 7. The end coordinates here may be understood as the end position of the text sequence, for example the coordinates of the right edge of the content area frame; the start coordinates may be understood as the start position of the text sequence, for example the coordinates of the left edge of the identification area frame.
According to some embodiments of the invention, the method further comprises:
before correcting the target picture, rotating the target picture to obtain a correct target picture.
In the present invention, all of the recognition processes involved may be completed using OCR technology. The coordinates referred to herein are the coordinates of the frame of the respective area.
The following describes in detail a text recognition method based on a custom generic template according to an embodiment of the present invention in a specific embodiment with reference to fig. 2 to 7. It is to be understood that the following description is illustrative only and is not intended to be in any way limiting. All similar structures and similar variations thereof adopted by the invention are intended to fall within the scope of the invention.
Text obtained by scanning or photographing in a real scene differs from ideal text. A mobile phone photograph, for example, is affected by environmental factors such as photographing angle, illumination, camera resolution, and camera shake; at the same time, the text may suffer interference from a complex background, such as inconsistent size, font, and color, missing or occluded information, document folding, and low resolution.
In order to process a large amount of formatted data effectively and to overcome interference from factors such as illumination, photographing angle, occlusion, and information loss, the embodiment of the invention provides a text recognition method based on a custom universal template.
Specifically, as shown in fig. 2, the text recognition method based on the customized universal template according to the embodiment of the present invention includes:
Step 1, selecting at least four different reference fields (also called reference areas) on the template picture, wherein each reference field is unique within the template, and storing the frame coordinates and content information of the reference fields through marking software.
The template picture types are diverse (for example, identity cards, invoices, contracts, forms, and value-added tax invoices); each type of target picture uses a template picture of the corresponding type, and the template picture should be complete, correct, and clear. The reference fields are marked manually with a marking tool, and each selected reference field must be unique to guarantee unambiguous matching. To ensure that the subsequent perspective transformation corrects the picture properly, at least four reference fields are selected, distributed as close to the periphery of the text as possible.
Step 2, framing the identification areas on the template picture, and storing the frame coordinates and content information of the identification areas through the labeling software.
Step 3, judging whether the target picture needs to be rotated, and if so, correcting it with a rotation matrix.
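A minimal sketch of this step with OpenCV follows; the rotation angle is assumed to be supplied by an upstream orientation classifier (for example, an OCR angle classifier) and is not computed here.

```python
# A sketch of step 3: correcting a rotated target picture with a rotation
# matrix. The angle is assumed to come from an orientation classifier.
import cv2

def rotate_upright(img, angle_degrees):
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_degrees, 1.0)
    return cv2.warpAffine(img, m, (w, h))  # rotated (corrected) picture
```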
Step 4, performing text detection with a DB (Differentiable Binarization) algorithm and text recognition with a CRNN-CTC (Convolutional Recurrent Neural Network with Connectionist Temporal Classification) algorithm to obtain all text contents in the target picture together with the coordinate information of the corresponding region frames; performing fuzzy matching between the recognized contents and the contents of the reference fields from step 1 to find the coordinates of all text contents matched with the reference fields; finding, by maximum quadrangle area, the largest quadrangle formed in the target picture by the matched text contents; finding, from the matched reference fields, the corresponding largest quadrangle on the template picture; and correcting the target picture by perspective transformation. As shown in fig. 3, the specific process is as follows:
4-1, detecting and identifying the target picture by using the DB and CRNN-CTC algorithms to obtain the recognition content and coordinate information;
4-2, performing fuzzy matching between the recognition content obtained in step 4-1 and the content of the reference fields from step 1 (the higher the matching degree, the closer the recognized content is to the reference field), and obtaining the coordinates, in the target picture, of the recognized content matched with the reference fields of step 1;
4-3, finding, by maximum quadrangle area, the largest quadrangle on the target picture covering the matched recognition content, and acquiring the coordinates of its four vertices. Because the target picture may be occluded or incomplete, it is difficult to find four corner points on the picture edges for the perspective transformation, so a new quadrangle is formed from the reference fields to obtain the four point coordinates. Because a quadrangle that is too small makes the projective transformation inaccurate, the selected reference fields should be distributed as widely as possible, and the largest quadrangle is finally taken. The largest quadrangle here may be understood as covering the entire recognition content, as shown in fig. 5.
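The maximum-quadrangle search can be sketched as below: every four-point combination of matched points is scored with the shoelace formula and the largest one is kept. Using one representative point per matched content area (its frame center) is an assumption made for brevity.

```python
# A sketch of step 4-3: among the matched points, choose the four whose
# quadrangle has maximum area. One representative point per matched
# content area (e.g., its frame center) is assumed.
from itertools import combinations
from math import atan2

def ordered(pts):
    """Order points around their centroid so the shoelace formula applies."""
    cx = sum(x for x, _ in pts) / len(pts)
    cy = sum(y for _, y in pts) / len(pts)
    return sorted(pts, key=lambda p: atan2(p[1] - cy, p[0] - cx))

def area(quad):
    """Shoelace formula for a polygon given in angular order."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(quad, quad[1:] + quad[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def largest_quadrangle(points):
    """Four points (in angular order) spanning the maximum-area quadrangle."""
    best = max(combinations(points, 4), key=lambda q: area(ordered(list(q))))
    return ordered(list(best))
```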
4-4, according to the four vertex coordinates of the largest quadrangle obtained on the target picture in step 4-3, finding the corresponding recognition contents, thereby finding the reference field contents matched with them, further finding the coordinates of those reference fields on the template picture, and thus obtaining the largest quadrangle on the template picture together with the coordinates of its four vertices;

4-5, performing perspective transformation between the largest quadrangle obtained in step 4-3 and the largest quadrangle obtained in step 4-4 to obtain a perspective matrix, and correcting the target picture. Specifically, the four vertex coordinates $R = (x_0,y_0),(x_1,y_1),(x_2,y_2),(x_3,y_3)$ obtained in step 4-3 and the four vertex coordinates $S = (X_0,Y_0),(X_1,Y_1),(X_2,Y_2),(X_3,Y_3)$ obtained in step 4-4 yield a perspective transformation matrix $A$ according to equations 1 and 2; the corrected picture is then obtained by applying the computed matrix $A$ to the target picture according to equation 3, as shown in fig. 6.

In equation 1, each vertex of $R$ is mapped in homogeneous coordinates:

$$\begin{pmatrix} x_i' \\ y_i' \\ w_i \end{pmatrix} = A \begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix}, \qquad A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \tag{1}$$

In equation 2, normalization by $w_i$ ties the mapped vertices of $R$ to the corresponding vertices of $S$, giving eight constraints that determine $A$ up to scale (conventionally $a_{33} = 1$):

$$X_i = \frac{a_{11} x_i + a_{12} y_i + a_{13}}{a_{31} x_i + a_{32} y_i + a_{33}}, \qquad Y_i = \frac{a_{21} x_i + a_{22} y_i + a_{23}}{a_{31} x_i + a_{32} y_i + a_{33}}, \qquad i = 0, 1, 2, 3 \tag{2}$$

In equation 3, $A$ is applied in the same way to every pixel $(u, v)$ of the target picture to obtain the corrected picture:

$$(u', v') = \left( \frac{a_{11} u + a_{12} v + a_{13}}{a_{31} u + a_{32} v + a_{33}},\ \frac{a_{21} u + a_{22} v + a_{23}}{a_{31} u + a_{32} v + a_{33}} \right) \tag{3}$$
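In practice, step 4-5 need not solve equations 1 and 2 by hand; a library such as OpenCV can do it, as in the sketch below, where getPerspectiveTransform solves for the matrix A from the two four-point sets and warpPerspective applies equation 3.

```python
# A sketch of step 4-5 with OpenCV: getPerspectiveTransform solves
# equations 1-2 for the matrix A, and warpPerspective applies equation 3
# to every pixel of the target picture.
import cv2
import numpy as np

def correct_perspective(target_img, quad_r, quad_s, template_size):
    """Map the target quadrangle R (step 4-3) onto the template quadrangle
    S (step 4-4) and return the corrected picture."""
    r = np.float32(quad_r)
    s = np.float32(quad_s)
    a = cv2.getPerspectiveTransform(r, s)  # perspective matrix A
    w, h = template_size
    return cv2.warpPerspective(target_img, a, (w, h))
```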
Step 5, after the target picture has been corrected, performing detection and recognition again with the DB and CRNN-CTC algorithms to obtain all recognition contents of the target picture and the corresponding coordinate information; performing IOU calculation between the coordinates of each template identification area and all recognition content coordinates of the target picture; finding, for each template identification area, the target-picture coordinates with the maximum IOU value; forming a new frame from the two left points of the template coordinates and the two right points of the corresponding target-picture coordinates; and finally recognizing the area of the new frame with a character recognition algorithm to obtain the recognition content. As shown in fig. 4, the specific steps are as follows:
5-1, detecting and identifying the corrected target picture again by using the DB and CRNN-CTC algorithms to obtain the recognition content and coordinate information;
5-2, calculating the Intersection over Union (IOU) between the coordinates of the identification areas obtained in step 2 and the coordinates obtained in step 5-1, as shown in equation 4:

$$IOU = \frac{|A \cap B|}{|A \cup B|} \tag{4}$$

where $A$ denotes the frame of the content area identified in step 5-1 and $B$ denotes the frame of the identification area obtained in step 2.
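For axis-aligned rectangles, equation 4 reduces to the few lines below; quadrilateral frames are assumed to have been reduced to their bounding boxes first.

```python
# A sketch of equation 4 for axis-aligned boxes (x1, y1, x2, y2); a
# quadrilateral frame is assumed reduced to its bounding box first.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```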
5-3, finding the identification area corresponding to the maximum IOU value;
5-4, forming a new frame from the two coordinate points on the left side of the identification area frame and the two coordinate points on the right side of the matched content area frame of the corresponding target picture;
5-5, recognizing the new frame in the target picture by using the CRNN-CTC algorithm to acquire the content of the identification area, as shown in fig. 7;
5-6, repeating the steps from 5-3 to 5-5 until all the areas to be identified are identified.
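Steps 5-3 and 5-4 can be sketched as follows, reusing the iou() helper from the sketch above; quadrilateral frames are assumed to be ordered top-left, top-right, bottom-right, bottom-left, and the field names are illustrative.

```python
# A sketch of steps 5-3 and 5-4, reusing iou() from the sketch above.
# Quadrilateral frames are assumed ordered TL, TR, BR, BL.
def to_bbox(quad):
    """Axis-aligned bounding box (x1, y1, x2, y2) of a quadrilateral frame."""
    xs, ys = [p[0] for p in quad], [p[1] for p in quad]
    return (min(xs), min(ys), max(xs), max(ys))

def frames_to_recognize(identification_areas, content_areas):
    """For each template identification area, find the content area with the
    maximum IOU (step 5-3) and build the new frame from the left two points
    of the identification frame and the right two points of the matched
    content frame (step 5-4)."""
    frames = []
    for ident in identification_areas:
        best = max(content_areas,
                   key=lambda c: iou(to_bbox(ident["box"]), to_bbox(c["box"])))
        tl, _, _, bl = ident["box"]  # left two points of the template frame
        _, tr, br, _ = best["box"]   # right two points of the content frame
        frames.append([tl, tr, br, bl])
    return frames
```

Each returned frame is then cropped from the corrected target picture and passed to the CRNN-CTC recognizer (step 5-5).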
OCR recognition is performed after the perspective transformation correction. In the recognition process, the content area frames detected in the target picture deviate from the identification area frames in the template picture, or deviate because of over-recognition or missed recognition, so the corresponding content in the target picture cannot be found accurately from the coordinates of the identification areas in the template picture alone. In the embodiment of the invention, the area to be identified in the target picture is found by maximizing the IOU value; a new frame is then formed from the two left points of the template coordinates and the two right points of the corresponding target-picture coordinates; and finally the area of the new frame is recognized with a character recognition algorithm to obtain the recognition content. This solves the problem of frame deviation and greatly improves the recognition rate.
The custom universal template text recognition method provided by the embodiment of the invention can recognize various documents (identity cards, invoices, value-added tax invoices, contracts, and the like). By converting the target picture into the template state, it overcomes problems such as rotation, perspective, and occlusion, so that each recognition area appears accurately at the position of the corresponding template identification frame, improving the final document recognition rate. The method has the following beneficial effects:
1. In some scenarios, such as financial reimbursement, only part of the information on an invoice needs to be collected; the identification areas can be set according to actual needs without full-page recognition, which improves recognition efficiency;
2. The method uses the reference fields to find the largest quadrangle for the projective transformation, solving the problem that the four edges cannot be obtained when the target picture is occluded or partial information is missing;
3. Because the resolution of the target picture may be poor, over-recognition or missed recognition can occur during recognition; the invention uses the IOU to find a new frame, solving the problems of over-recognition and missed recognition;
4. By creating a custom template, the method can be applied to any text, achieving universality of text recognition.
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and are not intended to limit the present invention, and those skilled in the art can make various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
The embodiment of the second aspect of the invention provides a text recognition device based on a custom universal template, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the text recognition method based on the custom universal template as described in the embodiment of the first aspect above.
It should be noted that in the description of the present specification, reference to the description of the terms "one embodiment", "some embodiments", "illustrative embodiments", "examples", "specific examples", or "some examples", etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Although some embodiments described herein include some features included in other embodiments instead of others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. The particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. For example, in the claims, any of the claimed embodiments may be used in any combination.
The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
Any reference signs placed between parentheses shall not be construed as limiting the claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.