Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 is a flowchart of the steps of a method for structured processing of a text image. As shown in Fig. 1, the method may include:
Step 101, determining the text boxes in the text image and the item text boxes among all the text boxes.
In the embodiment of the present invention, the text image may be an image containing text content, such as a medical invoice image uploaded by a client, a scanned certificate image, and the like.
In practical applications, the text content in a text image follows a structured format. Specifically, each piece of text content lies in its own text box, and the positions of the text boxes are constrained by the structured format. For example, the text image contains item text boxes, which represent item names, and attribute value text boxes, which represent the attribute values of those items. Text boxes of different types occupy different positions, and certain positional constraints, such as row and column relationships, hold between them: the attribute value text box corresponding to an item text box lies in a region laterally adjacent to that item text box. These constraints all need to be considered when processing the text image in a structured manner.
Further, in addition to text boxes for a specific field (such as medical items), a text image also contains text boxes related to other content. For example, a medical invoice also contains text boxes for other information, such as the payee, the rechecker, and the like.
Step 102, determining a header text box and an attribute name text box among all the text boxes.
In the text image structuring process of the embodiment of the present invention, the more important text boxes further include a header text box and an attribute name text box. Fig. 2 shows a text image provided by the embodiment of the present invention, in which a plurality of text boxes 10 are identified. The header text box 11 represents the item information at the header position; for example, the content of the header text box 11 may be "item name". The plurality of item text boxes 12 identified in step 101 represent specific item content, for example: "the following is the list item [ guggu ] movement (lactulose oral solution)", "nonpareil bran (wheat cellulose particles)", "mucosal anti-infection treatment 4 (active silver ion antibacterial solution (silverton)", and the like. The item text box 12 may be a subordinate text box of the header text box 11 and may be arranged in a region longitudinally adjacent to the header text box 11. The attribute name text box 13 represents an attribute name, with content such as "quantity/unit", "amount (yuan)", and the like.
Further, the header text box and the attribute name text box may be obtained by matching preset keywords, for example, the header text box may be matched by the keyword "item name", and the attribute name text box may be matched by the keywords "number" and "amount".
Step 103, according to the orientation relationship among the header text box, the item text box and the attribute name text box, determining attribute value text boxes respectively corresponding to the item text box and the attribute name text box from all the text boxes, and determining multi-line printing item text boxes from all the item text boxes.
Specifically, referring to Fig. 2, limitations of current OCR recognition give rise to a multi-line printing problem for the item text box 12: a complete item text box 12 is erroneously recognized as several multi-line print item text boxes 121. For example, the complete item text box 12 "the following is the list item [ guggu ] movement (lactulose oral solution)" is erroneously recognized as multi-line print item text boxes 121 including "the following is the list item" and "the solution". Such erroneous recognition greatly reduces the accuracy of the structured output of the text image.
In the embodiment of the present invention, the orientation relationship among the header text box, the item text box, and the attribute name text box is used both to produce the structured output of the text image and to resolve the multi-line print item text boxes among the item text boxes. The structured output is based on determining the attribute value text box corresponding to each item text box. An attribute value text box represents the specific value of an item text box for a given attribute name. For example, if the item text box is a treatment fee, then for the attribute name "amount" the content of the corresponding attribute value text box may be an amount of money, such as 60.
Referring to Fig. 2, the following orientation characteristics can be seen: the item text box 12 and its corresponding attribute value text box 14 are arranged in a horizontal row; the item text box 12 and the header text box 11 are arranged in a vertical column; the header text box 11 and the attribute name text box 13 are arranged in a horizontal row, with the header text box 11 at the head of that row; and the attribute name text box 13 and the attribute value text box 14 are arranged in a vertical column. Based on these orientation relationships, a horizontal straight line 21 may first be set starting from the item text box 12 (the slope of the straight line 21 is calculated from the slope of the text in the text image), and the text box 10 that overlaps the straight line 21 is determined to be the attribute value text box 14 corresponding to the item text box 12. All the item text boxes 12 are traversed in this manner, and the attribute value text box 14 corresponding to each item text box 12 is determined in turn, so as to obtain a preliminary structured output of the text image.
In addition, since the attribute name text box 13 and the attribute value text box 14 are arranged in a vertical column, a downward vertical straight line 22 can be drawn from the attribute name text box 13 (the slope of the straight line 22 is calculated from the slope of the text in the text image), and a text box framed by the straight line 22 can be taken as an attribute value text box 14 under the attribute name text box 13.
Further, referring to fig. 2, it can be seen that the horizontal straight line 23, which is set starting from the multi-line print item text box 121, does not overlap any of the attribute value text boxes 14, and therefore the multi-line print item text boxes 121 in all of the item text boxes 12 can be determined based on this orientation relationship.
Step 104, merging the multi-line print item text boxes with the text boxes of the adjacent lines when establishing the structural relationship of the text image according to the correspondence among the item text box, the attribute name text box, and the attribute value text box.
In the embodiment of the invention, after OCR recognition is carried out on the text image and the structured corresponding relation of the project text box, the attribute name text box and the attribute value text box is determined, the multi-line printing project text box and the text boxes of the upper and lower adjacent lines can be combined, thereby solving the multi-line printing problem in structured output and improving the quality of structured output.
For example, referring to Fig. 2, the erroneously recognized multi-line print item text boxes 121 may be "the following is the list item" and "the solution". These are merged with the text boxes of the upper and lower adjacent lines to obtain the correct and complete item text box: "the following is the list item [ guggu ] movement (lactulose oral solution)".
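The merging in step 104 can be illustrated with a minimal Python sketch. The text does not specify how a multi-line print item text box is assigned to an adjacent line, so the rule used here (attach each fragment to the item text box with the nearest vertical center, then join the text from top to bottom) is an assumption, as are the names `Item` and `merge_fragments` and the sample strings.

```python
from typing import List, Tuple

# (vertical center, recognized text, is_multi_line_fragment); a hypothetical record layout.
Item = Tuple[float, str, bool]


def merge_fragments(items: List[Item]) -> List[str]:
    """Join each multi-line fragment with the nearest complete item text box."""
    ordered = sorted(items, key=lambda it: it[0])  # top to bottom
    anchors = [i for i, it in enumerate(ordered) if not it[2]]
    if not anchors:
        return [it[1] for it in ordered]  # nothing to attach fragments to
    groups = {a: [] for a in anchors}
    for i, it in enumerate(ordered):
        # Assumption: the nearest vertical center decides which line a fragment joins.
        nearest = min(anchors, key=lambda a: abs(ordered[a][0] - it[0]))
        groups[nearest].append(i)
    # Indices were appended in top-to-bottom order, so the joined text reads naturally.
    return [" ".join(ordered[i][1] for i in groups[a]) for a in anchors]


items = [
    (35.0, "the following is the list item", True),
    (55.0, "(lactulose oral", False),
    (75.0, "solution)", True),
    (100.0, "treatment fee", False),
]
merged = merge_fragments(items)
```

In this toy input the two fragments sit 20 units above and below the second entry, so both are joined to it, and the unrelated fourth entry is left unchanged.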
in summary, according to the structured processing method for a text image provided by the embodiments of the present invention, for the orientation relationship among the header text box, the item text box, and the attribute name text box in the preliminary OCR recognition result of the text image, while determining and outputting the structured correspondence among the item text box, the attribute name text box, and the attribute value text box, the orientation relationship is further provided, and multiple lines of the item text boxes are printed and merged in all the item text boxes, so that the problem of multiple lines of the structured output of the text image is solved, the quality of the structured output of the text image is improved, and in addition, the whole process can be automatically implemented through a machine algorithm, the degree of manual participation is reduced, and the labor cost is reduced.
Fig. 3 is a flowchart of steps of another method for processing a text image in a structured manner, according to an embodiment of the present invention, as shown in fig. 3, the method may include:
Step 201, determining a text box in the text image and text content contained in the text box.
In the embodiment of the present invention, the text boxes in the text image can be detected by a text box detection model, which outputs a set box_set of text boxes. Each text box in the set comprises 8 values $[x_0, y_0, x_1, y_1, x_2, y_2, x_3, y_3]$, which represent the coordinates of the 4 vertices of the text box: upper left, upper right, lower left, and lower right.
Step 202, inputting the text content of the text box into a text classification model to obtain the item text content with the type of an item name, and determining the item text box corresponding to the item text content.
In this step, after the text boxes are identified, the area where each text box is located may be input into an item name text classification model. The model first recognizes the text content in the text boxes to obtain a text content set info_set, and then performs semantic classification on the text content to obtain a set pro_info_set of item text content whose classification result is an item name. Finally, the text boxes corresponding to the item text content are determined to be item text boxes, giving an item text box set pro_box_set.
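The flow from box_set and info_set to pro_info_set and pro_box_set can be sketched as follows. The text does not describe the classification model itself, so `toy_classifier` is only a hypothetical stand-in for a trained model, and the function name `select_item_boxes` and the label `"item_name"` are likewise assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Box = Sequence[float]  # eight values: x0, y0, x1, y1, x2, y2, x3, y3


def select_item_boxes(
    box_set: List[Box],
    info_set: List[str],
    classify: Callable[[str], str],
) -> Tuple[List[str], List[Box]]:
    """Keep the text and boxes whose text is classified as an item name."""
    pro_info_set: List[str] = []
    pro_box_set: List[Box] = []
    for box, text in zip(box_set, info_set):
        if classify(text) == "item_name":
            pro_info_set.append(text)
            pro_box_set.append(box)
    return pro_info_set, pro_box_set


def toy_classifier(text: str) -> str:
    # Hypothetical rule standing in for semantic classification by a trained model.
    return "item_name" if "fee" in text.lower() else "other"


box_set = [
    [0, 0, 100, 0, 100, 20, 0, 20],
    [0, 30, 100, 30, 100, 50, 0, 50],
    [200, 30, 240, 30, 240, 50, 200, 50],
]
info_set = ["item name", "treatment fee", "60"]
pro_info_set, pro_box_set = select_item_boxes(box_set, info_set, toy_classifier)
```

Passing the classifier in as a callable keeps the sketch independent of any particular model or OCR library.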
Step 203, matching the preset keywords with the text content of the text box, and determining the header text box and the attribute name text box.
In this step, the text content of the text box may be matched with the preset header keyword to determine the header text box, and the text content of the text box may be matched with the preset attribute name keywords to determine the attribute name text box.
For example, referring to Fig. 2, the header text box 11 may be matched by the keyword "item name", and the attribute name text box 13 may be matched by the keywords "number" and "amount".
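A minimal sketch of the keyword matching in step 203 is given below. It assumes simple case-insensitive substring matching, which the text does not prescribe; the keyword lists contain only the examples named in the text, and the function name `match_keyword_boxes` is hypothetical.

```python
from typing import Dict, List

# Example keywords taken from the text; a real deployment would configure its own lists.
HEADER_KEYWORDS = ["item name"]
ATTRIBUTE_NAME_KEYWORDS = ["number", "amount"]


def match_keyword_boxes(info_set: List[str]) -> Dict[str, List[int]]:
    """Return the indices of header and attribute name text boxes in info_set."""
    result: Dict[str, List[int]] = {"header": [], "attribute_name": []}
    for index, text in enumerate(info_set):
        lowered = text.lower()
        if any(keyword in lowered for keyword in HEADER_KEYWORDS):
            result["header"].append(index)
        elif any(keyword in lowered for keyword in ATTRIBUTE_NAME_KEYWORDS):
            result["attribute_name"].append(index)
    return result


texts = ["Item name", "Number/unit", "Amount (yuan)", "Treatment fee", "60"]
matches = match_keyword_boxes(texts)
```

The returned indices can be used to look up the corresponding boxes in box_set.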
Step 204, according to the orientation relation among the header text box, the item text box and the attribute name text box, determining attribute value text boxes respectively corresponding to the item text box and the attribute name text box from all the text boxes, and determining multi-line printing item text boxes in all the item text boxes.
This step may specifically refer to step 103, which is not described herein again.
Optionally, in order to determine the attribute value text box corresponding to the attribute name text box, step 204 may specifically include:
Substep 2041, determining a longitudinal slope according to the horizontal slope of a first straight line formed from the header text box to the attribute name text box.
Optionally, the first straight line is a straight line formed from the center point of the header text box to the center point of the attribute name text box.
In the embodiment of the present invention, in order to determine the attribute value text box corresponding to the attribute name text box, refer to Fig. 4, which shows a schematic diagram of a local region of a text image according to an embodiment of the present invention. A first straight line 31 may first be formed from the center point of the header text box to the center point of the attribute name text box. The first straight line 31 reflects the horizontal slope $k$ of the text in the whole text image, and the longitudinal slope $k'$ of the text can then be obtained from the horizontal slope as $k' = -1/k$, since the longitudinal direction is perpendicular to the horizontal direction.
Specifically, from the 8 vertex coordinate values $[x_0, y_0, x_1, y_1, x_2, y_2, x_3, y_3]$ contained in each text box in the set box_set, the center point of the header text box and the center point of the attribute name text box can be obtained. The x and y coordinates of a center point are calculated as follows:
$$x_{center} = (x_0 + x_1 + x_2 + x_3)/4$$
$$y_{center} = (y_0 + y_1 + y_2 + y_3)/4$$
The calculation yields the center point coordinates $(x_{cc}, y_{cc})$ of the header text box and the center point coordinates $(x_{pc}, y_{pc})$ of the attribute name text box.
The first straight line 31 is calculated as follows:
$$(y - y_{pc})(x_{cc} - x_{pc}) - (x - x_{pc})(y_{cc} - y_{pc}) = 0$$
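The horizontal slope $k$ is calculated as follows:
$$k = \frac{y_{cc} - y_{pc}}{x_{cc} - x_{pc}}$$
The longitudinal slope $k'$ is calculated as follows:
$$k' = -\frac{1}{k} = -\frac{x_{cc} - x_{pc}}{y_{cc} - y_{pc}}$$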
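The center point and slope calculations of substep 2041 can be sketched in Python as follows. The longitudinal slope is taken as the negative reciprocal of the horizontal slope, which is the usual relation between perpendicular lines. The function names and the sample coordinates are illustrative assumptions, and returning an infinite slope for perfectly level text is a choice made here rather than something stated in the text.

```python
import math
from typing import Sequence, Tuple

Box = Sequence[float]  # x0, y0, x1, y1, x2, y2, x3, y3


def center_point(box: Box) -> Tuple[float, float]:
    """Average the four vertices; the result does not depend on vertex order."""
    return sum(box[0::2]) / 4, sum(box[1::2]) / 4


def slopes(header_box: Box, attr_name_box: Box) -> Tuple[float, float]:
    """Return (k, k'): the horizontal slope and the perpendicular longitudinal slope."""
    x_cc, y_cc = center_point(header_box)
    x_pc, y_pc = center_point(attr_name_box)
    if x_cc == x_pc:
        raise ValueError("center points share an x coordinate; horizontal slope is undefined")
    k = (y_cc - y_pc) / (x_cc - x_pc)
    # Level text (k == 0) means the longitudinal lines are exactly vertical.
    k_prime = math.inf if k == 0 else -1.0 / k
    return k, k_prime


header = [0, 0, 100, 0, 100, 20, 0, 20]            # center (50, 10)
attr_name = [200, 3, 300, 3, 300, 23, 200, 23]     # center (250, 13)
k, k_prime = slopes(header, attr_name)
```

Because an exactly vertical line has no finite slope, code that draws the longitudinal lines may prefer a direction vector such as (-k, 1) over the scalar k'.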
Substep 2042, setting longitudinal second straight lines on both sides of the attribute name text box according to the longitudinal slope.
In this step, referring to Fig. 4, since the attribute name text box 13 and the attribute value text box 14 are arranged in a vertical column, the boundary points on the two sides of the attribute name text box 13 may be taken as $(x_3, y_3)$ and $(x_2, y_2)$, and two downward longitudinal second straight lines 32 are drawn from these boundary points, each with a slope equal to the longitudinal slope $k'$.
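For reference, the two second straight lines 32 can be written in point-slope form. This is a restatement that assumes the boundary points $(x_3, y_3)$ and $(x_2, y_2)$ and the longitudinal slope $k'$ described in this step; it is not a formula given in the text.

```latex
y - y_3 = k'\,(x - x_3), \qquad y - y_2 = k'\,(x - x_2)
```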
Substep 2043, determining the text box overlapped with the second straight line as the attribute value text box corresponding to the attribute name text box.
In this step, referring to fig. 4, a text box selected by the second straight line 32 may be regarded as the attribute value text box 14 under the attribute name text box 13, in accordance with the existence of a constraint relationship arranged in a vertical column between the attribute name text box 13 and the attribute value text box 14. The attribute value text box 14 is used to characterize the specific attribute value of the item text box 12 for a certain attribute name.
Specifically, for each text box, whether the second straight line overlaps with the text box may be determined as follows:
a. The intersection of two straight lines is calculated from the following system of equations:
$$\begin{cases} y = h(x) \\ y = f(x) \end{cases}$$
where $h(x)$ and $f(x)$ are the equations of the two straight lines; the intersection point $(x, y)$ is obtained by solving this system of two linear equations in two unknowns.
b. Calculating the straight line equation of the bottom edge of the text box:
$$(y - y_0)(x_1 - x_0) - (x - x_0)(y_1 - y_0) = 0$$
c. Using the result of a. and the straight line equation of the bottom edge of the text box, calculating the intersection points $(x_{tl}, y_{tl})$ and $(x_{tr}, y_{tr})$ of the bottom edge of the text box with the two second straight lines, respectively;
d. Determining whether the text box is an attribute-value text box under an attribute-name text box according to the following determination conditions:
if $x_3 < x_{tr} < x_2$ is satisfied, the text box is an attribute value text box under the attribute name text box;
if $x_3 < x_{tl} < x_2$ is satisfied, the text box is an attribute value text box under the attribute name text box;
if it satisfies
The text box is an attribute value text box that is under the attribute name text box.
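The overlap test of substep 2043 can be approximated with the sketch below, which rests on several assumptions not stated in the text. It assumes image coordinates with y increasing downward; it takes the vertices at indices 2 and 3 as the lower corners of the attribute name text box; and it intersects the second straight lines with the candidate's edge through $(x_0, y_0)$ and $(x_1, y_1)$, the edge whose equation is given in step b. The comparison is made against that edge's own horizontal extent, and the `enclosed` case (the edge lying entirely between the two lines) is an added assumption rather than a condition taken from the text.

```python
from typing import Optional, Sequence, Tuple

Box = Sequence[float]  # x0, y0, x1, y1, x2, y2, x3, y3
Point = Tuple[float, float]


def intersect(p: Point, d: Point, a: Point, b: Point) -> Optional[Tuple[float, float]]:
    """Intersect the line p + t*d with the line through a and b.

    Returns (t, x) at the intersection, or None if the lines are parallel.
    """
    ex, ey = b[0] - a[0], b[1] - a[1]
    det = ex * d[1] - d[0] * ey
    if det == 0:
        return None
    t = (ex * (a[1] - p[1]) - ey * (a[0] - p[0])) / det
    return t, p[0] + t * d[0]


def is_under_attribute_name(attr_box: Box, candidate: Box, k: float) -> bool:
    """Check whether `candidate` lies in the column below `attr_box`."""
    direction = (-k, 1.0)  # perpendicular to (1, k), i.e. slope k' = -1/k, pointing down
    starts = [(attr_box[4], attr_box[5]), (attr_box[6], attr_box[7])]
    a, b = (candidate[0], candidate[1]), (candidate[2], candidate[3])
    hits = [intersect(s, direction, a, b) for s in starts]
    if any(h is None or h[0] <= 0 for h in hits):
        return False  # parallel edge, or candidate is not below the attribute name box
    lo, hi = sorted((a[0], b[0]))
    left, right = sorted(h[1] for h in hits)
    crosses = lo < left < hi or lo < right < hi
    enclosed = left <= lo and hi <= right  # assumed extra case, not taken from the text
    return crosses or enclosed


attr = [200, 3, 300, 3, 300, 23, 200, 23]
inside = [210, 50, 290, 50, 290, 70, 210, 70]
partial = [150, 50, 260, 50, 260, 70, 150, 70]
far = [400, 50, 480, 50, 480, 70, 400, 70]
above = [210, -40, 290, -40, 290, -20, 210, -20]
```

Using the direction vector (-k, 1) instead of the scalar slope keeps the sketch valid for perfectly level text, where k is zero and k' is not finite.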
Optionally, in order to determine the multi-line print item text boxes among the item text boxes, step 204 may further include:
Substep 2044, constructing a horizontal third straight line with the item text box as a starting point according to the horizontal slope.
In this step, referring to fig. 4, in order to determine a multi-line print item text box 121 in the item text box 12, a horizontal third straight line 33 may be constructed starting from the item text box 12 according to a horizontal slope. The starting point of the third straight line 33 may specifically be the center point of the item text box 12.
Substep 2045, in the case where the third straight line does not overlap any attribute value text box, determining that the item text box corresponding to the third straight line is a multi-line print item text box.
Referring to fig. 4, it can be seen that the third straight line 33 in the lateral direction, which is set starting from the multi-line print item text box 121, does not overlap any of the attribute value text boxes 14, and therefore the multi-line print item text box 121 in all of the item text boxes 12 can be determined from this orientation relationship.
For example, for the correct and complete item text box "the following is the list item [ guggu ] movement (lactulose oral solution)", the multi-line print item text boxes 121 "the following is the list item" and "the solution" can be identified in this way.
Optionally, in order to determine the attribute value text box corresponding to the item text box, step 204 may further include:
Substep 2046, determining the attribute value text box that overlaps the third straight line as the attribute value text box corresponding to the item text box from which the third straight line starts.
With further reference to Fig. 4, since the item text box 12 and its corresponding attribute value text box 14 are arranged in a horizontal row, the horizontal third straight line 33 may first be set with the item text box 12 as a starting point, and the text box 10 that overlaps the third straight line 33 is determined to be the attribute value text box 14 corresponding to the item text box 12. All the item text boxes 12 are traversed in this manner, and the attribute value text box 14 corresponding to each item text box 12 is determined in turn, giving a preliminary structured output of the text image. Finally, the structured outputs of substeps 2041 to 2045 are combined to obtain the complete structured output of the text image.
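Substeps 2044 to 2046 can be sketched together: a third straight line is drawn from the center point of each item text box, item text boxes whose line meets at least one attribute value text box are paired with those boxes, and the rest are flagged as multi-line print item text boxes. The text does not define how overlap with the third straight line is tested, so the vertex side test below is an assumption, as are the function names and the sample coordinates; the line is also treated as infinite in both directions for simplicity.

```python
from typing import Dict, List, Sequence, Tuple

Box = Sequence[float]  # x0, y0, x1, y1, x2, y2, x3, y3


def center_point(box: Box) -> Tuple[float, float]:
    return sum(box[0::2]) / 4, sum(box[1::2]) / 4


def overlaps_line(box: Box, origin: Tuple[float, float], k: float) -> bool:
    """True if the line through `origin` with slope k passes through `box`."""
    # Signed vertical offset of each vertex from the line; mixed signs mean a crossing.
    offsets = [(box[i + 1] - origin[1]) - k * (box[i] - origin[0]) for i in range(0, 8, 2)]
    return min(offsets) <= 0 <= max(offsets)


def match_rows(
    item_boxes: List[Box], value_boxes: List[Box], k: float
) -> Tuple[Dict[int, List[int]], List[int]]:
    """Pair each item box with the value boxes on its row; collect unmatched items."""
    rows: Dict[int, List[int]] = {}
    multi_line: List[int] = []
    for i, item in enumerate(item_boxes):
        origin = center_point(item)
        hits = [j for j, v in enumerate(value_boxes) if overlaps_line(v, origin, k)]
        if hits:
            rows[i] = hits
        else:
            multi_line.append(i)  # no value on this row: treated as a multi-line fragment
    return rows, multi_line


item_boxes = [
    [0, 30, 90, 30, 90, 40, 0, 40],   # fragment above the value row
    [0, 45, 90, 45, 90, 65, 0, 65],   # line aligned with the value box
    [0, 70, 90, 70, 90, 80, 0, 80],   # fragment below the value row
]
value_boxes = [[200, 48, 240, 48, 240, 62, 200, 62]]
rows, multi_line = match_rows(item_boxes, value_boxes, 0.0)
```

The indices in `multi_line` identify the item text boxes that step 205 would merge with the adjacent lines.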
Step 205, merging the multi-line printed item text box and the text boxes of the adjacent lines when establishing the structural relationship of the text image according to the corresponding relationship of the item text box, the attribute name text box and the attribute value text box.
This step may specifically refer to step 104, which is not described herein again.
In summary, the structured processing method for a text image provided by the embodiments of the present invention uses the orientation relationship among the header text box, the item text box, and the attribute name text box in the preliminary OCR recognition result of the text image. While determining and outputting the structured correspondence among the item text box, the attribute name text box, and the attribute value text box, the method further uses this orientation relationship to determine and merge the multi-line print item text boxes among all the item text boxes. This resolves the multi-line printing problem in the structured output of the text image and improves the quality of that output. In addition, the whole process can be carried out automatically by a machine algorithm, which reduces manual involvement and labor cost.
Fig. 5 is a block diagram of an apparatus for structured processing of text images according to an embodiment of the present invention, and as shown in fig. 5, the apparatus may include:
a recognition module 301, configured to determine text boxes in the text image and item text boxes in all text boxes;
a first determining module 302, configured to determine a header text box and an attribute name text box of all text boxes;
a second determining module 303, configured to determine, according to the orientation relationship among the header text box, the item text box, and the attribute name text box, attribute value text boxes respectively corresponding to the item text box and the attribute name text box from all the text boxes, and determine multi-line print item text boxes from all the item text boxes;
a merging module 304, configured to merge the multiple lines of printed item text boxes with text boxes in adjacent lines when a structural relationship of the text image is established according to the correspondence of the item text box, the attribute name text box, and the attribute value text box.
Optionally, the identifying module 301 includes:
a first determining submodule, configured to determine a text box in the text image and the text content contained in the text box;
a classification submodule, configured to input the text content of the text box into a text classification model, obtain the item text content whose type is an item name, and determine the item text box corresponding to the item text content.
Optionally, the first determining module 302 includes:
a second determining submodule, configured to determine the header text box and the attribute name text box by matching preset keywords with the text content of the text box.
Optionally, the second determining module 303 includes:
a third determining submodule, configured to determine the longitudinal slope according to the horizontal slope of a first straight line formed from the header text box to the attribute name text box;
a fourth determining submodule, configured to set longitudinal second straight lines on both sides of the attribute name text box according to the longitudinal slope;
a fifth determining submodule, configured to determine a text box overlapped with the second straight line as an attribute value text box corresponding to the attribute name text box.
Optionally, the second determining module 303 includes:
a sixth determining submodule, configured to construct a horizontal third straight line with the item text box as a starting point according to the horizontal slope;
a seventh determining sub-module for determining an item text box corresponding to the third straight line as the multi-line print item text box, in a case where the third straight line does not overlap with the attribute value text box.
Optionally, the second determining module 303 includes:
an eighth determining sub-module, configured to determine the attribute-value text box that overlaps the third straight line as the attribute-value text box corresponding to the item text box corresponding to the third straight line.
Optionally, the first straight line is a straight line formed from the center point of the header text box to the center point of the attribute name text box.
In summary, the structured processing apparatus for text images according to the embodiments of the present invention determines and outputs the structured correspondence among the item text box, the attribute name text box, and the attribute value text box in the preliminary OCR recognition result of the text image, and further uses the orientation relationship among the text boxes to determine and merge the multi-line print item text boxes among all the item text boxes. This resolves the multi-line printing problem in the structured output of text images and improves the quality of that output.
For the above device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for the relevant points, refer to the partial description of the method embodiment.
Preferably, an embodiment of the present invention further provides a computer device, which includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, and when being executed by the processor, the computer program implements each process of the above-mentioned method for processing a text image in a structured manner, and can achieve the same technical effect, and in order to avoid repetition, details are not described here again.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above-mentioned text image structuring processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As those skilled in the art can readily appreciate, any combination of the above embodiments is possible, and any such combination is therefore an embodiment of the present invention; for reasons of space, these combinations are not described in detail here.
The structured processing methods of text images provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The structure required to construct a system incorporating aspects of the present invention will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the structured processing method of text images according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.