CN113283432B

CN113283432B - Image recognition, text sorting method and device

Info

Publication number: CN113283432B
Application number: CN202010106180.7A
Authority: CN
Inventors: 郑琪; 于智; 李亮城; 高飞宇; 王永攀; 张建锋
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2020-02-20
Filing date: 2020-02-20
Publication date: 2025-04-04
Anticipated expiration: 2040-02-20
Also published as: CN113283432A

Abstract

Embodiments of the application provide an image recognition and text ordering method and device. The method comprises: identifying a plurality of pieces of text information to be ordered contained in an image to be identified; determining the reading order of the plurality of pieces of text information to be ordered according to the features corresponding to each of them, wherein the features carry semantic features; and sorting the plurality of pieces of text information to be ordered according to the reading order to obtain a text information sequence to be ordered. The ordering method provided by the embodiments of the application is suitable for images in any text typesetting format, and therefore has a wide range of application and good applicability.

Description

Image recognition and text ordering method and device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and apparatus for image recognition and text sorting.

Background

With the development of computer technology, image and text recognition technology has been developed, and by using the technology, the device can automatically recognize the text in the image.

In the prior art, the characters recognized from an image are by default read and ordered from left to right and from top to bottom. This simple reading order suits only pictures with a simple layout; it fails for pictures with a complex layout (e.g., column or ring layouts), because it breaks the original semantic coherence.

It can be seen that the reading ordering method in the prior art has poor applicability or universality.

Disclosure of Invention

In view of the foregoing, the present application has been made to provide an image recognition, text sorting method and apparatus that solves or at least partially solves the foregoing problems.

Thus, in one embodiment of the present application, an image recognition method is provided. The method comprises the following steps:

identifying a plurality of text information to be sequenced from the image to be identified;

determining the reading sequence of the plurality of text information to be ordered according to the corresponding characteristics of the plurality of text information to be ordered, wherein the characteristics carry semantic characteristics;

and sorting the plurality of text messages to be sorted according to the reading sequence to obtain a text message sequence to be sorted.

In yet another embodiment of the present application, a text ordering method is provided. The method comprises the following steps:

acquiring a plurality of text information to be sequenced;

synthesizing the respective corresponding characteristics of the plurality of word information to be ordered and the adjacent relation among the plurality of word information to be ordered, and determining the reading sequence of the plurality of word information to be ordered;

and sequencing the plurality of first text messages according to the reading sequence to obtain a first text message sequence.

In one embodiment of the present application, an image recognition method is provided. The method comprises the following steps:

determining the character types of the plurality of character information to be ordered;

acquiring an arrangement rule corresponding to the character type;

and sorting the plurality of word information to be sorted according to the arrangement rule to obtain a word information sequence to be sorted.

In another embodiment of the present application, an electronic device is provided. The device comprises a memory and a processor, wherein,

The memory is used for storing programs;

The processor, coupled to the memory, is configured to execute the program stored in the memory for:

Acquiring a plurality of text information to be sequenced;

Acquiring an arrangement rule corresponding to the character type;

According to the technical scheme provided by the embodiment of the application, after the plurality of text information to be ordered contained in the image to be identified is identified, the text information to be ordered is read and ordered by combining the semantics corresponding to the text information to be ordered. The sequencing method provided by the embodiment of the application is suitable for images in any text typesetting format, and has wide application range and good applicability.

According to the technical scheme provided by the embodiment of the application, when the plurality of text messages to be sequenced are read and sequenced, the characteristics of each text message are considered, and the adjacent relation among the plurality of text messages to be sequenced is considered, so that the accuracy of sequencing can be effectively improved, and the semantic relevance of the finally obtained text message sequence is improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1a is a diagram illustrating an example of an image recognition method according to an embodiment of the present application;

FIG. 1b is a diagram illustrating an example of an image recognition method according to another embodiment of the present application;

FIG. 1c is a flowchart illustrating an image recognition method according to an embodiment of the present application;

FIG. 2 is a flowchart of a text sorting method according to an embodiment of the present application;

FIG. 3 is a block diagram illustrating an image recognition apparatus according to an embodiment of the present application;

FIG. 4 is a block diagram of a text sorting apparatus according to another embodiment of the present application;

FIG. 5 is a block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

At present, existing image text recognition products provide by default a simple reading order from left to right and from top to bottom, and this simple ordering scheme fails for pictures with complex typesetting.

In the prior art, two methods exist, namely a typesetting analysis method and a structured template method.

The typesetting analysis method specifically comprises two modes of bottom-up and top-down. The bottom-up mode utilizes the visual information of the text blocks, such as the characteristics of distance, size, color and the like, to synthesize paragraphs through rules, and after synthesizing the paragraphs, the reading sequence from left to right and from top to bottom is still adopted for text ordering in the paragraphs. The top-down method is to cut the pictures according to paragraphs by directly using an image segmentation method, and after the segmentation of the paragraphs is completed, the sequence reading inside the paragraphs is carried out according to the reading sequence from left to right and from top to bottom.

The typesetting analysis method can handle most document pictures, that is, pictures composed of large blocks of regularly arranged text. However, it performs poorly on complex layouts that mix images and text, such as online advertisement images and e-commerce description images.

The structured template method outputs a text structure according to configured template rules. It can handle more complex typesetting, but it is suitable only for relatively fixed typesetting formats, such as invoices, certificates and bank cards, and cannot produce semantically related sequences in the general case.

In order to improve applicability or universality of a reading and sorting method, the embodiment of the application provides a method for reading and sorting characters based on semantics.

In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present application with reference to the accompanying drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Furthermore, some of the flows described in the specification, the claims, and the above drawings include a plurality of operations that appear in a particular order, but these operations may be performed out of the order in which they appear or in parallel. The sequence numbers of operations, such as 101 and 102, are merely used to distinguish the operations from one another and do not themselves represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that the terms "first" and "second" herein are used to distinguish different messages, devices, modules, etc.; they do not represent a sequence, nor do they require that the "first" and the "second" be of different types.

Fig. 1c is a schematic flow chart of an image processing method according to an embodiment of the application. The execution subject of the method can be a client or a server. The client may be hardware integrated on the terminal and provided with an embedded program, or may be an application software installed in the terminal, or may be a tool software embedded in an operating system of the terminal, which is not limited in the embodiment of the present application. The terminal can be any terminal equipment including a mobile phone, a tablet personal computer, an intelligent sound box and the like. The server may be a common server, a cloud end, a virtual server, or the like, which is not particularly limited in the embodiment of the present application.

As shown in FIG. 1c, the method comprises:

101. Identify a plurality of text information to be ordered contained in the image to be identified.

102. Determine the reading order of the plurality of text information to be ordered according to the features corresponding to the plurality of text information to be ordered.

103. Sort the plurality of text information to be ordered according to the reading order to obtain a text information sequence to be ordered.

In the above 101, the image to be identified refers to an image containing text information, such as a network advertisement diagram, an e-commerce description diagram, an invoice diagram, a certificate diagram, a bank card diagram, and the like.

The image to be identified may be processed with an image text recognition algorithm, such as OCR (Optical Character Recognition), to obtain the plurality of text information to be ordered contained in it. For the specific implementation of the image text recognition algorithm, reference may be made to the prior art, and details are not described herein. In general, recognizing the image with an image text recognition algorithm yields not only the plurality of text information to be ordered contained in the image, but also the position of each piece of text information in the image.

Here, the text information to be ordered refers to single characters or words. For example, "I" is a single character and "we" is a word. In practical applications, a piece of text information to be ordered usually refers to a single character.

In 102, the features carry semantic features. A natural language processing algorithm can be adopted to extract semantic features corresponding to the identified text information to be sequenced. The process of extracting semantic features can be found in the prior art, and will not be described in detail herein. In an example, the semantic features corresponding to each text message to be ordered may be used as the features corresponding to each text message to be ordered.

In another example, the features may also carry visual features. Combining semantic features with visual features can effectively improve the sorting accuracy. For example, suppose the text information recognized in the image to be identified comprises "I", a plural suffix, and "you". Judged on semantics alone, the suffix could follow either "I" or "you" (forming "we" or the plural "you"), so its position cannot be settled. If the visual features additionally show that the font of the suffix is inconsistent with one of the two characters and consistent with the other, the suffix can be placed after the character whose font it matches.

It should be noted that the features involved in the embodiments of the present application may be in vector form.

In an implementation scheme, the reading sequence of the plurality of text information to be ordered may be determined according to the semantic features and the preset syntax corresponding to the plurality of text information to be ordered.

In 103, sorting the text information to be ordered according to the reading order yields a text information sequence that is semantically coherent.
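Step 103 can be illustrated with a minimal Python sketch. It assumes, beyond what the source states, that the reading order determined in step 102 is available as a list of indices into the recognized text information; the names `sort_by_reading_order`, `recognized` and `reading_order` are hypothetical.

```python
from typing import List


def sort_by_reading_order(texts: List[str], reading_order: List[int]) -> List[str]:
    """Rearrange `texts` so that texts[reading_order[k]] is read k-th.

    Assumption: `reading_order` is a permutation of the indices of `texts`.
    """
    if sorted(reading_order) != list(range(len(texts))):
        raise ValueError("reading_order must be a permutation of the text indices")
    return [texts[i] for i in reading_order]


# Hypothetical recognition result, listed in the order the recognizer found it.
recognized = ["world", "hello", "!"]
# Hypothetical reading order produced by step 102.
reading_order = [1, 0, 2]
sequence = sort_by_reading_order(recognized, reading_order)
```

The permutation check guards against a reading order that drops or repeats a piece of text information.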

In the image to be identified, certain association relationships, such as semantic, visual and/or positional associations, are hidden among the plurality of text information to be ordered. These relationships are very important information, and using them can effectively improve the sorting accuracy. Therefore, in an example, in the above 102, "determining the reading order of the plurality of text information to be ordered according to the features corresponding to the plurality of text information to be ordered" may be implemented by the following steps:

1021. Determine the adjacency relationship among the plurality of text information to be ordered.

1022. Combine the features corresponding to the plurality of text information to be ordered with the adjacency relationship among the plurality of text information to be ordered to determine the reading order of the plurality of text information to be ordered.

In 1021, the adjacency relationship between the plurality of text messages to be ordered can indicate the association relationship between each text message to be ordered and other text messages to be ordered.

The adjacency relationship among the plurality of text messages to be ordered can be determined by one or more of the following methods:

Method one: for each piece of text information to be ordered, search the image to be identified for text information to be ordered that lies within a set range centered on the position of that piece in the image. The piece is determined to be adjacent to the text information to be ordered inside the set range, and not adjacent to the text information to be ordered outside the set range.

In method one, the adjacency relationship is specifically a position-related adjacency relationship.
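The position-based method can be sketched in Python as follows. The sketch assumes, beyond what the source states, that each piece of text information is represented by the center point of its bounding box and that the set range is a circle of radius `radius` measured by Euclidean distance; the name `positional_adjacency` is hypothetical.

```python
from math import hypot
from typing import List, Tuple


def positional_adjacency(centers: List[Tuple[float, float]],
                         radius: float) -> List[List[bool]]:
    """adj[i][j] is True when item j lies within `radius` of item i (i != j)."""
    n = len(centers)
    adj = [[False] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(centers):
        for j, (xj, yj) in enumerate(centers):
            # Items inside the set range are adjacent, items outside are not.
            if i != j and hypot(xi - xj, yi - yj) <= radius:
                adj[i][j] = True
    return adj


# Hypothetical centers: two characters close together, one far away.
centers = [(0.0, 0.0), (10.0, 0.0), (100.0, 0.0)]
adj = positional_adjacency(centers, radius=15.0)
```

With a circular range the resulting relation is symmetric; a rectangular or direction-dependent range would be an equally valid reading of "set range".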

Method two: determine the adjacency relationship among the plurality of text information to be ordered according to the features corresponding to the plurality of text information to be ordered.

In the second method, the method can be realized by the following steps:

S11, calculating the correlation between every two pieces of text information to be ordered according to the features corresponding to every two pieces of text information to be ordered among the plurality of text information to be ordered.

S12, determining whether the two text messages to be ordered are adjacent or not according to the correlation between the two text messages to be ordered.

In the above S11, in one possible scheme, the similarity between the features corresponding to each two pieces of text information to be ordered may be directly calculated, and the similarity between the features corresponding to each two pieces of text information to be ordered is used as the correlation between each two pieces of text information to be ordered. In practical application, the features can be in vector form, and the inner product between the features corresponding to each two text messages to be ordered can be used as the similarity.

In S12, a correlation threshold may be set in advance, and if the correlation between every two text messages to be ordered is greater than or equal to the correlation threshold, the two text messages to be ordered are determined to be adjacent to each other.
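Steps S11 and S12 can be sketched in Python as follows, using the inner product as the similarity as described above. The feature values, the threshold of 0.5 and the name `feature_adjacency` are illustrative assumptions.

```python
from typing import List


def inner_product(u: List[float], v: List[float]) -> float:
    """Inner product of two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def feature_adjacency(features: List[List[float]],
                      threshold: float) -> List[List[bool]]:
    """S11: correlate every pair; S12: adjacent when correlation >= threshold."""
    n = len(features)
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if inner_product(features[i], features[j]) >= threshold:
                adj[i][j] = adj[j][i] = True
    return adj


# Hypothetical 2-dimensional features for three pieces of text information.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
adj = feature_adjacency(features, threshold=0.5)
```

Here only the first two pieces are adjacent, because their inner product (0.9) is the only one that reaches the threshold.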

The features carry semantic features, so that the adjacency relationship is specifically an adjacency relationship related to semantics.

In another example, visual features may also be carried in the features, so adjacencies are specifically semantic and visual adjacencies.

When the features further carry visual features, in order to improve the accuracy of the correlation calculation, the step S11 of "calculating the correlation between every two pieces of text information to be ordered according to the features corresponding to every two pieces of text information to be ordered among the plurality of text information to be ordered" may specifically be implemented by the following steps:

A. Calculate a first similarity between the semantic features corresponding to every two pieces of text information to be ordered.

B. Calculate a second similarity between the visual features corresponding to every two pieces of text information to be ordered.

C. Combine the first similarity and the second similarity to determine the correlation between every two pieces of text information to be ordered.

In step A, the semantic features are specifically in vector form, and the inner product between the semantic features can be used as the first similarity.

In step B, the visual features are specifically in vector form, and the inner product between the visual features can be used as the second similarity.

In the above step C, in an implementation manner, a sum of the first similarity and the second similarity may be used as the correlation between the two text messages to be sorted.

In another implementation scheme, the first similarity and the second similarity can be weighted and summed to obtain the correlation of the text information to be ranked every two. The weight corresponding to the first similarity and the weight of the second similarity may be set according to actual needs, for example, may be determined by combining a priori experience, which is not particularly limited in the embodiment of the present application.
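Steps A to C can be sketched in Python as follows. The weights `w_sem` and `w_vis` are hypothetical parameters; setting both to 1.0 gives the plain sum of the two similarities, and other values give the weighted sum. The feature values are illustrative only.

```python
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    """Inner product of two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def correlation(sem_a: List[float], sem_b: List[float],
                vis_a: List[float], vis_b: List[float],
                w_sem: float = 1.0, w_vis: float = 1.0) -> float:
    """Combine semantic and visual similarity into one correlation value."""
    first_similarity = dot(sem_a, sem_b)    # step A
    second_similarity = dot(vis_a, vis_b)   # step B
    return w_sem * first_similarity + w_vis * second_similarity  # step C


# Hypothetical features of two pieces of text information.
sem_a, sem_b = [1.0, 0.0], [0.5, 0.5]   # first similarity: 0.5
vis_a, vis_b = [0.0, 2.0], [1.0, 1.0]   # second similarity: 2.0

plain_sum = correlation(sem_a, sem_b, vis_a, vis_b)
weighted = correlation(sem_a, sem_b, vis_a, vis_b, w_sem=0.5, w_vis=0.25)
```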

In 1021, the adjacency relationship between the plurality of text messages to be ordered indicates whether each two text messages to be ordered in the plurality of text messages to be ordered are adjacent.

In the foregoing 1022, the feature corresponding to a piece of text information to be ordered contains only information about that text information itself and no information about its surroundings (e.g., information about other text information to be ordered adjacent to it). That is, the feature is not expressive or comprehensive enough. To improve the feature expression, the features corresponding to the plurality of text information to be ordered and the adjacency relationship among them can be combined to update those features, yielding updated features corresponding to the plurality of text information to be ordered. The reading order of the plurality of text information to be ordered is then determined according to the updated features. Each updated feature contains not only information about its own text information but also information about the adjacent text information to be ordered; it is more abstract and more expressive, which improves the sorting accuracy.

In an example, in the foregoing 1022, "the reading order of the plurality of text information to be ordered is determined by integrating the features corresponding to each of the plurality of text information to be ordered and the adjacency relationship between the plurality of text information to be ordered", specifically, the method may be implemented by the following steps:

S21, constructing a graph structure with nodes and edges according to the adjacent relation among the plurality of text information to be ordered.

The nodes in the graph structure are used for representing text information to be sequenced, and the edges in the graph structure are used for representing whether the nodes are adjacent or not.

S22, taking the characteristics corresponding to the plurality of text information to be ordered and the graph structure as the input of a trained graph convolution neural network model, and executing the graph convolution neural network model to obtain the reading sequence of the plurality of text information to be ordered.

In S21, an edge between two nodes indicates that the two nodes are adjacent. The graph structure may be represented by an adjacency matrix, and may also be referred to as a topology.
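As an illustration of S21, the following Python sketch builds an adjacency matrix from a list of adjacent pairs. It assumes that adjacency is symmetric (an undirected graph) and that nodes are identified by integer indices; the name `build_adjacency_matrix` is hypothetical.

```python
from typing import Iterable, List, Tuple


def build_adjacency_matrix(num_nodes: int,
                           adjacent_pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Return a matrix whose entry [i][j] is 1 when nodes i and j share an edge."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in adjacent_pairs:
        if i == j:
            continue  # assumption: a node is not recorded as adjacent to itself
        matrix[i][j] = 1
        matrix[j][i] = 1  # assumption: adjacency is symmetric
    return matrix


# Four pieces of text information: 0-1 and 1-2 are adjacent, 3 is isolated.
A = build_adjacency_matrix(4, [(0, 1), (1, 2)])
```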

In the above step S22, the graph convolution neural network model can well extract features, and can effectively improve the ordering accuracy.

The graph convolution neural network model is specifically used for:

S31, according to the characteristics corresponding to the plurality of text information to be ordered and the graph structure, updated characteristics corresponding to the plurality of text information to be ordered are obtained through graph convolution operation.

S32, determining the reading sequence of the plurality of text information to be ordered according to the updated characteristics corresponding to the plurality of text information to be ordered.

In S31, the graph structure information may be embedded, through the graph convolution operation, into the features corresponding to the plurality of text information to be ordered, so as to obtain the updated features corresponding to the plurality of text information to be ordered. To obtain higher-dimensional features, the graph convolution operation may be performed multiple times.

Specifically, the graph convolution neural network model may include a feature extraction sub-network, which may include a plurality of graph convolution network layers, each of which performs one graph convolution operation. The features corresponding to the plurality of text information to be ordered and the adjacency matrix representing the graph structure serve as the input of the first of the graph convolution network layers; the features output by each preceding graph convolution network layer serve as the input of the following graph convolution network layer; and the features output by the last graph convolution network layer serve as the updated features corresponding to the plurality of text information to be ordered. It should be noted that the features output by each graph convolution network layer differ from the features input to that layer: they are more abstract and of higher dimension.

It should be noted that each graph convolution network layer includes a trained feature extraction parameter matrix, and each graph convolution network layer uses its feature extraction parameter matrix when performing the graph convolution operation. The specific implementation of the graph convolution operation may be designed according to actual needs, and this embodiment does not specifically limit it.

In one implementation, the plurality of text information to be ordered includes a first text information to be ordered, which refers to any one of the plurality of text information to be ordered. Each graph convolution network layer is specifically used for: determining, according to the first text information to be ordered and the adjacency matrix, at least one second text information to be ordered that is adjacent to the first text information to be ordered; splicing the feature corresponding to the first text information to be ordered with the feature corresponding to each of the at least one second text information to be ordered, to obtain at least one first splicing feature; combining the at least one first splicing feature into a splicing feature matrix; multiplying the splicing feature matrix by the trained feature extraction parameter matrix to obtain a first matrix; and pooling the first matrix to obtain the feature corresponding to the first text information to be ordered that the layer outputs. The pooling may specifically be average pooling or maximum pooling.

For example, if the input feature corresponding to the first text information to be ordered is an h-dimensional vector and the feature corresponding to a second text information to be ordered is a j-dimensional vector, the first splicing feature is an (h+j)-dimensional vector.
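The layer described above (splice, multiply by the parameter matrix, pool) can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the parameter matrix `weight` is supplied by hand rather than trained, a node without neighbors is spliced with its own feature so that the output dimension stays uniform (the source does not address this case), and the name `graph_conv_layer` is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (k x p) and b (p x d)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv_layer(features: Matrix, adjacency: List[List[int]],
                     weight: Matrix, pooling: str = "mean") -> Matrix:
    """One graph convolution layer: splice with neighbors, project, pool."""
    updated = []
    for i, feat_i in enumerate(features):
        neighbors = [j for j, flag in enumerate(adjacency[i]) if flag and j != i]
        if not neighbors:
            neighbors = [i]  # assumption: an isolated node is spliced with itself
        # Each row is a first splicing feature of dimension h + j.
        spliced = [feat_i + features[j] for j in neighbors]
        first_matrix = matmul(spliced, weight)  # shape: (number of neighbors) x d
        columns = list(zip(*first_matrix))
        if pooling == "mean":
            updated.append([sum(c) / len(c) for c in columns])
        elif pooling == "max":
            updated.append([max(c) for c in columns])
        else:
            raise ValueError("pooling must be 'mean' or 'max'")
    return updated


# Three nodes on a path 0-1-2, each with a 1-dimensional input feature.
features = [[1.0], [2.0], [3.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # hand-picked 2 x 2 matrix standing in for trained parameters

mean_out = graph_conv_layer(features, adjacency, weight, pooling="mean")
max_out = graph_conv_layer(features, adjacency, weight, pooling="max")
```

Stacking several such calls, each with its own parameter matrix, corresponds to the multi-layer feature extraction sub-network; in this toy example the output features are 2-dimensional although the inputs are 1-dimensional.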

In an implementation scheme, in the step S32, "determining the reading order of the plurality of text information to be ordered according to the updated features corresponding to the plurality of text information to be ordered", the method specifically may be implemented by the following steps:

S321, integrating updated characteristics corresponding to the text information to be sequenced respectively, and calculating global text information characteristics serving as initial reference characteristics.

S322, calculating the attention weight corresponding to the at least one text message to be ordered which is not yet output in the plurality of text messages to be ordered according to the reference feature and the updated feature corresponding to the at least one text message to be ordered which is not yet output in the plurality of text messages to be ordered.

S323, outputting the text information to be ordered corresponding to the maximum attention weight, taking the updated feature corresponding to the text information to be ordered corresponding to the maximum attention weight as a new reference feature, and continuing to execute the attention weight calculation step until all the text information to be ordered are output.

S324, determining the output sequence of the plurality of text messages to be ordered as the reading sequence of the plurality of text messages to be ordered.

In practical applications, the graph convolution neural network model may also include an attention sub-network. Steps S321 and S322 described above are performed by the attention sub-network.

In S321, in an example, the updated features corresponding to the plurality of text information to be ordered may be pooled to obtain the global text information feature. The pooling may specifically be average pooling or maximum pooling. The global text information feature fuses the features of the plurality of text information to be ordered.

Determining the global text information feature makes it easier to find the text information that ranks first in the text information sequence.

In another example, a graph convolution operation can be used to further update the updated features corresponding to the plurality of text information to be ordered, and the further updated features are then pooled to obtain the global text information feature. That is, the attention sub-network comprises a graph convolution network layer, which may in particular be a fully connected network layer, and a pooling layer.

In S322, the at least one text information to be ordered that has not yet been output includes a third text information to be ordered, which refers to any one of the at least one text information to be ordered that has not yet been output. Taking features in vector form as an example, the reference feature and the updated feature corresponding to the third text information to be ordered can be spliced to obtain a second splicing feature, and the dot product of the second splicing feature and an attention parameter vector gives the attention weight corresponding to the third text information to be ordered.

For example, if the reference feature is an n-dimensional vector and the updated feature corresponding to the third text information to be ordered is an m-dimensional vector, the second splicing feature is an (n+m)-dimensional vector.

In S323, the text information to be sorted corresponding to the maximum attention weight is output, and the updated feature corresponding to the text information to be sorted corresponding to the maximum attention weight is used as the new reference feature.

If the plurality of text information to be ordered is not all output, the attention weight of the text information to be ordered which is not output currently is calculated continuously based on the new reference characteristics.

And stopping the attention weight calculation step if the plurality of text information to be sequenced are all output.

In S324, the output sequence of the plurality of text messages to be ordered is the reading sequence of the plurality of text messages to be ordered.
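Steps S321 to S324 can be sketched in Python as follows. The sketch assumes average pooling for the initial reference feature, uses the raw dot product as the attention weight without any normalization, and breaks ties by taking the lowest index; the attention parameter vector is hand-picked rather than trained, and the name `decode_reading_order` is hypothetical.

```python
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    """Dot product of two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average pooling over a list of equally sized vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]


def decode_reading_order(updated_features: List[List[float]],
                         attention_params: List[float]) -> List[int]:
    """Output indices one by one, always picking the largest attention weight."""
    if not updated_features:
        return []
    reference = mean_pool(updated_features)            # S321: global feature
    remaining = list(range(len(updated_features)))
    order: List[int] = []
    while remaining:
        # S322: splice reference and candidate feature, dot with the parameters.
        weights = {i: dot(reference + updated_features[i], attention_params)
                   for i in remaining}
        best = max(remaining, key=lambda i: weights[i])  # ties: lowest index
        order.append(best)                               # S323: output it
        remaining.remove(best)
        reference = updated_features[best]               # new reference feature
    return order                                         # S324: reading order


# Hypothetical updated features (m = 2) and attention parameters (n + m = 4).
updated_features = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
attention_params = [0.5, 0.5, 1.0, -1.0]
order = decode_reading_order(updated_features, attention_params)
```

In this toy run the second piece is output first, then the third, then the first.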

In the above embodiment, it is assumed by default that exactly one piece of text information to be ordered corresponds to the maximum attention weight. When several pieces correspond to the maximum attention weight, multiple reading orders arise, and the graph convolution neural network model can determine multiple reading orders of the plurality of text information to be ordered. The plurality of text information to be ordered can then be sorted according to each of the reading orders to obtain a plurality of text information sequences to be ordered. The method can further comprise displaying the plurality of text information sequences to be ordered on a user interface for selection by a user. In addition, the model can be optimized according to the target text information sequence selected by the user. Specifically, the model can be trained once more using the image to be identified and the target text information sequence, thereby optimizing the model.

The training process of the graph convolutional neural network model is as follows:

104. Acquiring a sample image and the expected text information sequence corresponding to a plurality of sample text information contained in the sample image.

105. Optimizing the graph convolutional neural network model according to the sample features corresponding to each of the plurality of sample text information, the graph structure corresponding to the plurality of sample text information, and the expected text information sequence.

In 105, the sample characteristics corresponding to the plurality of sample text information and the graph structure corresponding to the plurality of sample text information may be input into the graph convolutional neural network model, a predicted reading sequence corresponding to the plurality of sample text information may be determined, the plurality of sample text information may be ordered according to the predicted reading sequence to obtain a predicted text information sequence, and the graph convolutional neural network model may be parameter optimized according to a difference between the predicted text information sequence and the expected text information sequence. The specific parameter optimization process can be found in the prior art, and is not described in detail herein.
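The embodiment leaves the measure of "difference" to the prior art. One common choice, shown here purely as an assumption, is the negative log-likelihood of the expected item at each output step under the attention weights produced by the model. The function name `sequence_nll` and the dictionary layout of the per-step weights are hypothetical.

```python
import math

def sequence_nll(step_distributions, expected_order):
    """Sum of negative log attention weights given to the expected item at each step.

    step_distributions[t] maps item index -> attention weight at output step t.
    expected_order[t] is the index of the item that should be output at step t.
    """
    loss = 0.0
    for dist, target in zip(step_distributions, expected_order):
        # Clamp to avoid log(0) when the model assigns zero weight to the target.
        loss -= math.log(max(dist[target], 1e-12))
    return loss
```

A loss of zero means the model put all attention on the expected item at every step; larger values indicate a larger gap between the predicted and expected sequences, which a parameter optimizer would then reduce.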

In practical application, the method may further include:

106. Extracting the semantic features corresponding to each of the identified plurality of text information to be ordered.

107. Extracting the visual features corresponding to each of the plurality of text information to be ordered.

108. Fusing the semantic features and visual features corresponding to each of the plurality of text information to be ordered to obtain the features corresponding to each of the plurality of text information to be ordered.

In 106, a natural language processing algorithm may be used to extract semantic features corresponding to the identified text information to be ranked.

In one possible implementation manner, the "extracting the visual features corresponding to each of the plurality of text information to be ordered" in the above 107 may be implemented by:

1071. And determining the sub-image area where the plurality of text information to be ordered are respectively located from the image to be identified according to the positions of the plurality of text information to be ordered in the image to be identified.

1072. And extracting visual features corresponding to the plurality of text information to be sequenced respectively according to the sub-image areas where the plurality of text information to be sequenced are respectively.

In 1071, an image text recognition technique may be used to recognize the plurality of text information to be ordered and the position of each piece of text information to be ordered in the image to be identified.

In an example, the sub-image area in which the text information to be ordered is located may specifically be a compact rectangular frame area surrounding the text information to be ordered.
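As a small illustration of such a compact rectangular region, the sketch below computes an axis-aligned box from the corner points of a recognized text item and crops it from the image. The representation of the image as a list of pixel rows and the names `tight_box` and `crop` are assumptions for the example.

```python
def tight_box(points):
    """Axis-aligned rectangle (left, top, right, bottom) enclosing the given (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

def crop(image, box):
    """Crop a row-major image (list of rows) to the box, including its right and bottom edges."""
    left, top, right, bottom = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

For instance, a text item whose corners are (1, 1), (2, 1), (2, 2) and (1, 2) yields the box (1, 1, 2, 2), and `crop` returns the 2x2 block of pixels inside it.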

In 1072, the visual characteristics may include information such as fonts, font colors, and background textures.

In practice, conventional feature extraction algorithms, such as SIFT (Scale-Invariant Feature Transform), may be used to extract the visual features.

The visual features extracted by traditional feature extraction algorithms are low-dimensional rather than high-dimensional information, so their expressive power is limited. To improve the expressiveness of the visual features, in one example a trained convolutional neural network may be used to extract them: the sub-image regions where the text information to be ordered are respectively located are each input into the trained convolutional neural network to obtain the corresponding visual features. The specific implementation and training process of the convolutional neural network can be found in the prior art and are not described here.

In 108, the plurality of text information to be ordered includes the first text information to be ordered, and the semantic features and the visual features corresponding to the first text information to be ordered may be spliced to obtain the features corresponding to the first text information to be ordered.
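The splicing in 108 amounts to vector concatenation. A minimal sketch, with hypothetical function names `fuse_features` and `fuse_all`:

```python
def fuse_features(semantic, visual):
    """Concatenate one item's semantic and visual feature vectors."""
    return list(semantic) + list(visual)

def fuse_all(semantic_list, visual_list):
    """Fuse features item by item; both lists must describe the same text items in the same order."""
    if len(semantic_list) != len(visual_list):
        raise ValueError("each text item needs one semantic and one visual vector")
    return [fuse_features(s, v) for s, v in zip(semantic_list, visual_list)]
```

If the semantic vector has p dimensions and the visual vector has q dimensions, the fused feature has p+q dimensions.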

In practical application, a plurality of text areas are usually arranged in the image to be identified, the text areas are far apart, at the moment, the text areas of the image to be identified can be divided, and then the text in each text area is ordered, so that the difficulty of subsequent ordering can be reduced, and the ordering accuracy can be improved. Thus, in one example, the above method may further comprise:

109. Identifying a plurality of text information contained in the image to be identified and the position of each piece of text information in the image to be identified.

110. Dividing the plurality of text information using a clustering algorithm according to the position of each piece of text information in the image to be identified, to obtain at least one text information cluster.

111. Selecting the plurality of text information in one of the at least one text information cluster as the plurality of text information to be ordered.

Step 109 may be implemented with an OCR algorithm; for the specific implementation, refer to the corresponding content in the above embodiments, which is not repeated here.

In 110, a hierarchical clustering algorithm may be specifically adopted as the clustering algorithm. The distance between the text information in the same text information cluster is smaller than the distance between the text information in different text information clusters. The specific implementation process of the clustering algorithm can be referred to the prior art, and will not be described herein.

In the above 111, a plurality of text information in one text information cluster is selected from the at least one text information cluster as the plurality of text information to be ordered.

In practical application, the method can sort the text information in each text information cluster.
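The division of text into clusters by position can be sketched with a simple agglomerative (single-linkage) procedure. The embodiment only names hierarchical clustering; the linkage criterion, the stopping threshold `max_gap`, and the use of one center point per text item are assumptions made for this example.

```python
import math

def single_linkage_clusters(centers, max_gap):
    """Agglomerative clustering of 2-D points with single linkage.

    Clusters keep merging while the closest pair of clusters is at most `max_gap` apart.
    Returns a sorted list of clusters, each a sorted list of point indices.
    """
    clusters = [[i] for i in range(len(centers))]

    def gap(a, b):
        # Single linkage: distance between the two closest members.
        return min(math.dist(centers[i], centers[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = gap(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > max_gap:
            break  # remaining clusters are far apart: separate text areas
        _, x, y = best
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return sorted(clusters)
```

Each resulting cluster can then be passed to the ordering step on its own, so that distant text areas are never interleaved.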

An example of an image recognition method according to an embodiment of the present application will be described with reference to fig. 1 a:

Step 1: Perform character recognition on the image to be recognized, and recognize that it contains the three characters 'year', 'festival' and 'goods'.

Step 2: Extract the semantic features corresponding to the three characters 'year', 'festival' and 'goods' using a natural language processing algorithm.

Step 3: Sort the characters according to their corresponding semantic features to obtain the character sequence 'annual goods festival'.

Step 4: Output the character sequence to an interface for display.

An image recognition method according to a further embodiment of the present application will be described by way of example with reference to fig. 1 b:

Step a: Perform character recognition on the image to be recognized, and recognize that it contains the three characters 'year', 'festival' and 'goods' together with their positions in the image to be recognized. Extract the sub-image (that is, the sub-image region) where each character is located according to the position of each character in the image to be recognized.

Step b: Extract the semantic features corresponding to the three characters 'year', 'festival' and 'goods' using a natural language processing algorithm, and extract the visual features of the sub-image where each character is located using a convolutional neural network (CNN) to obtain the visual features corresponding to each character.

Step c: Calculate the correlation between every two characters according to the similarity of their visual features and the similarity of their semantic features, and construct a graph structure based on the correlation.

Step d: Splice the semantic features and visual features corresponding to each character to obtain the features corresponding to each character, input the graph structure and the features corresponding to each character into the trained graph convolutional neural network model, and execute the model to obtain the reading order of the three characters.

Step e: Sort the three characters according to their reading order to obtain the character sequence 'annual goods festival', and output the character sequence to an interface for display.
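The graph construction based on correlation can be sketched as below. Cosine similarity, the weighted average controlled by `alpha`, and the adjacency `threshold` are assumptions chosen for illustration; the example only needs some way to combine the two similarities and decide adjacency from the result.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_adjacency(semantic, visual, alpha=0.5, threshold=0.5):
    """Return a symmetric 0/1 adjacency matrix over the text items (nodes)."""
    n = len(semantic)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Combine semantic and visual similarity into one correlation value.
            corr = alpha * cosine(semantic[i], semantic[j]) \
                + (1 - alpha) * cosine(visual[i], visual[j])
            if corr >= threshold:
                adj[i][j] = adj[j][i] = 1  # edge: the two items are treated as adjacent
    return adj
```

The resulting matrix, together with the spliced per-character features, is the kind of input that a graph convolutional model consumes.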

The proposed method does not depend on templates and can produce a definite order under any layout, so it has a wide range of application. On the test set, the evaluation metrics for both common horizontal text and column-layout text exceed 80%.

Yet another embodiment of the present application provides an image recognition method, including:

501. Identifying a plurality of text information to be ordered contained in the image to be identified.

502. Determining the text type of the plurality of text information to be ordered.

503. Obtaining an arrangement rule corresponding to the text type.

504. Sorting the plurality of text information to be ordered according to the arrangement rule to obtain a text information sequence.

The specific implementation of 501 may be referred to the corresponding content in the above embodiments, and will not be described herein.

In 502, the arrangement rules corresponding to different text types are generally different. For example, ancient text in the image to be identified is typically arranged from top to bottom and from right to left, whereas modern text in the image to be identified is typically arranged from left to right and from top to bottom.

In an example, the text types may include ancient text types and modern text types.

In 503, the arrangement rule corresponding to the text type may be obtained according to the corresponding relationship between the text type and the arrangement rule established in advance. The arrangement rules from top to bottom and from right to left can be configured in advance for ancient text types, and from left to right and from top to bottom for modern text types.

In the above 504, the plurality of text information to be ordered may be ordered according to the arrangement rule and the positions of the plurality of text information to be ordered in the image to be processed, so as to obtain a text information sequence to be ordered.
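A minimal sketch of steps 503 and 504 follows. The type names `"ancient"` and `"modern"`, the dictionary `ARRANGEMENT_RULES`, and the item layout with `x`/`y` column and row coordinates are hypothetical; the sketch also assumes the items are already aligned to a grid, whereas real positions would first need to be grouped into rows or columns.

```python
# Pre-established correspondence between text type and arrangement rule.
ARRANGEMENT_RULES = {
    # Columns from right to left, each column read from top to bottom.
    "ancient": lambda item: (-item["x"], item["y"]),
    # Rows from top to bottom, each row read from left to right.
    "modern": lambda item: (item["y"], item["x"]),
}

def sort_by_rule(items, text_type):
    """Sort text items by the arrangement rule registered for the given text type."""
    return sorted(items, key=ARRANGEMENT_RULES[text_type])
```

With four items on a 2x2 grid, the same positions therefore produce two different sequences depending on the text type.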

In this embodiment, the text information is ordered according to different arrangement rules for different text types, so that the applicability and accuracy of the ordering scheme can be effectively improved.

It should be noted that, in the method provided in the embodiment of the present application, details of each step may be referred to corresponding details in the above embodiment, which are not described herein. In addition, the method provided in the embodiment of the present application may further include other part or all of the steps in the above embodiments, and specific reference may be made to the corresponding content of the above embodiments, which is not repeated herein.

Fig. 2 is a flow chart illustrating a text sorting method according to another embodiment of the present application. The execution subject of the method can be a client or a server. The client may be hardware integrated on the terminal and provided with an embedded program, or may be an application software installed in the terminal, or may be a tool software embedded in an operating system of the terminal, which is not limited in the embodiment of the present application. The terminal can be any terminal equipment including a mobile phone, a tablet personal computer, an intelligent sound box and the like. The server may be a common server, a cloud end, a virtual server, or the like, which is not particularly limited in the embodiment of the present application.

As shown in fig. 2, the method includes:

201. Acquiring a plurality of text information to be ordered.

202. Determining the reading order of the plurality of text information to be ordered by jointly considering the features corresponding to each of the plurality of text information to be ordered and the adjacency relations among the plurality of text information to be ordered.

203. Sorting the plurality of text information to be ordered according to the reading order to obtain a first text information sequence.

In 201, the plurality of text information to be ordered may be identified from an image to be identified, or may be input by a user.

For example, a text ordering function can be embedded in a home tutoring device for primary school pupils. When a pupil encounters a "rearrange the words into a sentence" exercise, the pupil can enter the words from the exercise, that is, the plurality of text information to be ordered, into the device.

In 202, the adjacencies between the text information to be ordered may be semantic adjacencies, or may be adjacencies in other aspects, which is not limited in particular in the embodiments of the present application.

The specific implementation process of the foregoing 202 and 203 may be referred to the corresponding content in the foregoing embodiments, which is not repeated herein.

Optionally, the method may further include:

204. Determining the adjacency relations among the plurality of text information to be ordered according to the features corresponding to each of the plurality of text information to be ordered.

The specific implementation of 204 may be referred to the corresponding content in the above embodiments, and will not be described herein.

Optionally, in the foregoing 202, "the reading sequence of the plurality of text information to be ordered is determined by integrating the features corresponding to each of the plurality of text information to be ordered and the adjacency relations between the plurality of text information to be ordered", which may be implemented specifically by the following steps:

2021. And constructing a graph structure with nodes and edges according to the adjacent relation among the plurality of text information to be ordered.

2022. And taking the characteristics corresponding to the plurality of text information to be ordered and the graph structure as the input of a trained graph convolution neural network model, and executing the graph convolution neural network model to obtain the reading sequence of the plurality of text information to be ordered.

The specific implementation process of 2021 and 2022 may be referred to the corresponding content in each embodiment, which is not described herein.
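For orientation, one graph convolution step over the graph structure and node features can be sketched as follows. The symmetric normalization with self-loops and the ReLU activation follow a widely used formulation of graph convolution and are assumptions here, as are the function name `gcn_layer` and the weight matrix `weight`; the trained model of the embodiments may differ.

```python
import math

def gcn_layer(adjacency, features, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 . H . W).

    adjacency: n x n 0/1 matrix, features: n x d matrix, weight: d x k matrix.
    """
    n = len(adjacency)
    # Add self-loops so each node keeps its own feature.
    a_hat = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    d = len(features[0])
    # Aggregate features from adjacent nodes.
    agg = [[sum(norm[i][k] * features[k][c] for k in range(n)) for c in range(d)]
           for i in range(n)]
    k_out = len(weight[0])
    # Linear transform followed by ReLU gives the updated node features.
    return [[max(0.0, sum(agg[i][c] * weight[c][o] for c in range(d))) for o in range(k_out)]
            for i in range(n)]
```

The rows of the returned matrix correspond to the updated features of the text items, from which the reading order is then determined.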

Fig. 3 is a block diagram showing an image recognition apparatus according to an embodiment of the present application. As shown in fig. 3, the apparatus includes:

The first identifying module 301 is configured to identify, from the image to be identified, a plurality of text information to be ordered contained therein.

The first determining module 302 is configured to determine a reading order of the plurality of text messages to be ordered according to the features corresponding to the plurality of text messages to be ordered.

Wherein the features carry semantic features.

The first sorting module 303 is configured to sort the plurality of text messages to be sorted according to the reading order, so as to obtain a text message sequence to be sorted.

Optionally, the apparatus may further include:

The first acquisition module is used for acquiring a sample image and expected text information sequences corresponding to a plurality of sample text information contained in the sample image;

And the first optimizing module is used for optimizing the graph convolutional neural network model according to the sample features corresponding to each of the plurality of sample text information, the graph structure corresponding to the plurality of sample text information, and the expected text information sequence.

Optionally, the apparatus may further include:

The first extraction module is used for extracting semantic features corresponding to the identified text information to be sequenced respectively and extracting visual features corresponding to the text information to be sequenced respectively;

And the first fusion module is used for fusing the semantic features and the visual features corresponding to the plurality of text information to be sequenced to obtain the features corresponding to the plurality of text information to be sequenced.

It should be noted that, the image recognition device provided in the foregoing embodiments may implement the technical solutions described in the foregoing method embodiments, and the specific implementation principles of the foregoing modules or units may be referred to the corresponding content in the foregoing method embodiments, which is not repeated herein.

Fig. 4 is a block diagram of a text sorting apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus includes:

A second obtaining module 401, configured to obtain a plurality of text information to be ordered;

A second determining module 402, configured to synthesize features corresponding to each of the plurality of text information to be ordered and an adjacency relationship between the plurality of text information to be ordered, and determine a reading sequence of the plurality of text information to be ordered;

A second sorting module 403, configured to sort the plurality of text information to be ordered according to the reading order, so as to obtain a first text information sequence.

Optionally, the apparatus may further include:

And the third determining module is used for determining the adjacency relations among the plurality of text information to be ordered according to the features corresponding to each of the plurality of text information to be ordered.

It should be noted that, the text sorting device provided in the foregoing embodiments may implement the technical solutions described in the foregoing method embodiments, and the specific implementation principles of the foregoing modules or units may refer to corresponding contents in the foregoing method embodiments, which are not repeated herein.

Still another embodiment of the present application provides an image recognition apparatus including:

The second recognition module is used for recognizing a plurality of text information to be ordered contained in the image to be identified.

And the fourth determining module is used for determining the text types of the plurality of text information to be sequenced.

And the third acquisition module is used for acquiring the arrangement rule corresponding to the text type.

And the third ordering module is used for ordering the plurality of word information to be ordered according to the ordering rule to obtain a word information sequence to be ordered.

Fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown, the electronic device includes a memory 1101 and a processor 1102. The memory 1101 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on an electronic device. The memory 1101 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

The memory is used for storing programs;

the processor 1102 is coupled to the memory 1101, and is configured to execute the program stored in the memory 1101, so as to implement the image recognition method or the text sorting method in the above embodiments.

Further, as shown in FIG. 5, the electronic device also includes a communication component 1103, a display 1104, a power supply component 1105, an audio component 1106, and other components. Only some of the components are schematically shown in fig. 5, which does not mean that the electronic device only comprises the components shown in fig. 5.

Accordingly, the embodiments of the present application also provide a computer-readable storage medium storing a computer program, where the computer program when executed by a computer can implement the steps or functions of the image recognition method and the text sorting method provided in the foregoing embodiments.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same, and although the present application has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present application.

Claims

1. An image recognition method, comprising:

Identify multiple text information to be sorted contained in the image to be identified;

The features and graph structures corresponding to the plurality of text information to be sorted are used as inputs of a trained graph convolutional neural network model, and the graph convolutional neural network model is executed to obtain a reading order of the plurality of text information to be sorted; wherein the features carry semantic features; the graph structure is constructed according to the adjacency relationship between the plurality of text information to be sorted, and the nodes in the graph structure are used to represent the text information to be sorted; and the edges in the graph structure are used to represent whether the nodes are adjacent;

The plurality of text information to be sorted are sorted according to the reading order to obtain a sequence of text information to be sorted.

2. The method according to claim 1, further comprising:

Determine the adjacency relationship between the multiple text information to be sorted.

3. The method according to claim 2, further comprising:

According to the adjacency relationship between the plurality of text information to be sorted, a graph structure having nodes and edges is constructed.

4. The method according to claim 3, characterized in that the graph convolutional neural network model is used to:

According to the features corresponding to the plurality of text information to be sorted and the graph structure, obtaining updated features corresponding to the plurality of text information to be sorted through a graph convolution operation;

The reading order of the plurality of text information to be sorted is determined according to the updated features corresponding to each of the plurality of text information to be sorted.

5. The method according to claim 4, characterized in that determining the reading order of the plurality of text information to be sorted according to the updated features corresponding to each of the plurality of text information to be sorted comprises:

The updated features corresponding to the plurality of text information to be sorted are integrated to calculate the global text information features as the starting reference features;

Calculate the attention weight corresponding to each of the at least one text information to be sorted that has not been output among the multiple text information to be sorted according to the reference feature and the updated feature corresponding to each of the at least one text information to be sorted that has not been output among the multiple text information to be sorted;

Output the text information to be sorted corresponding to the maximum attention weight, and use the updated features corresponding to the text information to be sorted corresponding to the maximum attention weight as new reference features, and continue to perform the above attention weight calculation step until all the multiple text information to be sorted are output;

The output order of the plurality of text information to be sorted is determined as the reading order of the plurality of text information to be sorted.

6. The method according to any one of claims 3 to 5, further comprising:

Acquire a sample image and an expected text information sequence corresponding to a plurality of sample text information contained in the sample image;

The graph convolutional neural network model is optimized according to the sample features corresponding to each of the multiple sample text information, the graph structures corresponding to the multiple sample text information, and the expected text information sequence.

7. The method according to any one of claims 2 to 5, characterized in that determining the adjacency relationship between the plurality of text information to be sorted comprises:

The adjacency relationship between the multiple text information to be sorted is determined according to the features corresponding to each of the multiple text information to be sorted.

8. The method according to claim 7, characterized in that determining the adjacency relationship between the plurality of text information to be sorted according to the features corresponding to each of the plurality of text information to be sorted comprises:

Calculating the correlation between each two pieces of text information to be sorted according to the features corresponding to each two pieces of text information to be sorted in the plurality of pieces of text information to be sorted;

Whether the two pieces of text information to be sorted are adjacent is determined according to the correlation between the two pieces of text information to be sorted.

9. The method according to claim 8, characterized in that the features also carry visual features;

Calculating the correlation between each two pieces of text information to be sorted according to the features corresponding to each two pieces of text information to be sorted among the plurality of pieces of text information to be sorted includes:

Calculating the first similarity between the semantic features corresponding to each two pieces of text information to be sorted;

Calculating the second similarity between the visual features corresponding to each two pieces of text information to be sorted;

The first similarity and the second similarity are combined to determine the correlation between each two pieces of text information to be sorted.

10. The method according to any one of claims 1 to 5, further comprising:

Extracting semantic features corresponding to each of the identified plurality of text information to be sorted;

Extracting visual features corresponding to each of the plurality of text information to be sorted;

The semantic features and visual features corresponding to the plurality of text information to be sorted are integrated to obtain the features corresponding to the plurality of text information to be sorted.

11. The method according to claim 10, characterized in that extracting the visual features corresponding to each of the plurality of text information to be sorted comprises:

According to the positions of the plurality of text information to be sorted in the image to be identified, determining the sub-image regions where the plurality of text information to be sorted are located from the image to be identified;

Visual features corresponding to the plurality of text information to be sorted are extracted respectively according to the sub-image regions where the plurality of text information to be sorted are respectively located.

12. The method according to any one of claims 1 to 5, characterized in that identifying a plurality of text information to be sorted contained in the image to be identified comprises:

Identify, from the image to be identified, a plurality of text information contained therein and the position of each text information in the image to be identified;

According to the position of each piece of text information in the image to be recognized, the plurality of text information are divided by using a clustering algorithm to obtain at least one text information cluster;

A plurality of text information in one of the at least one text information cluster is selected as the plurality of text information to be sorted.

13. A method for sorting characters, comprising:

Get multiple text information to be sorted;

Determining the reading order of the plurality of text information to be sorted by comprehensively considering the features corresponding to each of the plurality of text information to be sorted and the adjacency relationship between the plurality of text information to be sorted, including: constructing a graph structure with nodes and edges according to the adjacency relationship between the plurality of text information to be sorted; the nodes in the graph structure are used to represent the text information to be sorted; the edges in the graph structure are used to indicate whether the nodes are adjacent; using the features corresponding to each of the plurality of text information to be sorted and the graph structure as inputs of a trained graph convolutional neural network model, executing the graph convolutional neural network model, so as to obtain the reading order of the plurality of text information to be sorted;

The plurality of text information to be sorted are sorted according to the reading order to obtain a first text information sequence.

14. The method according to claim 13, further comprising:

15. An electronic device, comprising: a memory and a processor, wherein:

The memory is used to store programs;

The processor is coupled to the memory and is configured to execute the program stored in the memory to:

16. An electronic device, comprising: a memory and a processor, wherein:

The memory is used to store programs;

Get multiple text information to be sorted;