CN113536858A - Image recognition method and system
- Publication number
- CN113536858A (application CN202010313920.4A)
- Authority
- CN
- China
- Prior art keywords
- image
- recognition
- text data
- text
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application discloses an image recognition method and system. The method comprises the following steps: acquiring a first image and a second image, wherein the arrangement direction of first text data contained in the first image is a first direction, and the arrangement direction of second text data contained in the second image is a second direction; and processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image. The text recognition model inputs the first image and the second image into a feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputs the first feature sequence into a first recognition model to obtain the first recognition result, and inputs the second feature sequence into a second recognition model to obtain the second recognition result. The method and system solve the technical problem in the related art that recognizing text data arranged in multiple directions with existing text recognition methods wastes computing and storage resources.
Description
Technical Field
The present application relates to the field of image recognition, and in particular, to an image recognition method and system.
Background
Currently, an image may be processed by a text recognition algorithm to recognize the text data it contains. Because a traditional text line recognition algorithm can only process text data arranged in one direction, a simple solution when text data arranged in multiple directions needs to be recognized is to train multiple models, each handling text data arranged in one direction. However, this solution requires storing multiple models and running each of them separately, resulting in a waste of computing and storage resources.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the application provide an image recognition method and system, to at least solve the technical problem in the related art that recognizing text data arranged in multiple directions with existing text recognition methods wastes computing and storage resources.
According to an aspect of an embodiment of the present application, there is provided an image recognition method including: acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image; the text recognition model is used for inputting the first image and the second image into the feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputting the first feature sequence into the first recognition model to obtain a first recognition result, and inputting the second feature sequence into the second recognition model to obtain a second recognition result.
According to another aspect of the embodiments of the present application, there is also provided an image recognition apparatus, including: a first acquisition module, configured to acquire a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; and a processing module, configured to process the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image; wherein the text recognition model inputs the first image and the second image into a feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputs the first feature sequence into a first recognition model to obtain the first recognition result, and inputs the second feature sequence into a second recognition model to obtain the second recognition result.
According to another aspect of the embodiments of the present application, there is also provided an image recognition method, including: acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; performing feature extraction on the first image and the second image to obtain a first feature sequence of the first image and a second feature sequence of the second image; acquiring a first recognition result of the first image based on the first feature sequence; and acquiring a second recognition result of the second image based on the second feature sequence.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program, wherein when the program runs, an apparatus where the storage medium is located is controlled to execute the above-mentioned image recognition method.
According to another aspect of the embodiments of the present application, there is also provided a computing device, including: a memory and a processor, wherein the memory is used for storing a program and the processor is used for running the program, the above-mentioned image recognition method being executed when the program runs.
According to another aspect of the embodiments of the present application, there is also provided an image recognition system, including: a processor; and a memory coupled to the processor for providing instructions to the processor for processing the following processing steps: acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image; the text recognition model is used for inputting the first image and the second image into the feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputting the first feature sequence into the first recognition model to obtain a first recognition result, and inputting the second feature sequence into the second recognition model to obtain a second recognition result.
In the embodiments of the application, an image containing text data in two different arrangement directions can be processed with a single text recognition model to obtain recognition results for the text data in both directions, thereby achieving the purpose of image recognition. Notably, horizontal and vertical text lines are recognized simultaneously by one model with two recognition branches, which avoids storing two recognition models online, achieves the technical effect of saving computing and storage resources, and solves the technical problem in the related art that recognizing text data arranged in multiple directions wastes computing and storage resources.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a block diagram of a hardware structure of a computer terminal (or mobile device) for implementing an image recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of an image recognition method according to an embodiment of the present application;
FIG. 3 is an architectural diagram illustrating an alternative image recognition method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an alternative generate mask matrix according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an image recognition apparatus according to an embodiment of the present application;
FIG. 6 is a flow chart of another image recognition method according to an embodiment of the present application; and
fig. 7 is a block diagram of a computer terminal according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some terms or terms appearing in the description of the embodiments of the present application are applicable to the following explanations:
OCR: optical Character Recognition, may refer to the Recognition of Optical characters by image processing and pattern Recognition techniques.
CTC: connection objective Classification, connection-oriented time Classification, can be used to solve the problem that input sequences and output sequences are difficult to correspond one to one.
CRNN: the convolutional Recurrent Neural Network can be a combination of a convolutional Neural Network CNN and a Recurrent Neural Network RNN, and a Network architecture mainly comprises three parts, including a convolutional layer, a cyclic layer and a transcription layer.
Attention Mechanism: a mechanism that can be used to improve the effect of RNN-based encoder-decoder models. By assigning a different weight to each word in a sentence, it makes the learning of the neural network model more flexible; the weights can also be interpreted as an alignment between the input and output sentences in translation.
BLSTM: Bidirectional Long Short-Term Memory neural network, which can exploit information from both past and future time steps. The network consists of two ordinary RNNs: a forward RNN that uses past information and a backward RNN that uses future information.
ResNet: Residual Neural Network. By adding shortcut connections to the network, the original input can be passed directly to later layers, so that a layer does not need to learn the entire output but only the residual relative to the output of the preceding layers.
Mask: a string of binary codes that is bitwise-ANDed with a target field to mask out selected input bits.
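As a quick illustration of this glossary entry (an editorial sketch, not part of the patent itself), the following Python snippet shows a bitwise-AND mask; the values are arbitrary:

```python
# Keep only the low 4 bits of a value; the high bits are masked off.
value, mask = 0b10110110, 0b00001111
print(bin(value & mask))  # prints 0b110
```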
Existing text recognition algorithms can be mainly classified into two categories: CTC-based text recognition algorithms and Attention-based text recognition algorithms.
A CTC-based text recognition algorithm consists of three parts. The first part is a convolutional neural network, used to extract the feature sequence of an image; the second part is a recurrent neural network, used to learn the context information of the text feature sequence; the third part is a CTC decoder, which solves the alignment problem between sequences of different lengths by introducing a blank class and decodes the probability distribution output by the recurrent neural network into the final recognition result. In such methods, the CNN of the first part normally reduces the height of the image to 1, so the height of the input image must be fixed, which makes it impossible for the model to process images containing text data arranged in the second (vertical) direction.
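To make the role of the blank class concrete (a minimal sketch, not the patent's implementation), greedy CTC decoding collapses repeated predictions and then removes blanks; the class indices below are hypothetical:

```python
# Minimal greedy CTC decoding sketch. `argmax_indices` holds the
# per-timestep class predictions; index 0 is the assumed blank class.
def ctc_greedy_decode(argmax_indices, blank=0):
    decoded, prev = [], None
    for idx in argmax_indices:
        # Collapse consecutive repeats, then drop the blank class.
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded

# Eight timesteps decode to the two-label sequence [3, 5].
print(ctc_greedy_decode([0, 3, 3, 0, 5, 5, 5, 0]))
```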
An Attention-based text recognition algorithm also consists of three parts, of which the first two are the same as in the CTC-based algorithm, but the third part decodes with an Attention Mechanism, outputting the recognition result step by step at each time point. Unlike the CTC method, the CNN output of an Attention-based algorithm may be a two-dimensional feature map, so in theory this type of algorithm can process images containing text data arranged in both horizontal and vertical directions. However, since the number of Chinese character categories is much larger than that of English, the Attention algorithm, which classifies at each time point, performs poorly on Chinese text line recognition; in addition, its forward pass is slower than that of the CTC algorithm, so the Attention algorithm is not suitable for the lightweight text recognition scenario considered here.
Because a traditional text line recognition algorithm can only process text line images arranged in one direction, the application range of the model is greatly limited. One of the simplest solutions is to train two models separately, but this requires storing both models online, resulting in wasted resources.
To solve the above problems, the present application optimizes and improves the text recognition algorithm and provides a single model capable of recognizing text data arranged in both the horizontal and vertical directions, which avoids storing two models and can also improve the recognition effect of the model.
The text recognition algorithm can be applied in various fields that require character recognition. For example, in the field of design assistance for online shopping platforms, to help a merchant design a product detail page, the merchant can upload a design template; text data arranged in different directions in the template can be recognized by the algorithm and replaced according to the merchant's needs to produce the merchant's product detail page. Further, for the convenience of users, the algorithm may be deployed on a cloud server as a SaaS (Software as a Service) offering, so that users can recognize text data arranged in different directions over the internet as needed.
Example 1
In accordance with an embodiment of the present application, there is provided an image recognition method. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, such as one executing a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps shown or described may be performed in a different order.
The method provided by the embodiments of the application can be executed on a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a hardware block diagram of a computer terminal (or mobile device) for implementing the image recognition method. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, …, 102n; the processors 102 may include, but are not limited to, a processing device such as a microcontroller unit (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, it may also include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the BUS), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and does not limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration from that shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuitry may act as a kind of processor control (e.g., selection of a variable-resistance termination path connected to an interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the image recognition method in the embodiment of the present application, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, so as to implement the image recognition method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
It should be noted here that, in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both. It should be noted that fig. 1 is only one specific example and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
Under the above operating environment, the present application provides an image recognition method as shown in fig. 2. Fig. 2 is a flowchart of an image recognition method according to an embodiment of the present application. As shown in fig. 2, the method comprises the steps of:
step S202, a first image and a second image are obtained, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction;
the first image and the second image in the above steps may refer to text images extracted from the image to be processed, and texts included in the two images are arranged according to different directions, where the arrangement directions are the first direction and the second direction, respectively. The image to be processed may be a template image uploaded by a merchant in the field of the aided design of the online shopping platform, or may be an image uploaded by a user received in the SaaS service, but is not limited thereto. In the embodiment of the present application, the first direction is taken as a horizontal direction, and the second direction is taken as a vertical direction for example, but the present invention is not limited thereto.
In an optional embodiment, an image to be processed containing both horizontal and vertical text may be processed by an image text extraction method to extract a first image containing the horizontal text and a second image containing the vertical text; the two images are then processed simultaneously by one model to obtain the final recognition results.
Step S204, processing the first image and the second image by using the text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image;
the text recognition model is used for inputting the first image and the second image into the feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputting the first feature sequence into the first recognition model to obtain a first recognition result, and inputting the second feature sequence into the second recognition model to obtain a second recognition result.
Optionally, the feature extraction model comprises a plurality of convolutional layers, excitation layers, and pooling layers connected in sequence, wherein the parameter of the last pooling layers is 2 × 1; the first recognition model or the second recognition model comprises: a convolutional neural network, a bidirectional long short-term memory neural network, and connectionist temporal classification.
To ensure that text data arranged in different directions can be recognized simultaneously by one model, the CTC-based text recognition algorithm can be improved. Because the low-level features of horizontally arranged text are similar to those of vertically arranged text, most of the existing convolutional neural network can be shared to form the feature extraction model, thereby reducing the number of parameters. On top of this shared model, text data arranged in different directions can then be recognized by different recognition models, and the structures of the recognition models are completely identical.
For the feature extraction model, its main structure may be obtained by pruning VGG16 (Visual Geometry Group network); the feature extraction model is used to extract feature information from a text line image for subsequent recognition. The first recognition model and the second recognition model may have identical structures, each consisting of a CNN, a BLSTM, and CTC.
To enhance the feature extraction capability, ResNet may also be used as the feature extraction model, but the feature extraction model is not limited to this and may have other structures.
Since the sequence recognition model in a conventional CTC-based text recognition algorithm needs to reduce the height of the image to 1 while preserving sufficient width, the model includes pooling layers with 2 × 1 kernels; the shared feature extraction model covers the layers before the 2 × 1 pooling layers appear, and the first recognition model and the second recognition model branch off from that point.
For example, taking text data arranged in the horizontal and vertical directions as an example, as shown in fig. 3, the overall framework of the text recognition model of the present application can be divided into three parts: the first part is data processing; the second part is the feature extraction model with a shared CNN layer, used to extract feature information of the first text data and the second text data contained in the images; the third part is the branch recognition networks (the first recognition model and the second recognition model described above), used to recognize text data arranged in the horizontal and vertical directions, respectively.
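A minimal PyTorch sketch of this three-part layout is given below. The patent specifies only a shared CNN trunk, 2 × 1 pooling to squeeze the height, and two structurally identical CNN + BLSTM + CTC branches; all layer widths, depths, and the class count here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DualBranchRecognizer(nn.Module):
    """Sketch of a shared CNN trunk with two identical recognition
    branches; sizes are assumptions, not the patent's exact network."""
    def __init__(self, num_classes=100):
        super().__init__()
        # Shared trunk; 4 input channels = RGB image + mask channel.
        self.trunk = nn.Sequential(
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Each branch: convolution plus 2x1 pooling that reduces the
        # height to 1, a BLSTM over the remaining axis, and per-timestep
        # logits for a CTC loss/decoder (+1 class for the CTC blank).
        def make_branch():
            return nn.ModuleDict({
                "conv": nn.Sequential(
                    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
                    nn.MaxPool2d((2, 1)),             # halve height only
                    nn.AdaptiveAvgPool2d((1, None)),  # height -> 1
                ),
                "blstm": nn.LSTM(256, 128, bidirectional=True,
                                 batch_first=True),
                "fc": nn.Linear(256, num_classes + 1),
            })
        self.horizontal, self.vertical = make_branch(), make_branch()

    def run_branch(self, branch, features):
        f = branch["conv"](features)       # (B, C, 1, W)
        f = f.squeeze(2).permute(0, 2, 1)  # (B, W, C): feature sequence
        f, _ = branch["blstm"](f)
        return branch["fc"](f)             # (B, W, num_classes + 1)

    def forward(self, x_horizontal, x_vertical):
        # Both inputs pass through the same shared trunk.
        return (self.run_branch(self.horizontal, self.trunk(x_horizontal)),
                self.run_branch(self.vertical, self.trunk(x_vertical)))

model = DualBranchRecognizer()
out_h, out_v = model(torch.randn(1, 4, 32, 400),   # horizontal line + mask
                     torch.randn(1, 4, 240, 32))   # vertical line + mask
print(out_h.shape, out_v.shape)
```

Sharing the trunk is what yields the saving in parameters and storage that the embodiment describes; only the branch weights are duplicated.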
Based on the scheme provided by the embodiments of the application, images containing text data in two different arrangement directions can be processed with a single text recognition model to obtain recognition results for the text data in both directions, thereby achieving the purpose of image recognition. Notably, horizontal and vertical text lines are recognized simultaneously by one model with two recognition branches, which avoids storing two recognition models online, achieves the technical effect of saving computing and storage resources, and solves the technical problem in the related art that recognizing text data arranged in multiple directions wastes computing and storage resources.
In the above embodiment of the present application, before processing the first image and the second image by using the text recognition model, the method further includes: acquiring the position of each first character in the first image and the position of each second character in the second image; generating a first mask matrix corresponding to the first image based on the position of each first character, and generating a second mask matrix corresponding to the second image based on the position of each second character, wherein the first mask matrix is used for representing the arrangement direction of the first text data, and the second mask matrix is used for representing the arrangement direction of the second text data; splicing the first image, the first mask matrix, the second image and the second mask matrix to obtain an input matrix; and processing the input matrix by using the text recognition model to obtain a first recognition result and a second recognition result.
Optionally, the first mask matrix and the second mask matrix in the above steps may be asymmetric matrices that encode direction information; the values of a mask matrix are related only to the positions of the characters in the text data and to the arrangement direction. Therefore, the arrangement direction of the text data can be determined from the values of the mask matrix.
In an optional embodiment, because the recognition process does not require highly accurate character positions, the position of each character in the text data can be obtained by a simple projection segmentation method. A mask matrix is generated from the character positions and the arrangement direction, and the mask information is added to the original first image and second image before processing with the text recognition model. This enhances the model's ability to distinguish between horizontally and vertically arranged text data, lets the shared CNN layer learn feature information suited to text in different directions in a targeted manner, and ultimately improves the recognition effect of the model.
For example, still taking text data arranged in the horizontal and vertical directions as an example, as shown in fig. 3, the data processing procedure may be to generate a mask matrix from the character positions and then splice each image with its corresponding mask in the channel dimension; the resulting four-channel matrix is used as the input of the model.
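A NumPy sketch of this data-processing step for a horizontal line is shown below; the ramp values (one value per character region, increasing left to right) and the region sizes are assumptions drawn from the description of fig. 4:

```python
import numpy as np

def horizontal_mask(height, char_boxes):
    """Mask whose values increase left to right, one value per character
    region; char_boxes holds (x_start, x_end) column ranges, e.g. from
    projection segmentation. Illustrative sketch only."""
    width = max(x_end for _, x_end in char_boxes)
    mask = np.zeros((height, width), dtype=np.float32)
    for i, (x0, x1) in enumerate(char_boxes, start=1):
        mask[:, x0:x1] = i  # the i-th character region gets value i
    return mask

# Hypothetical 32x400 line with eight 50-pixel-wide character regions.
boxes = [(i * 50, (i + 1) * 50) for i in range(8)]
mask = horizontal_mask(32, boxes)

# Splice image and mask on the channel dimension -> four-channel input.
image = np.zeros((32, 400, 3), dtype=np.float32)  # placeholder RGB line
model_input = np.concatenate([image, mask[..., None]], axis=-1)
print(model_input.shape)  # (32, 400, 4)
```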
In the above embodiments of the present application, generating, based on the position of each second character, a second mask matrix corresponding to the second image includes: rotating the second image to obtain a rotated image, wherein the arrangement direction of the second text data in the rotated image is a first direction; generating a preset matrix based on the position of each second character; and rotating the preset matrix to obtain a second mask matrix.
In this embodiment, similar to the first mask matrix, the values of the preset matrix for text data arranged in the horizontal direction may increase sequentially from left to right and from top to bottom.
In an alternative embodiment, the second image may be rotated to align the text data in the same direction as the first image, then a mask matrix corresponding to the rotated image is generated according to the character position, and then the generated mask matrix is rotated to obtain a mask matrix corresponding to the second image (i.e., the second mask matrix described above).
It should be noted that the rotation direction of the second image is opposite to the rotation direction of the preset matrix.
For example, still taking text data arranged in the horizontal and vertical directions as an example, as shown in fig. 3, an image containing vertically arranged text data may be rotated so that the text becomes horizontal; a mask matrix is then generated from the character positions, and the mask matrix corresponding to the original vertically arranged text is obtained by rotating this mask matrix counterclockwise by 90 degrees. Finally, each image is spliced with its corresponding mask in the channel dimension, and the four-channel matrix is used as the input of the model.
As shown in fig. 4, for the horizontally arranged text data "technical committee for standardization", which contains 8 characters, a mask region corresponding to each character may be generated, with the mask values increasing sequentially from left to right, yielding an image with the mask matrix added. It should be noted that each character's mask region has the same size as the region corresponding to that character; for example, if the region corresponding to a character is 32 × 50, the mask region for that character is 32 × 50. If every character's region is 32 × 50, the mask matrix corresponding to the image is 32 × 400, that is, the mask matrix has the same size as the image.
For the vertically arranged text data "economic check team", which contains 6 characters, the text can be rotated into horizontal arrangement, a mask region generated for each character with values increasing sequentially from left to right, and the mask matrix then rotated 90 degrees counterclockwise to obtain the final mask matrix and, in turn, the image with the mask matrix added.
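Continuing the NumPy sketch above (it reuses horizontal_mask from that block), the rotate-generate-rotate-back procedure for a vertical line might look as follows; the exact rotation conventions are an assumption consistent with the description:

```python
import numpy as np

def vertical_mask(line_height, line_width, char_boxes):
    """Build the mask for a vertical line: generate it in the rotated
    (horizontal) layout, then rotate it back 90 degrees counterclockwise.
    Illustrative sketch; reuses horizontal_mask from the earlier block."""
    # In the rotated layout the line is line_width pixels tall and
    # line_height pixels wide.
    rotated = horizontal_mask(line_width, char_boxes)
    return np.rot90(rotated, k=1)  # back to the vertical orientation

# Hypothetical 300x32 vertical line with six 50-pixel-tall characters.
vmask = vertical_mask(300, 32, [(i * 50, (i + 1) * 50) for i in range(6)])
print(vmask.shape)  # (300, 32)
```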
In the above embodiment of the present application, the method further includes: acquiring a plurality of training images, wherein the arrangement direction of text data contained in each training image is a first direction or a second direction; and training the initial model by using a plurality of training images to obtain a text recognition model.
The training data in the above step may include images of the text data arranged in different directions.
In an optional embodiment, images of text data arranged in different directions can be mixed and used as training data for the model. In joint training, training data from different arrangement directions complement each other, enhancing the feature extraction capability of the encoder and improving the generalization and robustness of the text recognition model, so that its recognition effect is stronger than that of a model trained on data from a single direction.
For example, still taking text data arranged in the horizontal and vertical directions as an example, the text recognition model shown in fig. 3 may be trained on a mixture of horizontally and vertically arranged text data to obtain the trained text recognition model.
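A joint-training step over such mixed samples might look like the sketch below, assuming the DualBranchRecognizer sketched earlier and PyTorch's built-in CTC loss; the optimizer, batch layout, and label encoding are all assumptions:

```python
import torch
import torch.nn.functional as F

model = DualBranchRecognizer(num_classes=100)  # from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(x_h, y_h, x_v, y_v):
    """One mixed step: both directions flow through the shared trunk,
    so their gradients jointly update the shared features."""
    out_h, out_v = model(x_h, x_v)
    losses = []
    for out, target in ((out_h, y_h), (out_v, y_v)):
        log_probs = out.log_softmax(-1).permute(1, 0, 2)  # (T, B, C)
        t_len = torch.full((out.size(0),), out.size(1), dtype=torch.long)
        y_len = torch.full((target.size(0),), target.size(1), dtype=torch.long)
        losses.append(F.ctc_loss(log_probs, target, t_len, y_len, blank=100))
    loss = losses[0] + losses[1]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy mixed batch: four-channel inputs, integer labels in [1, 99].
x_h, y_h = torch.randn(2, 4, 32, 400), torch.randint(1, 100, (2, 8))
x_v, y_v = torch.randn(2, 4, 240, 32), torch.randint(1, 100, (2, 3))
print(train_step(x_h, y_h, x_v, y_v))
```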
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
Example 2
According to an embodiment of the present application, there is also provided an image recognition apparatus for implementing the image recognition method, as shown in fig. 5, the apparatus 500 includes: a first acquisition module 502 and a processing module 504.
The first obtaining module 502 is configured to obtain a first image and a second image, where the first image includes first text data, the second image includes second text data, an arrangement direction of the first text data is a first direction, and an arrangement direction of the second text data is a second direction; the processing module 504 is configured to process the first image and the second image by using the text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image; the text recognition model is used for inputting the first image and the second image into the feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputting the first feature sequence into the first recognition model to obtain a first recognition result, and inputting the second feature sequence into the second recognition model to obtain a second recognition result.
It should be noted here that the first acquisition module 502 and the processing module 504 correspond to steps S202 to S204 in Embodiment 1; the two modules implement the same examples and application scenarios as the corresponding steps, but are not limited to the disclosure of Embodiment 1. The above modules may run, as part of the apparatus, in the computer terminal 10 provided in Embodiment 1.
In the above embodiment of the present application, the apparatus further includes: the device comprises a second acquisition module, a generation module and a splicing module.
The second acquisition module is used for acquiring the position of each first character in the first image and the position of each second character in the second image; the generating module is used for generating a first mask matrix corresponding to the first image based on the position of each first character and generating a second mask matrix corresponding to the second image based on the position of each second character, wherein the first mask matrix is used for representing the arrangement direction of the first text data, and the second mask matrix is used for representing the arrangement direction of the second text data; the splicing module is used for splicing the first image, the first mask matrix, the second image and the second mask matrix to obtain an input matrix; the processing module is further used for processing the input matrix by using the text recognition model to obtain a first recognition result and a second recognition result.
In the above embodiments of the present application, the generating module includes: the device comprises a first rotating unit, a generating unit and a second rotating unit.
The first rotation unit is used for rotating the second image to obtain a rotated image, wherein the arrangement direction of the second text data in the rotated image is a first direction; the generating unit is used for generating a preset matrix based on the position of each second character; the second rotation unit is used for rotating the preset matrix to obtain a second mask matrix.
In the above embodiment of the present application, the apparatus further includes: a third acquisition module and a training module.
The third acquisition module is used for acquiring a plurality of training images, wherein the arrangement direction of text data contained in each training image is a first direction or a second direction; the training module is used for training the initial model by utilizing a plurality of training images to obtain a text recognition model.
It should be noted that the preferred embodiments described in the above examples of the present application are the same as the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.
Example 3
According to the embodiment of the application, an image recognition method is further provided.
Fig. 6 is a flowchart of another image recognition method according to an embodiment of the present application. As shown in fig. 6, the method includes the steps of:
step S602, acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction;
the first image and the second image in the above steps may refer to text images extracted from the image to be processed, and texts included in the two images are arranged according to different directions, where the arrangement directions are the first direction and the second direction, respectively. The image to be processed may be a template image uploaded by a merchant in the field of the aided design of the online shopping platform, or may be an image uploaded by a user received in the SaaS service, but is not limited thereto. In the embodiment of the present application, the first direction is taken as a horizontal direction, and the second direction is taken as a vertical direction for example, but the present invention is not limited thereto.
Step S604, extracting the features of the first image and the second image to obtain a first feature sequence of the first image and a second feature sequence of the second image;
step S606, acquiring a first recognition result of the first image based on the first characteristic sequence;
in step S608, a second recognition result of the second image is obtained based on the second feature sequence.
In the above embodiments of the present application, performing feature extraction on the first image and the second image to obtain a first feature sequence of the first image and a second feature sequence of the second image includes: and inputting the first image and the second image into a shared feature extraction model to obtain a first feature sequence and a second feature sequence.
Optionally, the feature extraction model comprises a plurality of convolutional layers, excitation layers, and pooling layers connected in sequence, wherein the parameter of the last pooling layer is 2 × 1. The main structure of the feature extraction model can be obtained by pruning VGG16 (Visual Geometry Group network); the feature extraction model is used to extract feature information of a text line image for subsequent recognition.
In the above embodiments of the present application, acquiring the first recognition result of the first image based on the first feature sequence includes: and inputting the first characteristic sequence into the first recognition model to obtain a first recognition result.
Optionally, the first recognition model comprises: a convolutional neural network, a bidirectional long short-term memory neural network, and connectionist temporal classification. The structure of the first recognition model may consist of a CNN, a BLSTM, and CTC.
In the foregoing embodiment of the present application, acquiring the second recognition result of the second image based on the second feature sequence includes: and inputting the second characteristic sequence into a second recognition model to obtain a second recognition result.
Optionally, the second recognition model comprises: a convolutional neural network, a bidirectional long short-term memory neural network, and connectionist temporal classification. The structure of the second recognition model may be the same as that of the first recognition model, consisting of a CNN, a BLSTM, and CTC.
In the above embodiment of the present application, before the feature extraction is performed on the first image and the second image, the method further includes: acquiring the position of each first character in the first image and the position of each second character in the second image; generating a first mask matrix corresponding to the first image based on the position of each first character, and generating a second mask matrix corresponding to the second image based on the position of each second character, wherein the first mask matrix is used for representing the arrangement direction of the first text data, and the second mask matrix is used for representing the arrangement direction of the second text data; splicing the first image, the first mask matrix, the second image and the second mask matrix to obtain an input matrix; and performing feature extraction on the input matrix to obtain a first feature sequence and a second feature sequence.
Optionally, the first mask matrix and the second mask matrix in the above steps may be asymmetric matrices that encode direction information; the values of a mask matrix are related only to the positions of the characters in the text data and to the arrangement direction. Therefore, the arrangement direction of the text data can be determined from the values of the mask matrix.
In the above embodiments of the present application, generating, based on the position of each second character, a second mask matrix corresponding to the second image includes: rotating the second image to obtain a rotated image, wherein the arrangement direction of the second text data in the rotated image is a first direction; generating a preset matrix based on the position of each second character; and rotating the preset matrix to obtain a second mask matrix.
In the above embodiment of the present application, the method further includes: acquiring a plurality of training images, wherein the arrangement direction of text data contained in each training image is a first direction or a second direction; and training the feature extraction model, the first recognition model and the second recognition model by using a plurality of training images to obtain a text recognition model.
It should be noted that the preferred embodiments described in the above examples of the present application are the same as the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.
Example 4
According to an embodiment of the present application, there is also provided an image recognition system including:
a processor; and
a memory coupled to the processor for providing instructions to the processor for processing the following processing steps: acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image; the text recognition model is used for inputting the first image and the second image into the feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputting the first feature sequence into the first recognition model to obtain a first recognition result, and inputting the second feature sequence into the second recognition model to obtain a second recognition result.
It should be noted that the preferred embodiments described in the above examples of the present application are the same as the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.
Example 5
The embodiment of the application can provide a computer terminal, and the computer terminal can be any one computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the computer terminal may execute program codes of the following steps in the image recognition method: acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image; the text recognition model is used for inputting the first image and the second image into the feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputting the first feature sequence into the first recognition model to obtain a first recognition result, and inputting the second feature sequence into the second recognition model to obtain a second recognition result.
Optionally, fig. 7 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 7, the computer terminal a may include: one or more processors 702 (only one of which is shown), and memory 704.
The memory may be configured to store software programs and modules, such as program instructions/modules corresponding to the image recognition method and apparatus in the embodiments of the present application, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, so as to implement the image recognition method. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located from the processor, and these remote memories may be connected to terminal a through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image; the text recognition model is used for inputting the first image and the second image into the feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputting the first feature sequence into the first recognition model to obtain a first recognition result, and inputting the second feature sequence into the second recognition model to obtain a second recognition result.
Optionally, the processor may further execute the program code of the following steps: acquiring the position of each first character in the first image and the position of each second character in the second image; generating a first mask matrix corresponding to the first image based on the position of each first character, and generating a second mask matrix corresponding to the second image based on the position of each second character, wherein the first mask matrix is used for representing the arrangement direction of the first text data, and the second mask matrix is used for representing the arrangement direction of the second text data; splicing the first image, the first mask matrix, the second image and the second mask matrix to obtain an input matrix; and processing the input matrix by using the text recognition model to obtain a first recognition result and a second recognition result.
Optionally, the processor may further execute the program code of the following steps: rotating the second image to obtain a rotated image, wherein the arrangement direction of the second text data in the rotated image is a first direction; generating a preset matrix based on the position of each second character; and rotating the preset matrix to obtain a second mask matrix.
Optionally, the processor may further execute the program code of the following steps: acquiring a plurality of training images, wherein the arrangement direction of text data contained in each training image is a first direction or a second direction; and training the initial model by using a plurality of training images to obtain a text recognition model.
The embodiments of the application thus provide an image recognition scheme. Text data in two different arrangement directions is recognized by one text recognition model, which avoids storing two recognition models online, achieves the technical effect of saving computing and storage resources, and solves the technical problem in the related art that recognizing text data arranged in multiple directions wastes computing and storage resources.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; performing feature extraction on the first image and the second image to obtain a first feature sequence of the first image and a second feature sequence of the second image; acquiring a first recognition result of the first image based on the first feature sequence; and acquiring a second recognition result of the second image based on the second feature sequence.
It can be understood by those skilled in the art that the structure shown in fig. 7 is only an illustration, and the computer terminal may also be a terminal device such as a smartphone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 7 does not limit the structure of the above electronic device. For example, the computer terminal A may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 7, or have a different configuration from that shown in fig. 7.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Example 6
Embodiments of the present application also provide a storage medium. Optionally, in this embodiment, the storage medium may be configured to store program codes executed by the image recognition method provided in the foregoing embodiment.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image; the text recognition model is used for inputting the first image and the second image into the feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputting the first feature sequence into the first recognition model to obtain a first recognition result, and inputting the second feature sequence into the second recognition model to obtain a second recognition result.
Optionally, the storage medium is further configured to store program code for performing the following steps: acquiring the position of each first character in the first image and the position of each second character in the second image; generating a first mask matrix corresponding to the first image based on the position of each first character, and generating a second mask matrix corresponding to the second image based on the position of each second character, wherein the first mask matrix is used for representing the arrangement direction of the first text data, and the second mask matrix is used for representing the arrangement direction of the second text data; splicing the first image, the first mask matrix, the second image, and the second mask matrix to obtain an input matrix; and processing the input matrix by using the text recognition model to obtain the first recognition result and the second recognition result.
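As a hedged illustration of these steps, the sketch below builds a mask whose values increase from left to right and from top to bottom (consistent with the preset matrix described in the claims) and splices each image with its mask as an extra channel before batching. Channel-wise stacking is only one plausible reading of "splicing"; the function names, the float encoding, and the assumption that both images share the same height and width are choices made for this example.

```python
import numpy as np

def direction_mask(height: int, width: int) -> np.ndarray:
    # Values grow left-to-right and top-to-bottom, encoding reading order.
    rows = np.arange(height, dtype=np.float32).reshape(-1, 1)
    cols = np.arange(width, dtype=np.float32).reshape(1, -1)
    return rows * width + cols

def splice_inputs(first_image, first_mask, second_image, second_mask):
    # Attach each mask to its image as an extra channel, then batch both.
    # Assumes all four arrays have the same (H, W) shape.
    first = np.stack([first_image, first_mask])     # (2, H, W)
    second = np.stack([second_image, second_mask])  # (2, H, W)
    return np.stack([first, second])                # (2, 2, H, W) input matrix
```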
Optionally, the storage medium is further configured to store program code for performing the following steps: rotating the second image to obtain a rotated image, wherein the arrangement direction of the second text data in the rotated image is the first direction; generating a preset matrix based on the position of each second character; and rotating the preset matrix to obtain the second mask matrix.
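A minimal numpy sketch of this rotation-based construction follows, assuming the second direction is vertical and a 90-degree rotation maps it onto the first (horizontal) direction; the rotation sense, function name, and dtype are illustrative assumptions.

```python
import numpy as np

def build_second_mask(second_image: np.ndarray):
    # Rotate so the vertically arranged text reads in the first direction.
    rotated = np.rot90(second_image, k=-1)  # 90 degrees clockwise, (W, H)
    h, w = rotated.shape
    # Preset matrix: values increase left-to-right, then top-to-bottom.
    preset = (np.arange(h, dtype=np.float32).reshape(-1, 1) * w
              + np.arange(w, dtype=np.float32))
    # Rotate the preset matrix back so the mask aligns with the original image.
    second_mask = np.rot90(preset, k=1)     # counter-clockwise, back to (H, W)
    return rotated, second_mask
```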
Optionally, the storage medium is further configured to store program code for performing the following steps: acquiring a plurality of training images, wherein the arrangement direction of the text data contained in each training image is the first direction or the second direction; and training an initial model by using the plurality of training images to obtain the text recognition model.
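The training step can be sketched as follows, assuming a generic recognizer that maps an image batch to per-timestep class scores and a data loader that mixes training images of both arrangement directions. The use of connectionist temporal classification follows the recognition models described in the claims, while the optimizer, learning rate, and loader interface are assumptions of this example.

```python
import torch
import torch.nn as nn

def train_text_recognizer(model: nn.Module, loader, epochs: int = 10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # CTC loss aligns unsegmented per-timestep scores with target label sequences.
    ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
    for _ in range(epochs):
        for images, targets, input_lengths, target_lengths in loader:
            logits = model(images)                               # (N, T, C)
            log_probs = logits.log_softmax(-1).permute(1, 0, 2)  # (T, N, C)
            loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```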
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; performing feature extraction on the first image and the second image to obtain a first feature sequence of the first image and a second feature sequence of the second image; acquiring a first recognition result of the first image based on the first feature sequence; and acquiring a second recognition result of the second image based on the second feature sequence.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The foregoing descriptions are only preferred embodiments of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principles of the present application, and these improvements and modifications should also fall within the protection scope of the present application.
Claims (14)
1. An image recognition method, comprising:
acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction;
processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image;
wherein the text recognition model is used for inputting the first image and the second image into a feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputting the first feature sequence into a first recognition model to obtain the first recognition result, and inputting the second feature sequence into a second recognition model to obtain the second recognition result.
2. The method of claim 1, wherein before processing the first image and the second image by using the text recognition model, the method further comprises:
acquiring the position of each first character in the first image and the position of each second character in the second image;
generating a first mask matrix corresponding to the first image based on the position of each first character, and generating a second mask matrix corresponding to the second image based on the position of each second character, wherein the first mask matrix is used for representing the arrangement direction of the first text data, and the second mask matrix is used for representing the arrangement direction of the second text data;
splicing the first image, the first mask matrix, the second image and the second mask matrix to obtain an input matrix;
and processing the input matrix by using the text recognition model to obtain the first recognition result and the second recognition result.
3. The method of claim 2, wherein generating a second mask matrix corresponding to the second image based on the position of each second character comprises:
rotating the second image to obtain a rotated image, wherein the arrangement direction of the second text data in the rotated image is the first direction;
generating a preset matrix based on the position of each second character;
and rotating the preset matrix to obtain the second mask matrix.
4. The method of claim 3, wherein the first mask matrix and the second mask matrix are asymmetric matrices; and values of the preset matrix or the first mask matrix increase sequentially from left to right and from top to bottom.
5. The method of claim 1, wherein the method further comprises:
acquiring a plurality of training images, wherein the arrangement direction of text data contained in each training image is the first direction or the second direction;
and training an initial model by using the plurality of training images to obtain the text recognition model.
6. The method of claim 1, wherein the feature extraction model comprises: a plurality of convolutional layers, excitation layers, and pooling layers connected in sequence, wherein the parameter of the last pooling layer is 2 x 1; and the first recognition model or the second recognition model comprises: a convolutional neural network, a bidirectional long short-term memory neural network, and connectionist temporal classification.
7. An image recognition method, comprising:
acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction;
performing feature extraction on the first image and the second image to obtain a first feature sequence of the first image and a second feature sequence of the second image;
acquiring a first recognition result of the first image based on the first feature sequence;
and acquiring a second recognition result of the second image based on the second feature sequence.
8. The method of claim 7, wherein performing feature extraction on the first image and the second image to obtain a first feature sequence of the first image and a second feature sequence of the second image comprises:
inputting the first image and the second image into a shared feature extraction model to obtain the first feature sequence and the second feature sequence.
9. The method of claim 7, wherein acquiring a first recognition result of the first image based on the first feature sequence comprises:
inputting the first feature sequence into a first recognition model to obtain the first recognition result.
10. The method of claim 7, wherein acquiring a second recognition result of the second image based on the second feature sequence comprises:
inputting the second feature sequence into a second recognition model to obtain the second recognition result.
11. The method of claim 7, wherein before performing feature extraction on the first image and the second image, the method further comprises:
acquiring the position of each first character in the first image and the position of each second character in the second image;
generating a first mask matrix corresponding to the first image based on the position of each first character, and generating a second mask matrix corresponding to the second image based on the position of each second character, wherein the first mask matrix is used for representing the arrangement direction of the first text data, and the second mask matrix is used for representing the arrangement direction of the second text data;
splicing the first image, the first mask matrix, the second image and the second mask matrix to obtain an input matrix;
and performing feature extraction on the input matrix to obtain the first feature sequence and the second feature sequence.
12. A storage medium comprising a stored program, wherein an apparatus in which the storage medium is located is controlled to perform the image recognition method of any one of claims 1 to 11 when the program is run.
13. A computing device, comprising: a memory for storing a program and a processor for running the program, wherein the program when running performs the image recognition method of any one of claims 1 to 11.
14. An image recognition system comprising:
a processor; and
a memory, coupled to the processor, configured to provide the processor with instructions for processing the following steps: acquiring a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image; wherein the text recognition model is used for inputting the first image and the second image into a feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputting the first feature sequence into a first recognition model to obtain the first recognition result, and inputting the second feature sequence into a second recognition model to obtain the second recognition result.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202010313920.4A (published as CN113536858A) | 2020-04-20 | 2020-04-20 | Image recognition method and system
Publications (1)

Publication Number | Publication Date
---|---
CN113536858A (en) | 2021-10-22
Family
ID=78123690
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202010313920.4A (pending) | Image recognition method and system | 2020-04-20 | 2020-04-20

Country Status (1)

Country | Link
---|---
CN (1) | CN113536858A (en)
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
SE9604008D0 (en) * | 1996-11-01 | 1996-11-01 | C Technologies Ab | Methods and apparatus for registration |
CN102332096A (en) * | 2011-10-17 | 2012-01-25 | 中国科学院自动化研究所 | A method of video subtitle text extraction and recognition |
CN105868758A (en) * | 2015-01-21 | 2016-08-17 | 阿里巴巴集团控股有限公司 | Method and device for detecting text area in image and electronic device |
CN105426818A (en) * | 2015-10-30 | 2016-03-23 | 小米科技有限责任公司 | Area extraction method and device |
KR101881862B1 (en) * | 2017-02-10 | 2018-07-25 | 한국외국어대학교 연구산학협력단 | Apparatus and Method for Generating Interpretation Text of Medical Image |
CN110321755A (en) * | 2018-03-28 | 2019-10-11 | 中移(苏州)软件技术有限公司 | A kind of recognition methods and device |
US20200026951A1 (en) * | 2018-07-19 | 2020-01-23 | Tata Consultancy Services Limited | Systems and methods for end-to-end handwritten text recognition using neural networks |
CN110414529A (en) * | 2019-06-26 | 2019-11-05 | 深圳中兴网信科技有限公司 | Paper information extracting method, system and computer readable storage medium |
KR20190106853A (en) * | 2019-08-27 | 2019-09-18 | 엘지전자 주식회사 | Apparatus and method for recognition of text information |
CN110751146A (en) * | 2019-10-23 | 2020-02-04 | 北京印刷学院 | Text area detection method, device, electronic terminal and computer-readable storage medium |
CN110929727A (en) * | 2020-02-12 | 2020-03-27 | 成都数联铭品科技有限公司 | Image labeling method and device, character detection method and system and electronic equipment |
Non-Patent Citations (1)
Title |
---|
ZHANG Tao: "A Sequence Text Recognition Method" (一种序列文字识别方法), Industrial Control Computer (工业控制计算机), no. 05, 25 May 2018 (2018-05-25) *
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination