CN111626287B

CN111626287B - Training method and device for recognition network for recognizing Chinese in scene

Info

Publication number: CN111626287B
Application number: CN201910146791.1A
Authority: CN
Inventors: 郜业飞; 董健; 颜水成
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2019-02-27
Filing date: 2019-02-27
Publication date: 2025-11-21
Anticipated expiration: 2039-02-27
Also published as: CN111626287A

Abstract

This invention provides a training method and apparatus for a recognition network that identifies Chinese characters within a scene. The method includes: randomly generating a first corpus sample using commonly used Chinese characters; synthesizing the first corpus sample with a first background image to obtain a first synthetic scene image sample containing Chinese character regions; and training a recognition network for identifying Chinese characters within a scene using the first synthetic scene image sample. Since the probability of commonly used Chinese characters appearing in the randomly generated corpus sample tends to be uniform, when training the recognition network using the scene image sample synthesized from the randomly generated corpus sample, the frequency with which the recognition network sees all commonly used Chinese characters also tends to be consistent. This solves the long-tail distribution problem of Chinese characters to a certain extent and improves the recognition effect of Chinese characters in a scene.

Description

Training method and device for recognition network for recognizing Chinese in scene

Technical Field

The invention relates to the technical field of image recognition, and in particular to a training method for a recognition network that recognizes Chinese in a scene, a training apparatus for such a recognition network, a computer storage medium, and a computing device.

Background

At present, deep learning technology is widely applied in the field of graphics and images. OCR (Optical Character Recognition), as a key link in the interaction between electronic devices and the external environment in daily life, is widely used in application scenarios such as license plate recognition, street view recognition, and network image/video monitoring. Thanks to deep learning, OCR recognition accuracy has improved remarkably, which has promoted the commercialization of the related technologies.

The application of deep-learning-based scene text recognition models to English text recognition has been widely studied by scholars at home and abroad, with good recognition results. However, Chinese has no special spacing between characters, a very large number of characters, visually similar character shapes, and a long-tail distribution in its corpus. Directly migrating an English recognition scheme to Chinese scene text recognition therefore rarely achieves the expected results.

Therefore, a method is needed that improves the recognition of Chinese characters in a scene by mitigating the long-tail character problem in Chinese scene text recognition.

Disclosure of Invention

In view of the foregoing, the present invention is proposed to provide a training method for a recognition network that recognizes Chinese in a scene, a training apparatus for such a recognition network, a computer storage medium, and a computing device that overcome or at least partially solve the foregoing problems.

According to an aspect of the embodiments of the present invention, there is provided a training method for a recognition network for recognizing Chinese in a scene, including:

randomly generating a first corpus sample by using commonly used Chinese characters;

synthesizing the first corpus sample with a first background image to obtain a first synthesized scene image sample containing a Chinese character region; and

training a recognition network for recognizing Chinese in a scene by using the first synthesized scene image sample.

Optionally, in the first corpus sample, the occurrence frequency of each Chinese character is controllable.

Optionally, in the first corpus sample, the occurrence frequencies of all Chinese characters are controlled to be equal.

Optionally, before randomly generating the first corpus sample by using the commonly used Chinese characters, the method further comprises:

obtaining the commonly used Chinese characters from a codebook used for Chinese character input.

Optionally, the method further comprises:

acquiring a corpus with real semantic information;

synthesizing the corpus with real semantic information with a second background image to obtain a second synthesized scene image sample containing a Chinese character region; and

training the recognition network by using the second synthesized scene image sample.

Optionally, the first background image is the same as the second background image.

Optionally, obtaining the corpus with real semantic information includes:

extracting text of a specific length from text material containing natural semantics as the corpus with real semantic information.

Optionally, the method further comprises:

acquiring real scene image data; and

adjusting the parameters of the recognition network by using the real scene image data.

Optionally, acquiring the real scene image data includes:

labeling a real scene image and cropping out the Chinese character region in the real scene image.

Optionally, the recognition network is used for recognizing Chinese in natural scenes.

According to another aspect of the embodiments of the present invention, there is also provided a training apparatus for a recognition network for recognizing Chinese in a scene, including:

a random corpus generation module adapted to randomly generate a first corpus sample by using commonly used Chinese characters;

an image sample synthesis module adapted to synthesize the first corpus sample with a first background image to obtain a first synthesized scene image sample containing a Chinese character region; and

a recognition network training module adapted to train a recognition network for recognizing Chinese in a scene by using the first synthesized scene image sample.

Optionally, the random corpus generation module is further adapted to:

obtain the commonly used Chinese characters from a codebook used for Chinese character input before randomly generating the first corpus sample by using the commonly used Chinese characters.

Optionally, the apparatus further comprises:

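a real corpus acquisition module adapted to acquire a corpus with real semantic information;

the image sample synthesis module is further adapted to synthesize the corpus with real semantic information with a second background image to obtain a second synthesized scene image sample containing a Chinese character region; and

the recognition network training module is further adapted to train the recognition network by using the second synthesized scene image sample.

Optionally, the real corpus acquisition module is further adapted to extract text of a specific length from text material containing natural semantics as the corpus with real semantic information.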

Optionally, the apparatus further comprises:

a real scene data acquisition module adapted to acquire real scene image data; and

a recognition network adjustment module adapted to adjust the parameters of the recognition network by using the real scene image data.

Optionally, the real scene data acquisition module is further adapted to label a real scene image and crop out the Chinese character region in the real scene image.

According to yet another aspect of the embodiments of the present invention, there is also provided a computer storage medium storing computer program code which, when run on a computing device, causes the computing device to perform the training method for a recognition network for recognizing Chinese in a scene according to any of the above.

According to yet another aspect of the embodiments of the present invention, there is also provided a computing device including:

a processor; and

a memory storing computer program code;

wherein the computer program code, when executed by the processor, causes the computing device to perform the training method for a recognition network for recognizing Chinese in a scene according to any of the above.

According to the training method and apparatus provided by the embodiments of the present invention, a corpus sample is randomly generated by using commonly used Chinese characters, the corpus sample is synthesized with a background image to obtain a synthesized scene image sample containing a Chinese character region, and the recognition network is trained by using the synthesized scene image sample. In natural corpus material, only a small portion of the commonly used Chinese characters appear frequently, while the others appear rarely or not at all (the so-called long-tail distribution). A recognition network trained on natural corpus material therefore cannot recognize well the Chinese characters that occur infrequently in the corpus. In a randomly generated corpus sample, by contrast, the occurrence probabilities of the commonly used Chinese characters tend to be uniform. When the recognition network is trained on scene image samples synthesized from randomly generated corpus samples, it sees all commonly used Chinese characters at roughly the same frequency, which alleviates the long-tail distribution problem of Chinese characters to a certain extent and improves the recognition of Chinese characters in a scene.

Furthermore, controlling the occurrence frequencies of all Chinese characters in the randomly generated corpus sample to be equal addresses the long-tail distribution problem of Chinese characters even more effectively.

Furthermore, after first-stage training is performed on the recognition network by using scene image samples synthesized from randomly generated corpus samples, second-stage training may be performed by using scene image samples synthesized from a corpus with real semantic information, and finally the recognition network is fine-tuned by using real scene image data. This multi-stage training strategy further improves the generalization capability of the recognition network and the recognition of Chinese characters in a scene.

The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.

The above, as well as additional objectives, advantages, and features of the present invention will become apparent to those skilled in the art from the following detailed description of a specific embodiment of the present invention when read in conjunction with the accompanying drawings.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:

FIG. 1 is a flowchart of a training method for a recognition network for recognizing Chinese in a scene according to an embodiment of the invention;

FIG. 2 is a flowchart of a training method for a recognition network for recognizing Chinese in a scene according to another embodiment of the invention;

FIG. 3 is a schematic structural diagram of a training apparatus for a recognition network for recognizing Chinese in a scene according to an embodiment of the invention; and

FIG. 4 is a schematic structural diagram of a training apparatus for a recognition network for recognizing Chinese in a scene according to another embodiment of the invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The current mainstream scene text recognition scheme extracts features of an image text region by using a CRNN (Convolutional Recurrent Neural Network), which combines the ability of a CNN (Convolutional Neural Network) to extract spatial feature information from the image with the ability of an RNN (Recurrent Neural Network) to encode sequential information. The encoding result of the text region is then decoded by using CTC (Connectionist Temporal Classification) to obtain the corresponding text.
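To make the CTC decoding step more concrete, the following is a minimal sketch of greedy (best-path) CTC decoding: take the highest-scoring label in each frame, merge consecutive repeats, and drop the blank label. The blank index 0, the function name `ctc_greedy_decode`, and the toy character set are assumptions chosen for illustration; the patent does not specify a decoding procedure.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      charset: Sequence[str]) -> str:
    """Best-path decoding: argmax per frame, merge repeats, drop blanks.

    frame_scores[t][k] is the score of label k at time step t.
    Label k > 0 maps to charset[k - 1]; label 0 is the blank.
    """
    # Pick the highest-scoring label for every frame.
    best_path: List[int] = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    decoded: List[str] = []
    previous = BLANK
    for label in best_path:
        # Emit a character only when the label is not blank and differs
        # from the label of the previous frame.
        if label != BLANK and label != previous:
            decoded.append(charset[label - 1])
        previous = label
    return "".join(decoded)
```

A blank between two identical labels keeps them as two separate characters, which is how CTC represents doubled characters in the output.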

In this field, English text recognition has been widely studied by scholars at home and abroad; many recognition schemes have been proposed in succession, with good results. In the English setting there are only 26 letters, so that even with digits added the total is only a few dozen symbols, and words are separated by spaces. Chinese characters, by contrast, are square-shaped; within a sentence the boundaries between characters are not obvious (especially when character shapes are similar), and there are no clear spaces. In particular, although there are about 5000-6000 commonly used Chinese characters, in a natural corpus (for example, a book) usually about 80% of the text consists of the few hundred most frequently used characters, while the other several thousand characters rarely appear. This is the so-called long-tail distribution of Chinese characters. In summary, Chinese scene text recognition is characterized by no special spacing between characters, a large number of characters, similar character shapes, and a long-tail corpus distribution, so directly migrating an English recognition scheme to the Chinese setting rarely achieves the expected results.

To solve the above technical problems, an embodiment of the invention provides a training method for a recognition network for recognizing Chinese in a scene. FIG. 1 is a flowchart of this training method according to an embodiment of the invention. Referring to FIG. 1, the method may include at least the following steps S102 to S106.
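The long-tail effect described above can be measured on any text. The following minimal sketch counts how many of the most frequent characters are needed to cover a given share of a corpus; the function name `chars_needed_for_coverage` and the toy input are illustrative assumptions, not data from the patent.

```python
from collections import Counter


def chars_needed_for_coverage(text: str, coverage: float = 0.8) -> int:
    """Return how many top-frequency characters cover `coverage` of the text."""
    # Count every non-whitespace character.
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk characters from most to least frequent, accumulating coverage.
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= coverage:
            return rank
    return len(counts)
```

On a natural corpus the returned rank is typically far smaller than the number of distinct characters, which is the imbalance the random corpus is meant to remove.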

Step S102, randomly generating a first corpus sample by using the common Chinese characters.

Step S104, the first corpus sample and the first background image are synthesized to obtain a first synthesized scene image sample containing Chinese character areas.

Step S106, training a recognition network for recognizing Chinese in the scene by using the first synthesized scene image sample.

In the embodiment of the invention, the recognition network is a deep learning network that adopts a CRNN architecture combined with CTC and mainly recognizes Chinese characters in natural scenes.

According to the training method provided by the embodiment of the present invention, a corpus sample is randomly generated by using commonly used Chinese characters, the corpus sample is synthesized with a background image to obtain a synthesized scene image sample containing a Chinese character region, and the recognition network is trained by using the synthesized scene image sample. Because the occurrence probabilities of the commonly used Chinese characters in a randomly generated corpus sample tend to be uniform, a recognition network trained on scene image samples synthesized from such corpus samples sees all commonly used Chinese characters at roughly the same frequency. This alleviates the long-tail distribution problem of Chinese characters to a certain extent and improves the recognition of Chinese characters in a scene.

In step S102 above, the first corpus sample is generated by randomly combining commonly used Chinese characters. To make the distribution of Chinese characters in the generated corpus sample tend to be uniform, a sufficiently large set of commonly used Chinese characters can be used, for example 5000-6000 commonly used Chinese characters.

Alternatively, the commonly used Chinese characters may be obtained from a codebook used for Chinese character input (e.g., the codebook of the Sogou Chinese input method). Preferably, the Chinese characters with the highest usage frequency in the codebook are selected.
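As an illustration of selecting the most frequently used characters from a codebook, the sketch below assumes a hypothetical plain-text format with one `character frequency` pair per line. Real input-method codebooks use their own formats, so the parsing here, like the function name `top_characters`, is an assumption for illustration only.

```python
from typing import Iterable, List, Tuple


def top_characters(codebook_lines: Iterable[str], limit: int) -> List[str]:
    """Pick the `limit` single Chinese characters with the highest frequency."""
    entries: List[Tuple[str, int]] = []
    for line in codebook_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines
        char, freq = parts
        # Keep only single characters in the basic CJK block with a
        # decimal frequency value; multi-character words are skipped.
        if len(char) != 1 or not ("\u4e00" <= char <= "\u9fff"):
            continue
        if not freq.isdecimal():
            continue
        entries.append((char, int(freq)))
    # Highest frequency first; the sort is stable for equal frequencies.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return [char for char, _ in entries[:limit]]
```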

In a preferred embodiment, to further solve the problem of long-tail distribution of the corpus, the occurrence frequency of each Chinese character in the randomly generated first corpus sample can be controlled, so that the text distribution in the corpus meets the requirement.

Furthermore, the occurrence frequencies of all Chinese characters in the first corpus sample are controlled to be equal, so that the Chinese characters in the corpus are uniformly distributed.
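One simple way to obtain a random corpus in which every character occurs equally often is to build a pool containing each character the same number of times, shuffle it, and cut it into fixed-length samples. The following is a minimal sketch under that assumption; the function name, the `repeats` and `sample_len` parameters, and the shuffling approach are illustrative choices, since the patent does not prescribe a specific procedure.

```python
import random
from typing import List, Optional, Sequence


def generate_uniform_corpus(chars: Sequence[str],
                            repeats: int,
                            sample_len: int,
                            seed: Optional[int] = None) -> List[str]:
    """Generate random text samples in which every character appears
    exactly `repeats` times across the whole corpus."""
    rng = random.Random(seed)
    # Each character contributes exactly `repeats` copies to the pool.
    pool = [ch for ch in chars for _ in range(repeats)]
    rng.shuffle(pool)
    # Cut the shuffled pool into samples; the last one may be shorter.
    return ["".join(pool[i:i + sample_len])
            for i in range(0, len(pool), sample_len)]
```

Because the pool is fixed before shuffling, the equal frequencies are exact rather than only expected, which matches the stricter of the two options described above.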

In step S104, the first background image may be an image of a real scene without text, and the first corpus sample is fused into the first background image to obtain a first synthesized scene image sample.
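The fusion of text into a background can be pictured as alpha blending. The sketch below assumes the text has already been rasterized into a mask of opacity values between 0 and 1 (in practice a font renderer from an imaging library would produce it) and blends that mask into a grayscale background at a chosen offset. The data layout, the function name `composite_text`, and the blending rule are assumptions for illustration; the patent does not describe how the synthesis is implemented.

```python
from typing import List

Image = List[List[int]]      # grayscale pixels, 0-255
Mask = List[List[float]]     # text opacity per pixel, 0.0-1.0


def composite_text(background: Image, text_mask: Mask,
                   top: int, left: int, ink: int = 0) -> Image:
    """Blend a rasterized text mask into a copy of the background."""
    out = [row[:] for row in background]  # leave the input untouched
    for r, mask_row in enumerate(text_mask):
        for c, alpha in enumerate(mask_row):
            y, x = top + r, left + c
            # Ignore mask pixels that fall outside the background.
            if 0 <= y < len(out) and 0 <= x < len(out[y]):
                blended = (1 - alpha) * out[y][x] + alpha * ink
                out[y][x] = round(blended)
    return out
```

The returned image, paired with the text that produced the mask, forms one labeled training sample.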

In step S106, the recognition network is trained by using the obtained first synthesized scene image sample, so that during training the recognition network sees all commonly used Chinese characters at a consistent frequency. When the trained recognition network is then used to recognize Chinese characters in a scene, it recognizes them better and more accurately, especially the Chinese characters with lower usage frequency.

In an alternative embodiment of the present invention, after training the recognition network using the first synthesized scene image sample synthesized based on the first corpus sample generated randomly, the following steps may be further performed:

First, a corpus with real semantic information is obtained. And then, synthesizing the corpus with the real semantic information with a second background image to obtain a second synthesized scene image sample containing the Chinese text region. Finally, training the recognition network by using the second synthesized scene image sample.

Chinese recognition can be further improved by first training the recognition network on scene image samples synthesized from randomly generated corpus samples (which may be called first-stage training) and then training it on scene image samples synthesized from a corpus with real semantic information (which may be called second-stage training).

Alternatively, to simplify the composition of the scene image samples and the training operation of the recognition network, the second background image may employ the same image of the real scene as the first background image.

In practical applications, there may be various ways to obtain corpus with real semantic information. For example, text of a particular length may be truncated from text material containing natural semantics as a corpus with real semantic information. The text material may be, for example, news, books, etc.
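Extracting fixed-length text from natural-language material can be sketched as follows. The random choice of start positions, the whitespace stripping, and the function name `sample_segments` are assumptions made for illustration; the patent only states that text of a specific length is taken from the material.

```python
import random
from typing import List, Optional


def sample_segments(text: str, length: int, count: int,
                    seed: Optional[int] = None) -> List[str]:
    """Cut `count` segments of `length` characters from natural text."""
    # Remove whitespace so segments do not contain line breaks or spaces.
    cleaned = "".join(text.split())
    if length <= 0 or len(cleaned) < length:
        return []
    rng = random.Random(seed)
    last_start = len(cleaned) - length
    starts = [rng.randrange(last_start + 1) for _ in range(count)]
    return [cleaned[start:start + length] for start in starts]
```

Each segment keeps the character order of the source material, so the resulting samples carry the real semantic context that the random corpus lacks.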

In an alternative embodiment of the present invention, after training the recognition network with the first synthesized scene image samples synthesized based on the first corpus sample generated randomly or after training the recognition network with the second synthesized scene image samples synthesized based on the corpus with real semantic information, the following steps may be further performed:

Real scene image data is acquired, and the parameters of the recognition network are then adjusted by using the real scene image data.

Further, the real scene image data may be obtained by labeling a real scene image and cropping out the Chinese character region in the real scene image.

Fine-tuning the parameters of the recognition network with a data set of real scene images containing Chinese improves the generalization capability of the recognition network and further improves Chinese character recognition.
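The labeling-and-cropping step described for the method (annotate a real scene image, then cut out its Chinese character regions) can be sketched as below. The annotation format, a dictionary with a `bbox` of `(top, left, bottom, right)` pixel coordinates and a `text` label, is a hypothetical layout chosen for illustration, as is the function name `crop_text_regions`.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # grayscale pixels as rows of values


def crop_text_regions(image: Image,
                      annotations: List[Dict]) -> List[Tuple[Image, str]]:
    """Cut every annotated text region out of a scene image.

    Each annotation holds "bbox" = (top, left, bottom, right), with the
    bottom and right bounds exclusive, and "text" = the transcription.
    """
    samples: List[Tuple[Image, str]] = []
    for annotation in annotations:
        top, left, bottom, right = annotation["bbox"]
        # Slice the rows first, then the columns inside each row.
        crop = [row[left:right] for row in image[top:bottom]]
        samples.append((crop, annotation["text"]))
    return samples
```

The resulting (crop, text) pairs form the real scene data set used for fine-tuning.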

Having described various implementations of each step of the embodiment shown in FIG. 1, the implementation of the training method for a recognition network for recognizing Chinese in a scene is described in detail below through a specific embodiment.

FIG. 2 is a flowchart of a training method for a recognition network for recognizing Chinese in a scene according to an embodiment of the present invention. In this embodiment, the recognition network is a deep learning network that adopts a CRNN architecture combined with CTC. Referring to FIG. 2, the method may include at least the following steps S202 to S216.

Step S202, obtaining common Chinese characters from a codebook for Chinese character input, and randomly generating a first corpus sample by using the common Chinese characters, wherein the occurrence frequencies of all Chinese characters in the first corpus sample are controlled to be equal.

Step S204, the first corpus sample and the first background image are synthesized to obtain a first synthesized scene image sample containing Chinese character areas.

Step S206, training the recognition network for recognizing Chinese in the natural scene in a first stage by using the first synthesized scene image sample.

Step S208, extracting text of a specific length from text material containing natural semantics as a corpus with real semantic information.

The text material is, for example, news material, books, or the like.

Step S210, synthesizing the corpus with real semantic information with a second background image to obtain a second synthesized scene image sample containing a Chinese character region, wherein the second background image is identical to the first background image.

Step S212, performing second-stage training on the recognition network by using the second synthesized scene image sample.

Step S214, labeling a real scene image and cropping out the Chinese character region in the real scene image to obtain a real scene image data set.

Step S216, fine-tuning the parameters of the recognition network by using the real scene image data set.

In this embodiment, the multi-stage training strategy effectively addresses the long-tail character problem in Chinese scene text recognition and improves the recognition of Chinese characters in natural scenes.
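The order of the three training stages in steps S202 to S216 can be sketched as a simple schedule runner. Everything concrete here is an assumption for illustration: the `Stage` container, the `train_step` callback supplied by the caller, and the idea of a per-stage learning rate. The patent specifies the order of the stages but not any hyperparameters.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Sample = Tuple[Any, str]  # (image, ground-truth text)


@dataclass
class Stage:
    name: str
    samples: List[Sample]
    learning_rate: float  # hypothetical per-stage setting


def run_schedule(model: Any,
                 stages: Sequence[Stage],
                 train_step: Callable[[Any, Any, str, float], None]) -> List[str]:
    """Train the same model on each stage in order and log stage names."""
    completed: List[str] = []
    for stage in stages:
        for image, label in stage.samples:
            # The caller-supplied step performs one parameter update.
            train_step(model, image, label, stage.learning_rate)
        completed.append(stage.name)
    return completed
```

The same model object is passed through all stages, so each stage starts from the parameters left by the previous one, which is what makes the final stage a fine-tuning step.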

Based on the same inventive concept, an embodiment of the invention also provides a training apparatus for a recognition network for recognizing Chinese in a scene, which supports the training method provided by any one of the above embodiments or a combination thereof. FIG. 3 shows a schematic structural diagram of a training apparatus 300 for a recognition network for recognizing Chinese in a scene according to an embodiment of the present invention. Referring to FIG. 3, the apparatus 300 may include at least a random corpus generation module 310, an image sample synthesis module 320, and a recognition network training module 330.

The functions of the components of the training apparatus 300 and the connections between them are described below:

The random corpus generation module 310 is adapted to randomly generate a first corpus sample by using commonly used Chinese characters.

The image sample synthesis module 320 is connected to the random corpus generation module 310 and is adapted to synthesize the first corpus sample with a first background image to obtain a first synthesized scene image sample containing a Chinese character region.

The recognition network training module 330 is connected to the image sample synthesis module 320 and is adapted to train a recognition network for recognizing Chinese in a scene by using the first synthesized scene image sample.

In an alternative embodiment of the present invention, the occurrence frequency of each Chinese character in the obtained first corpus sample is controllable.

Further, in the obtained first corpus sample, the occurrence frequencies of all Chinese characters are controlled to be equal.

In an alternative embodiment of the invention, the random corpus generation module 310 is further adapted to:

obtain the commonly used Chinese characters from a codebook used for Chinese character input before randomly generating the first corpus sample by using the commonly used Chinese characters.

In an alternative embodiment of the present invention, as shown in FIG. 4, the training apparatus 300 illustrated in FIG. 3 may further include a real corpus acquisition module 340. The real corpus acquisition module 340 may be connected to the image sample synthesis module 320 and is adapted to acquire a corpus with real semantic information. Accordingly, the image sample synthesis module 320 is further adapted to synthesize the corpus with real semantic information with a second background image to obtain a second synthesized scene image sample containing a Chinese character region. The recognition network training module 330 is further adapted to train the recognition network by using the second synthesized scene image sample.

In an alternative embodiment of the invention, the first background image is identical to the second background image.

In an alternative embodiment of the present invention, the real corpus acquisition module 340 is further adapted to:

extract text of a specific length from text material containing natural semantics as the corpus with real semantic information.

In an alternative embodiment of the present invention, still referring to FIG. 4, the training apparatus 300 may further include a real scene data acquisition module 350 and a recognition network adjustment module 360. The real scene data acquisition module 350 is adapted to acquire real scene image data. The recognition network adjustment module 360 may be connected to the real scene data acquisition module 350 and the recognition network training module 330, respectively, and is adapted to adjust the parameters of the recognition network by using the real scene image data.

In an alternative embodiment of the invention, the real scene data acquisition module 350 is further adapted to label a real scene image and crop out the Chinese character region in the real scene image.

In an alternative embodiment of the invention, the recognition network is used to recognize Chinese in natural scenes.

Based on the same inventive concept, an embodiment of the invention also provides a computer storage medium. The computer storage medium stores computer program code which, when run on a computing device, causes the computing device to perform the training method for a recognition network for recognizing Chinese in a scene according to any one or a combination of the above embodiments.

Based on the same inventive concept, an embodiment of the invention also provides a computing device. The computing device may include:

a processor; and

a memory storing computer program code;

wherein the computer program code, when executed by the processor, causes the computing device to perform the training method for a recognition network for recognizing Chinese in a scene according to any one or a combination of the above embodiments.

According to any one of the optional embodiments or the combination of multiple optional embodiments, the following beneficial effects can be achieved according to the embodiment of the invention:

After first-stage training is performed on the recognition network by using scene image samples synthesized from randomly generated corpus samples, second-stage training may be performed by using scene image samples synthesized from a corpus with real semantic information, and finally the recognition network is fine-tuned by using real scene image data. This multi-stage training strategy further improves the generalization capability of the recognition network and the recognition of Chinese characters in a scene.

It will be clear to those skilled in the art that the specific working procedures of the above-described systems, devices and units may refer to the corresponding procedures in the foregoing method embodiments, and are not repeated herein for brevity.

In addition, each functional unit in the embodiments of the present invention may be physically independent, two or more functional units may be integrated together, or all functional units may be integrated in one processing unit. The integrated functional units may be implemented in hardware or in software or firmware.

Those skilled in the art will appreciate that the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored on a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied, in essence or in whole or in part, in the form of a software product stored in a storage medium, comprising instructions that, when executed, cause a computing device (e.g., a personal computer, a server, or a network device) to perform all or part of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Alternatively, all or part of the steps of the foregoing method embodiments may be implemented by hardware (such as a personal computer, a server, a network device, or another computing device) under the control of program instructions. The program instructions may be stored in a computer-readable storage medium and, when executed by a processor of the computing device, cause the computing device to perform all or part of the steps of the methods of the embodiments of the present invention.

It should be noted that the above embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that modifications may be made to the technical solution described in the above embodiments or equivalents may be substituted for some or all of the technical features thereof without departing from the scope of the present invention.

Claims

1. A method for training a recognition network to identify Chinese characters within a scene, comprising:

randomly generating a first corpus sample using commonly used Chinese characters, wherein the frequency of occurrence of each Chinese character in the first corpus sample is controlled to be equal;

synthesizing the first corpus sample with a first background image to obtain a first synthetic scene image sample containing Chinese text regions; and

performing first-stage training on the recognition network for identifying Chinese characters in the scene using the first synthetic scene image sample;

the method further comprising:

obtaining a corpus with real semantic information;

synthesizing the corpus with real semantic information with a second background image to obtain a second synthetic scene image sample containing Chinese text regions; and

performing second-stage training on the recognition network using the second synthetic scene image sample;

the method further comprising:

annotating real-world scene images and cropping out the Chinese text regions in the real-world scene images; and

fine-tuning the parameters of the recognition network trained in the second stage using a dataset of real-world scene images containing Chinese characters.
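
As an illustrative aside to the first step of claim 1, one way to keep the frequency of every character equal is to repeat each character the same number of times, shuffle the result, and cut it into fixed-length strings. The following is a minimal Python sketch under that assumption. The function name `generate_balanced_corpus`, its parameters, and the ten-character example set are hypothetical and are not taken from the patent.

```python
import random
from collections import Counter


def generate_balanced_corpus(charset, repeats, sample_len, seed=None):
    """Return random strings in which every character of `charset`
    occurs exactly `repeats` times across all strings combined.

    If len(charset) * repeats is not a multiple of `sample_len`,
    the last string is shorter than the others.
    """
    rng = random.Random(seed)
    # Each character appears the same number of times in the pool.
    pool = list(charset) * repeats
    # Shuffling randomizes the order without changing the counts.
    rng.shuffle(pool)
    # Cut the shuffled pool into consecutive fixed-length samples.
    return [
        "".join(pool[i:i + sample_len])
        for i in range(0, len(pool), sample_len)
    ]


if __name__ == "__main__":
    # Hypothetical stand-in for a full set of commonly used characters.
    common_chars = "的一是在不了有和人这"
    corpus = generate_balanced_corpus(common_chars, repeats=6, sample_len=5, seed=0)
    print(corpus)
    # Every character has the same count, so no long tail arises.
    print(Counter("".join(corpus)))
```

Because the counts are fixed before shuffling, the equal-frequency property holds exactly rather than only in expectation, which is one possible reading of "controlled to be equal".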

2. The method according to claim 1, wherein, before randomly generating the first corpus sample using commonly used Chinese characters, the method further comprises:

obtaining the commonly used Chinese characters from a codebook used for Chinese character input.

3. The method according to claim 1, wherein the first background image is the same as the second background image.

4. The method according to claim 1, wherein obtaining the corpus with real semantic information comprises:

extracting text of a specified length from text material containing natural semantics to form the corpus with real semantic information.
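
As an illustrative aside to claim 4, extracting text of a specified length can be sketched as sampling fixed-length windows from a body of natural text. The Python sketch below assumes whitespace is discarded and window start positions are chosen uniformly at random. The function name `sample_snippets` and its parameters are hypothetical, and the patent does not prescribe how the extraction positions are chosen.

```python
import random


def sample_snippets(text, length, count, seed=None):
    """Return `count` substrings of exactly `length` characters,
    taken at random positions from `text` after removing whitespace."""
    rng = random.Random(seed)
    # Remove spaces and line breaks so snippets contain only content.
    cleaned = "".join(text.split())
    if len(cleaned) < length:
        raise ValueError("text is shorter than the requested snippet length")
    snippets = []
    for _ in range(count):
        # Choose a start index such that the window fits inside the text.
        start = rng.randrange(len(cleaned) - length + 1)
        snippets.append(cleaned[start:start + length])
    return snippets


if __name__ == "__main__":
    material = "今天天气很好，\n我们一起去公园散步。"
    print(sample_snippets(material, length=4, count=3, seed=1))
```

Each snippet keeps the character order of the source material, so the second-stage samples carry the natural co-occurrence of characters that the randomly generated first corpus lacks.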

5. The method according to any one of claims 1-4, wherein the recognition network is used to recognize Chinese characters in a natural scene.

6. A training device for a recognition network that identifies Chinese characters within a scene, comprising:

a random corpus generation module, adapted to randomly generate a first corpus sample using commonly used Chinese characters, wherein the frequency of occurrence of each Chinese character in the first corpus sample is controlled to be equal;

an image sample synthesis module, adapted to synthesize the first corpus sample with a first background image to obtain a first synthetic scene image sample containing Chinese text regions; and

a recognition network training module, adapted to perform first-stage training on the recognition network for recognizing Chinese characters within the scene using the first synthetic scene image sample;

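the device further comprising:

a real corpus acquisition module, adapted to obtain a corpus with real semantic information;

wherein the image sample synthesis module is further adapted to synthesize the corpus with real semantic information with a second background image to obtain a second synthetic scene image sample containing Chinese text regions;

the recognition network training module is further adapted to perform second-stage training on the recognition network using the second synthetic scene image sample; and

the recognition network training module is further adapted to annotate real scene images and crop out the Chinese text regions in the real scene images, and to fine-tune the parameters of the recognition network after the second-stage training using a dataset of real scene images containing Chinese characters.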
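
As an illustrative aside, the order of work assigned to the recognition network training module (first-stage training, second-stage training, then fine-tuning) can be sketched as a loop that carries the network parameters from one stage into the next. The Python sketch below is framework-agnostic. The function `run_staged_training`, the stage names, and the learning rates in the usage example are hypothetical assumptions, and `train_one_stage` stands in for whatever optimizer loop an implementation actually uses.

```python
def run_staged_training(params, stages, train_one_stage):
    """Run the training stages in order.

    `stages` is a sequence of (name, dataset, learning_rate) tuples.
    `train_one_stage(params, dataset, learning_rate)` must return the
    updated parameters; they become the starting point of the next stage.
    Returns the final parameters and the list of completed stage names.
    """
    completed = []
    for name, dataset, learning_rate in stages:
        # Each stage starts from the parameters left by the previous one.
        params = train_one_stage(params, dataset, learning_rate)
        completed.append(name)
    return params, completed


if __name__ == "__main__":
    # Stub trainer: records what it was asked to do instead of training.
    def stub_trainer(params, dataset, learning_rate):
        return params + [(len(dataset), learning_rate)]

    stages = [
        ("random_synthetic", ["img1", "img2", "img3"], 1e-3),  # first stage
        ("semantic_synthetic", ["img4", "img5"], 1e-3),        # second stage
        ("real_finetune", ["img6"], 1e-4),                     # fine-tuning
    ]
    print(run_staged_training([], stages, stub_trainer))
```

The smaller learning rate for the last stage is only a common fine-tuning convention and is assumed here. The patent text states the order of the stages but not their hyperparameters.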

7. The device according to claim 6, wherein the random corpus generation module is further adapted to:

obtain the commonly used Chinese characters from a codebook used for Chinese character input before randomly generating the first corpus sample using the commonly used Chinese characters.

8. The device according to claim 6, wherein the first background image is the same as the second background image.

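9. The device according to claim 6, wherein the real corpus acquisition module is further adapted to:

extract text of a specified length from text material containing natural semantics to form the corpus with real semantic information.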

10. The device according to any one of claims 6-9, wherein the recognition network is used to recognize Chinese characters in a natural scene.

11. A computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute the training method for a recognition network for recognizing Chinese in a scene according to any one of claims 1-5.

12. A computing device, comprising:

a processor; and

a memory storing computer program code;

wherein the computer program code, when run by the processor, causes the computing device to execute the training method for a recognition network for recognizing Chinese in a scene according to any one of claims 1-5.