CN115082598B - Text image generation, training, text image processing method and electronic equipment - Google Patents
- Publication number
- CN115082598B (application CN202211015424.6A)
- Authority
- CN
- China
- Prior art keywords
- sample text
- text image
- sample
- output result
- subset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; G06T11/00—2D [Two Dimensional] image generation; G06T11/60—Editing figures and text; Combining figures or text
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks; G06N3/08—Learning methods
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING; G06V10/70—Arrangements using pattern recognition or machine learning; G06V10/77—Processing image or video features in feature spaces; G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING; G06V10/70—Arrangements using pattern recognition or machine learning; G06V10/77—Processing image or video features in feature spaces; G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
Abstract
The invention provides a text image generation method, a training method, a text image processing method, and electronic equipment, relating to the technical field of artificial intelligence. The scheme is as follows: a sample text image set is divided into at least one sample text image subset according to the sample text output result set and the sample label set of the sample text image set; a target cropping position set for the to-be-cropped sample text image set is determined according to the sample text output result set of the to-be-cropped set; the to-be-cropped set is cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and a target sample text image set is obtained from the at least one cropped subset and the at least one sample text image subset. The method effectively ensures the accuracy of the target cropping positions, effectively avoids damaging character information, and increases the background complexity and image diversity of the sample text images in the target sample text image set.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to computer vision and deep learning, and can be applied to optical character recognition scenarios. More specifically, it relates to a text image generation method, a training method, a text image processing method, and electronic equipment.
Background
With the development of computer technology, artificial intelligence technology has also been developed. Artificial intelligence techniques may include computer vision techniques, speech recognition techniques, natural language processing techniques, machine learning, deep learning, big data processing techniques, knowledge graph techniques, and the like.
Artificial intelligence techniques are widely used in various fields. For example, artificial intelligence techniques may be utilized to generate text images for training the deep learning model.
Disclosure of Invention
The invention provides a text image generation, training and text image processing method and electronic equipment.
According to an aspect of the present invention, there is provided a text image generation method, including: dividing a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, where the at least one sample text image subset includes a first sample text image subset containing the sample text images whose sample text output results are correct; determining a target cropping position set for a to-be-cropped sample text image set according to the sample text output result set of the to-be-cropped set, where the to-be-cropped set is determined from the first sample text image subset; cropping the to-be-cropped set based on the target cropping position set to obtain at least one cropped sample text image subset; and obtaining a target sample text image set from the at least one cropped sample text image subset and the at least one sample text image subset.
According to another aspect of the present invention, there is provided a training method of a deep learning model, including: acquiring a target sample text image set; and training the deep learning model by using the target sample text image set to obtain a text image processing model, wherein the target sample text image set is obtained by using the method according to the invention.
According to another aspect of the present invention, there is provided a text image processing method including: acquiring a text image to be processed; and inputting the text image to be processed into a text image processing model to obtain a text image processing result, wherein the text image processing model is trained by the method according to the invention.
According to another aspect of the present invention, there is provided a text image generating apparatus, including: a dividing module for dividing the sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, where the at least one sample text image subset includes a first sample text image subset containing the sample text images whose sample text output results are correct; a determining module for determining a target cropping position set for the to-be-cropped sample text image set according to the sample text output result set of the to-be-cropped set, where the to-be-cropped set is determined from the first sample text image subset; a first obtaining module for cropping the to-be-cropped set based on the target cropping position set to obtain at least one cropped sample text image subset; and a second obtaining module for obtaining a target sample text image set from the at least one cropped sample text image subset and the at least one sample text image subset.
According to another aspect of the present invention, there is provided a training apparatus of a deep learning model, including: the first acquisition module is used for acquiring a target sample text image set; and a third obtaining module, configured to train the deep learning model by using the target sample text image set to obtain a text image processing model, where the target sample text image set is obtained by using the apparatus according to the present invention.
According to another aspect of the present invention, there is provided a text image processing apparatus, including: a second acquisition module for acquiring a text image to be processed; and a fourth obtaining module for inputting the text image to be processed into a text image processing model to obtain a text image processing result, where the text image processing model is trained using the apparatus according to the present invention.
According to another aspect of the present invention, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to the present invention.
According to another aspect of the present invention, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method according to the present invention.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
The drawings are included to provide a better understanding of the present invention and are not to be construed as limiting the invention. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture that may be used for a text image generation method, a training method for a deep learning model, and a text image processing method and apparatus according to an embodiment of the present invention;
FIG. 2 schematically illustrates a flow chart of a text image generating method according to an embodiment of the invention;
FIG. 3A schematically illustrates a text image generating method according to an embodiment of the present invention;
FIG. 3B schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to an embodiment of the invention;
FIG. 3C schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the invention;
FIG. 3D schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the invention;
FIG. 3E schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present invention;
FIG. 4 schematically illustrates a flow chart of a training method of a deep learning model according to an embodiment of the invention;
FIG. 5 schematically shows a flow chart of a text image processing method according to an embodiment of the invention;
FIG. 6 schematically shows a block diagram of a text image generating apparatus according to an embodiment of the present invention;
FIG. 7 schematically illustrates a block diagram of a training apparatus of a deep learning model according to an embodiment of the present invention;
FIG. 8 schematically shows a block diagram of a text image processing apparatus according to an embodiment of the present invention; and
FIG. 9 schematically shows a block diagram of an electronic device adapted to implement the text image generating method, the training method of the deep learning model, and the text image processing method according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 schematically illustrates an exemplary system architecture in which a text image generating method, a training method of a deep learning model, and a text image processing method and apparatus according to an embodiment of the present invention may be implemented.
It should be noted that FIG. 1 is only an example of a system architecture to which embodiments of the present invention may be applied, given to help those skilled in the art understand the technical content of the invention; it does not mean that embodiments of the invention may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, the system architecture to which the text image generation method, the training method of the deep learning model, and the text image processing method and apparatus may be applied may consist of a terminal device alone, the terminal device implementing these methods and apparatuses provided by the embodiments of the present invention without interacting with a server.
As shown in FIG. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired and/or wireless communication links.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, for example at least one of a knowledge-reading application, a web browser, a search application, an instant messaging tool, a mailbox client, and social platform software.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including at least one of a smartphone, a tablet, a laptop computer, a desktop computer, and the like.
The server 105 may be any of various types of servers providing various services. For example, the server 105 may be a cloud server (also called a cloud computing server or cloud host), a host product in a cloud computing service system that overcomes the difficult management and weak service scalability of traditional physical hosts and VPS (Virtual Private Server) services. The server 105 may also be a server of a distributed system or a server that incorporates a blockchain.
It should be noted that the text image generating method and the text image processing method provided by the embodiments of the present invention may be generally executed by the terminal device 101, 102, or 103. Accordingly, the text image generating apparatus and the text image processing apparatus provided by the embodiments of the present invention may also be provided in the terminal device 101, 102, or 103.
Alternatively, the text image generating method and the text image processing method provided by the embodiments of the present invention may also be generally performed by the server 105. Accordingly, the text image generating apparatus and the text image processing apparatus provided by the embodiments of the present invention may be generally provided in the server 105. The text image generating method and the text image processing method provided by the embodiments of the present invention may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the text image generating apparatus and the text image processing apparatus provided by the embodiments of the present invention may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be noted that, the training method of the deep learning model provided by the embodiment of the present invention may be generally performed by the server 105. Accordingly, the training device for deep learning model provided in the embodiment of the present invention may be generally disposed in the server 105. The training method of the deep learning model provided by the embodiment of the present invention may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the training apparatus of the deep learning model provided in the embodiment of the present invention may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
Alternatively, the training method of the deep learning model provided by the embodiment of the present invention may also be generally performed by the terminal device 101, 102, or 103. Accordingly, the training apparatus for deep learning model provided in the embodiment of the present invention may also be provided in the terminal device 101, 102, or 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
It should be noted that the sequence numbers of the respective operations in the following methods are merely representative of the operations for the purpose of description, and should not be construed as representing the order of execution of the respective operations. The method need not be performed in the exact order shown unless explicitly stated.
Fig. 2 schematically shows a flow chart of a text image generating method according to an embodiment of the invention.
As shown in FIG. 2, the method 200 includes operations S210-S240.
In operation S210, the sample text image set is divided into at least one sample text image subset according to the sample text output result set and the sample label set of the sample text image set.
In operation S220, a target cropping position set for the to-be-cropped sample text image set is determined according to the sample text output result set of the to-be-cropped sample text image set.
In operation S230, the to-be-cropped sample text image set is cropped based on the target cropping position set, resulting in at least one cropped sample text image subset.
In operation S240, a target sample text image set is obtained from the at least one cropped sample text image subset and the at least one sample text image subset.
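Purely for orientation, the following Python sketch walks through operations S210 to S240 on toy data. The text images are represented as plain character strings so the control flow stays visible, and the probability-based selection, boundary choice, and pairwise combination are illustrative assumptions rather than the claimed implementation:

```python
import random

rng = random.Random(0)

def run_pipeline(samples, crop_prob=0.5):
    """Toy sketch of S210-S240. Each sample is a dict with 'img' (the text image,
    here just a string of characters), 'output' (predicted text), 'label' (truth)."""
    # S210: divide the sample set by comparing text outputs with labels.
    first = [s for s in samples if s["output"] == s["label"]]
    second = [s for s in samples if s["output"] != s["label"]]

    # Determine the to-be-cropped set from the first (correct) subset.
    to_crop = [s for s in first if rng.random() <= crop_prob]

    # S220 + S230: pick a target position among the character boundaries of the
    # output result, then crop the sample into two segments at that position.
    segments = []
    for s in to_crop:
        if len(s["output"]) > 1:
            pos = rng.choice(range(1, len(s["output"])))
            segments += [s["img"][:pos], s["img"][pos:]]

    # S240: combine cropped segments pairwise (in practice the label of a combined
    # image is the concatenation of its segments' texts), then merge all subsets.
    rng.shuffle(segments)
    combined = [{"img": a + b} for a, b in zip(segments[::2], segments[1::2])]
    return first + second + combined

target_set = run_pipeline([
    {"img": "hello", "output": "hello", "label": "hello"},
    {"img": "world", "output": "w0rld", "label": "world"},
])
```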
According to an embodiment of the invention, the at least one sample text image subset may comprise a first sample text image subset. The first subset of sample text images may include sample text images for which the sample text output results are correct. The set of sample text images to be cropped may be determined from the first subset of sample text images.
According to an embodiment of the present invention, the text image may include at least one of: document text images and scene text images. Document text images may refer to text images that are well-laid out, light-controlled, and relatively single in background. A scene text image may refer to a text image with a relatively complex background, multiple text forms, and uncontrolled light. The text form may include at least one of: color, size, font, direction, and layout irregularities of the text, etc. The layout irregularities may include at least one of bending, tilting, creasing, deforming, and incomplete or the like.
According to an embodiment of the invention, the set of sample text images may comprise at least one sample text image. The sample text image may include at least one of: a sample document text image and a sample scene text image. The sample text image set may be an image set of a text visual task. The sample text image may be a text image of various text visual tasks. For example, the text visual task may include at least one of: a text image recognition task, a text image classification task, a text image segmentation task, a text image detection task, a text image retrieval task, and the like. Further, the text visual task may also include at least one of: a subdivision domain task corresponding to a text image recognition task, a subdivision domain task corresponding to a text image classification task, a subdivision domain task corresponding to a text image segmentation task, a subdivision domain task corresponding to a text image detection task, and a subdivision domain task corresponding to a text image retrieval task.
According to an embodiment of the present invention, for example, the subdivision domain task corresponding to the text image recognition task may include at least one of: bill image recognition task, medical text image recognition task, financial product text image recognition task, video subtitle recognition task, security monitoring recognition task, and the like. The subdivision domain task corresponding to the text image classification task may include at least one of: bill image classification tasks, medical text image classification tasks, financial product text image classification tasks, video subtitle classification tasks, security monitoring classification tasks, and the like. The subdivision domain task corresponding to the text image segmentation task may include at least one of: bill image segmentation tasks, medical text image segmentation tasks, financial product text image segmentation tasks, and the like. The subdivision domain task corresponding to the text image detection task may include at least one of: bill image detection task, medical text image detection task, financial product text image detection task, video subtitle detection task, security monitoring detection task, etc. The subdivision domain task corresponding to the text image retrieval task may include at least one of: a bill image retrieval task, a medical text image retrieval task, a financial product text image retrieval task, a video subtitle retrieval task, a security monitoring retrieval task and the like.
According to an embodiment of the present invention, there may be a sample text output result set and a sample label set corresponding to the sample text image set. The sample text output result set may include at least one sample text output result. The sample tag set may include at least one sample tag. The sample text image may have a sample text output result and a sample label corresponding to the sample text image. The sample text output result may characterize a predicted text result of the sample text image. The sample text output may include at least one of a sample text recognition output and a sample text semantic output. The sample text recognition output may characterize a predicted text recognition result of the sample text image. The sample text semantic output results may characterize predicted semantic results of the sample text image. The sample tag may characterize the true text result of the sample text image. The sample tags may include at least one of sample text identification tags and sample text semantic tags. The sample text recognition tag may characterize the true text recognition result of the sample text image. The sample text semantic tags may characterize the true semantic results of the sample text image. The text recognition result may refer to a sequence of characters included in the text image.
According to an embodiment of the invention, the sample text image set may comprise a first sample text image subset, whose images are those whose sample text output results are correct. The first subset may contain the to-be-cropped sample text image set, which includes at least one to-be-cropped sample text image, namely a sample text image in the first subset that satisfies a predetermined cropping condition. The predetermined cropping condition may be configured according to actual service requirements and is not limited here; for example, it may include the predetermined probability value corresponding to the sample text image being less than or equal to a predetermined probability threshold.
According to an embodiment of the present invention, a to-be-cropped sample text image may have at least one cropping position corresponding to it. A target cropping position is a cropping position, among the at least one cropping position, that satisfies a predetermined position condition. The predetermined position condition may be configured according to actual service requirements and is not limited here; for example, it may be the condition of being randomly selected from the at least one cropping position.
According to an embodiment of the invention, a cropped sample text image subset may comprise at least one cropped sample text image. A cropped sample text image refers to an image obtained by cropping a to-be-cropped sample text image at a target cropping position.
According to embodiments of the present invention, a sample text image set may be obtained from a data source in response to detecting a text image generation instruction. The data source may include at least one of: local databases, cloud databases, and network resources. A data interface may be invoked, and the sample text image set obtained from the data source using that interface. The sample text image set may include at least one sample text image, each of which may be a simulated sample text image or a real sample text image. A real sample text image may be a sample text image from a public dataset. A simulated sample text image is generated in one of the following ways: based on predetermined image parameters, or by a generative adversarial network model processing predetermined random noise data.
According to the embodiment of the invention, for a sample text image in the sample text image set, first local feature extraction may be performed on the sample text image to obtain a first local sample feature map; global feature extraction is performed on the first local sample feature map to obtain a global sample feature sequence; and sequence decoding is performed on the global sample feature sequence to obtain the sample text recognition output result of the sample text image. Second local feature extraction may also be performed on the sample text image to obtain a second local sample feature map, on which semantic understanding is performed to obtain the sample text semantic output result. The sample text output result of the sample text image is then obtained from at least one of the sample text recognition output result and the sample text semantic output result. For example, the sample text image may be processed by a deep learning model to obtain the sample text output result. The deep learning model may include a model capable of recognizing character sequences of indefinite length and a model capable of text semantic understanding. The model structure of the deep learning model may be configured according to actual service requirements and is not limited here. For example, the deep learning model may include at least one model structure, where a model structure comprises at least one model substructure and the connection relationships between the substructures; that is, a model structure is obtained by connecting at least one model substructure, each drawn from at least one operation layer, based on the connection relationships between the substructures. For example, the at least one operation layer may include at least one of: an input layer, a convolution layer, a hidden layer, a transcription layer, a pooling layer, a deconvolution layer, a feedforward neural network layer, an attention layer, a residual layer, a fully connected layer, a batch normalization layer, a linear embedding layer, a nonlinear layer, and the like.
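To make the data flow concrete, the following is a minimal PyTorch sketch of the two-branch processing just described. The layer types, sizes, and pooling choices are illustrative assumptions rather than the patent's architecture; the sketch only mirrors the sequence conv, recurrent, decode for recognition, plus a conv and semantic head for semantics:

```python
import torch
from torch import nn

class TextImageModel(nn.Module):
    """Illustrative two-branch model; sizes and layers are assumed, not claimed."""
    def __init__(self, num_classes, num_semantic_classes, h=32):
        super().__init__()
        # First local feature extraction -> first local sample feature map.
        self.local1 = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((h, 1)))
        # Global feature extraction along the width -> global sample feature sequence.
        self.rnn = nn.LSTM(64, 128, bidirectional=True, batch_first=True)
        # Sequence decoding -> per-column character logits (e.g. for CTC).
        self.decode = nn.Linear(256, num_classes)
        # Second local feature extraction + semantic understanding branch.
        self.local2 = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.semantic = nn.Linear(64, num_semantic_classes)

    def forward(self, x):                      # x: (batch, 1, height, width)
        f1 = self.local1(x)                    # (batch, 64, 1, width)
        seq = f1.squeeze(2).permute(0, 2, 1)   # (batch, width, 64)
        g, _ = self.rnn(seq)                   # global sample feature sequence
        recognition = self.decode(g)           # sample text recognition output
        f2 = self.local2(x).flatten(1)         # second local sample feature map
        semantic = self.semantic(f2)           # sample text semantic output
        return recognition, semantic

model = TextImageModel(num_classes=37, num_semantic_classes=10)
rec, sem = model(torch.zeros(2, 1, 32, 100))   # rec: (2, 100, 37), sem: (2, 10)
```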
According to an embodiment of the invention, the deep learning model for text recognition may be one of: a text recognition model based on CRNN (Convolutional Recurrent Neural Network) and a text recognition model based on an encoder-decoder. The CRNN may include a convolutional layer, a recurrent layer, and a transcription layer. The encoder-decoder may be symmetrical or asymmetrical.
According to an embodiment of the present invention, the CRNN-based text recognition model may include at least one of: a CTC (Connectionist Temporal Classification)-based CRNN model, an attention-based CRNN model, and an ACE (Aggregation Cross-Entropy)-based CRNN model. The encoder-decoder-based text recognition model may include a sequence-to-sequence text recognition model.
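As a concrete instance of the CTC-style sequence decoding mentioned above, here is the standard greedy (best-path) CTC decoder; this is the generic algorithm, not code from the patent:

```python
def ctc_greedy_decode(logits_per_step, blank=0):
    """Greedy (best-path) CTC decoding: take the argmax class at each time step,
    collapse consecutive repeats, then drop blanks. `logits_per_step` is a list
    of per-class score lists, one per column of the feature sequence."""
    best = [max(range(len(step)), key=step.__getitem__) for step in logits_per_step]
    out, prev = [], blank
    for c in best:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out

# Example: classes {0: blank, 1: 'a', 2: 'b'}; the four columns decode to "ab".
steps = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.2, 0.1, 0.7]]
assert ctc_greedy_decode(steps) == [1, 2]
```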
According to an embodiment of the invention, the text semantic understanding deep learning model may include at least one of: a convolutional neural network-based text semantic understanding model, a recurrent neural network-based text semantic understanding model, and a Transformer-based text semantic understanding model.
According to the embodiment of the invention, the training mode of the deep learning model can be configured according to actual service requirements, and the training mode is not limited herein. For example, the training regimen may include at least one of: unsupervised training, supervised training, and semi-supervised training.
According to the embodiment of the invention, the sample text image set may be divided into at least one sample text image subset according to the sample text output results and sample labels of the sample text images. For example, the at least one sample text image subset may comprise the first sample text image subset, and may further comprise a second sample text image subset, whose sample text images are those whose sample text output results are erroneous.
According to the embodiment of the invention, for a to-be-cropped sample text image in the to-be-cropped sample text image set, a plurality of candidate cropping positions may be determined according to the sample text output result of that image, and at least one target cropping position is determined from the candidates. For example, at least one target cropping position may be chosen at random from the candidates. Alternatively, the positions corresponding to at least one target character may be identified among the candidate cropping positions and determined as the at least one target cropping position.
According to the embodiment of the invention, each to-be-cropped sample text image may then be cropped based on its at least one corresponding target cropping position, yielding at least one cropped sample text image.
According to the embodiment of the invention, after the cropped sample text images corresponding to every to-be-cropped sample text image in the to-be-cropped set have been obtained, they may be combined to obtain at least one combined sample text image.
According to an embodiment of the present invention, obtaining the target sample text image set from the at least one cropped sample text image subset and the at least one sample text image subset may include: obtaining the target set from the sample text image subsets other than the first subset, the sample text images of the first subset other than those in the to-be-cropped set, and the at least one combined sample text image. Alternatively, the target sample text image set may be derived from the sample text image set and the at least one combined sample text image.
According to the embodiment of the invention, the text image generating method of the embodiment of the invention can be executed by the electronic equipment. For example, the electronic device may be a server or a terminal device. The electronic device may include at least one processor. The processor may be configured to perform the text image generating method provided by the embodiment of the present invention. For example, the text image generating method provided by the embodiment of the present invention may be performed by a single processor, or may be performed in parallel by a plurality of processors.
According to the embodiment of the invention, the target cropping position set is determined from the sample text output result set of the to-be-cropped sample text image set, the to-be-cropped set is determined from the first sample text image subset, and the images in the first subset are those whose sample text output results are correct, as established by comparing the output result set of the whole sample set against its label set. This chain effectively ensures the accuracy of the target cropping positions and effectively prevents character information from being damaged. In addition, because the target sample text image set is obtained from the at least one sample text image subset together with the at least one cropped subset produced by cropping at the target positions, the background complexity and image diversity of the sample text images in the target set are increased, yielding a target set with richer context information. Training and optimizing a subsequent model with this target set therefore reduces the number of training iterations and increases training speed, which in turn reduces the data processing load and resource consumption of the electronic equipment and improves its intrinsic performance, thereby strengthening the core competitiveness of the electronic equipment.
According to an embodiment of the present invention, the above text image generating method may further include the following operations.
Data enhancement processing is performed on an original sample text image set to obtain an intermediate sample text image set, and the sample text image set is obtained from the original sample text image set and the intermediate sample text image set.
According to an embodiment of the present invention, the set of original sample text images may comprise at least one original sample text image. The data enhancement may include at least one of: supervised data enhancement and unsupervised data enhancement. The supervised data enhancement may include at least one of: single sample data enhancement and multiple sample data enhancement. The unsupervised data enhancement may include at least one of: data enhancement to generate new data and data enhancement to learn enhancement policies.
According to an embodiment of the invention, the single sample data enhancement may comprise at least one of: a geometric transformation class and a color transformation class. The geometric transformation class may include at least one of: flipping, rotating, random cropping, deforming, and scaling, etc. The color transformation class may include at least one of: noise, blurring, color transformation, erasure, padding, etc.
According to an embodiment of the present invention, the multi-sample data enhancement may include at least one of: SMOTE (Synthetic Minority Over-sampling Technique), SamplePairing, Mixup, Cutout, CutMix, FMix, ROIMix, and the like.
According to an embodiment of the invention, data enhancement that generates new data may include enhancement based on a generative adversarial network model. Data enhancement that learns an enhancement policy may include automatic data enhancement.
According to the embodiment of the invention, for an original sample text image in the original sample text image set, data enhancement may be performed to obtain at least one intermediate sample text image corresponding to it. The data enhancements applied to different original images may be mutually different, partially identical, or wholly identical. For example, the original sample text image set may include an original sample text image A and an original sample text image B. Geometric-transformation-class enhancement may be applied to image A to obtain at least one intermediate sample text image corresponding to A, while color-transformation-class enhancement may be applied to image B to obtain at least one intermediate sample text image corresponding to B.
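A short Pillow sketch of this per-image strategy, applying geometric-class enhancement to one original image and color-class enhancement to another; the particular transforms and their parameters are illustrative choices, not values from the patent:

```python
from PIL import Image, ImageEnhance, ImageFilter, ImageOps

def augment_geometric(img: Image.Image) -> list[Image.Image]:
    """Geometric-transformation-class enhancement (e.g. for original image A):
    a small rotation and a horizontal flip, two items from the list above."""
    return [img.rotate(5, expand=True), ImageOps.mirror(img)]

def augment_color(img: Image.Image) -> list[Image.Image]:
    """Color-transformation-class enhancement (e.g. for original image B):
    blur and color-saturation change, two items from the list above."""
    return [img.filter(ImageFilter.GaussianBlur(1)),
            ImageEnhance.Color(img).enhance(0.5)]

# image_a, image_b stand in for the original sample text images A and B.
image_a = Image.new("RGB", (200, 32), "white")
image_b = Image.new("RGB", (200, 32), "white")
intermediates = augment_geometric(image_a) + augment_color(image_b)
```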
According to an embodiment of the present invention, obtaining a sample text image set from an original sample text image set and an intermediate sample text image set may include: the set of intermediate sample text images is determined as a set of sample text images. Alternatively, at least part of the original sample text image set and at least part of the intermediate sample text image set are determined as sample text image sets.
According to the embodiment of the invention, performing different data enhancement on different original sample text images effectively ensures the image diversity of the third sample text images in the third sample text image subset (described below). On this basis, training the deep learning model with the third sample text image subset can improve the generalization performance of the model.
According to an embodiment of the present invention, obtaining a sample text image set from an original sample text image set and an intermediate sample text image set may include the following operations.
For an original sample text image in the original sample text image set, when its height is determined not to be the predetermined height, the height is adjusted to the predetermined height while keeping the aspect ratio unchanged, giving an adjusted original sample text image. Likewise, for an intermediate sample text image in the intermediate sample text image set whose height is determined not to be the predetermined height, the height is adjusted to the predetermined height while keeping the aspect ratio unchanged, giving an adjusted intermediate sample text image. The sample text image set is then obtained from at least one of: the original sample text image set, the at least one adjusted original sample text image, the intermediate sample text image set, and the at least one adjusted intermediate sample text image.
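As an illustration of this height adjustment, a short sketch with Pillow; the target height of 32 pixels and the bilinear filter are assumed examples, not values fixed by the patent:

```python
from PIL import Image

def normalize_height(img: Image.Image, target_h: int = 32) -> Image.Image:
    """Resize to the predetermined height while keeping the aspect ratio
    unchanged; images already at the target height are returned as-is."""
    w, h = img.size
    if h == target_h:
        return img
    new_w = max(1, round(w * target_h / h))
    return img.resize((new_w, target_h), Image.BILINEAR)
```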
According to an embodiment of the present invention, operation S210 may include the following operations.
The sample text output result set of the sample text image set is compared with the sample label set to obtain a comparison result, and the sample text image set is divided into at least one sample text image subset based on that comparison result.
According to an embodiment of the present invention, the comparison result may include that the relationship between the two objects satisfies the predetermined matching condition and that the relationship between the two objects does not satisfy the predetermined matching condition. The two objects may refer to a sample text output result and a sample tag. The predetermined matching condition may be configured according to actual service requirements, which is not limited herein. For example, the predetermined matching condition may include that two objects match.
According to the embodiment of the invention, for the sample text image in the sample text image set, the sample text output result of the sample text image and the sample label can be compared to obtain the comparison result corresponding to the sample text image. The sample text image may be partitioned into sample text image subsets corresponding to the comparison result based on the comparison result corresponding to the sample text image.
According to an embodiment of the invention, the set of sample text images may comprise a plurality of sample text images. The at least one sample text image subset may further comprise a second sample text image subset.
According to an embodiment of the present invention, dividing the sample text image set into at least one sample text image subset according to the comparison result may include the following operations.
For a sample text image of the plurality of sample text images, determining the sample text image as a sample text image in the first subset of sample text images in a case where it is determined that a relationship between a sample text output result of the sample text image and a sample label satisfies a predetermined matching condition. In a case where it is determined that the relation between the sample text output result of the sample text image and the sample label does not satisfy the predetermined matching condition, the sample text image is determined as the sample text image in the second sample text image subset.
According to an embodiment of the present invention, the predetermined matching condition may refer to a criterion for dividing the sample text image subset. The predetermined matching condition may include a difference between the sample text output result and the sample tag being less than or equal to a predetermined difference threshold. The predetermined difference threshold may be configured according to actual service requirements, and is not limited herein. For example, the predetermined difference threshold may be 0.1.
According to an embodiment of the present invention, the sample text image in the first subset of sample text images may refer to a sample text image whose sample text output result is a correct sample text output result. The sample text image in the second sample text image subset may refer to a sample text image whose sample text output result is an erroneous sample text output result.
According to an embodiment of the present invention, it is determined, for a sample text image of a plurality of sample text images, whether a difference between a sample text output result of the sample text image and a sample label is less than or equal to a predetermined difference threshold. In the case where it is determined that the difference between the sample text output result of the sample text image and the sample label is less than or equal to the predetermined difference threshold, the sample text image may be determined to be a sample text image in the first sample text image subset. In case it is determined that the difference between the sample text output result of the sample text image and the sample label is greater than a predetermined difference threshold, the sample text image may be determined as a sample text image in the second subset of sample text images.
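For illustration, a minimal sketch of this partition rule; the specific difference measure used here (one minus a character-level similarity ratio) is an assumption, since the text only requires some difference compared against a predetermined threshold:

```python
from difflib import SequenceMatcher

def partition(samples, diff_threshold=0.1):
    """Assign each (output, label) pair to the first subset when the difference
    between the sample text output result and the sample label is <= the
    predetermined difference threshold, otherwise to the second subset."""
    first, second = [], []
    for output, label in samples:
        diff = 1.0 - SequenceMatcher(None, output, label).ratio()
        (first if diff <= diff_threshold else second).append((output, label))
    return first, second

first_subset, second_subset = partition([("hello", "hello"), ("hxyzo", "hello")])
```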
According to the embodiment of the invention, the target cropping position set is determined from the sample text output result set of the to-be-cropped sample text image set, the to-be-cropped set is determined from the first sample text image subset, and the first sample text images are those whose relation between sample text output result and sample label satisfies the predetermined matching condition. This effectively ensures the accuracy of the target cropping positions and effectively prevents character information from being damaged.
According to an embodiment of the invention, the first sample text image subset may comprise a plurality of first sample text images.
According to an embodiment of the invention, the to-be-cropped sample text image set may be determined as follows: for a first sample text image of the plurality, the image is determined to be a to-be-cropped sample text image in the to-be-cropped set if its predetermined probability value is less than or equal to the predetermined probability threshold.
According to an embodiment of the present invention, the predetermined probability value and the predetermined probability threshold serve as the basis for deciding whether a first sample text image belongs to the to-be-cropped sample text image set. Both may be configured according to actual service requirements and are not limited here. The predetermined probability value may be a number greater than or equal to 0 and less than 1; the predetermined probability threshold may be a number greater than or equal to 0 and less than or equal to 1. For example, the predetermined probability threshold may be determined from model characteristics of the deep learning model, where model characteristics may include at least one of the complexity, fit, and versatility of the model structure: if the model structure is more versatile, more complex, or easier to overfit, a larger threshold may be configured; if it is less versatile, less complex, or easier to underfit, a smaller threshold may be configured.
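A sketch of this selection rule, with the threshold simply passed in as a constant; deriving the threshold from the model characteristics described above is left abstract here, and the drawn-at-random probability value is one plausible reading of the scheme:

```python
import random

def select_to_crop(first_subset, crop_threshold, rng=random.Random(42)):
    """Each first-subset image draws a probability value in [0, 1); it joins the
    to-be-cropped set when that value is <= the predetermined threshold."""
    return [img for img in first_subset if rng.random() <= crop_threshold]

to_crop = select_to_crop(["img_1", "img_2", "img_3"], crop_threshold=0.6)
```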
According to an embodiment of the present invention, the set of sample text images to be cropped may comprise a plurality of sample text images to be cropped.
According to an embodiment of the present invention, operation S220 may include the following operations.
For each to-be-cropped sample text image in the to-be-cropped sample text image set, at least one target cropping position is determined from a plurality of candidate cropping positions according to the sample text output result of that image.
According to the embodiment of the invention, the plurality of candidate cropping positions may be determined according to the sample text output result of the to-be-cropped sample text image, and at least one target cropping position may then be randomly determined from among them.
According to the embodiment of the invention, randomly determining at least one target cropping position from the plurality of candidate cropping positions improves the image diversity of the sample text images.
According to an embodiment of the invention, the set of sample text images may comprise a plurality of sample text images.
According to the embodiment of the invention, the sample text recognition output result can be obtained by performing sequence decoding on the global sample feature sequence of the sample text image. The global sample feature sequence may be obtained by performing global feature extraction on a first local sample feature map of the sample text image. The first local sample feature map may be obtained by performing first local feature extraction on the sample text image.
According to the embodiment of the invention, the sample text semantic output result can be obtained by carrying out semantic understanding on the second local sample feature map of the sample text image. The second local sample feature map may be obtained by performing second local feature extraction on the sample text image.
According to the embodiment of the invention, a CRNN-based text recognition model may be used to process the sample text image to obtain the sample text recognition output result. The CRNN may include a convolutional layer, a recurrent layer, and a transcription layer: the convolutional layer processes the sample text image into the first local sample feature map, the recurrent layer processes that map into the global sample feature sequence, and the transcription layer processes the sequence into the sample text recognition output result.
According to an embodiment of the present invention, in the case where the sample text output result includes a sample text recognition output result and a sample text semantic output result, determining the at least one target cropping position from the plurality of candidate cropping positions according to the sample text output result of the to-be-cropped sample text image may include the following operations.
A plurality of candidate cropping positions is determined according to the sample text recognition output result of the to-be-cropped sample text image, and at least one target cropping position is then determined from the candidates according to the sample text semantic output result of that image.
According to an embodiment of the present invention, for example, the sample text recognition output result of the to-be-cropped sample text image may be the character sequence 今天去上班 ("going to work today"). Four candidate cropping positions may be determined from this recognition result: between 今 and 天, between 天 and 去, between 去 and 上, and between 上 and 班. From the sample text semantic output result it can be determined that 今天 ("today") should not be split and that 上班 ("to work") should not be split. Therefore, two target cropping positions, namely the candidate position between 天 and 去 and the candidate position between 去 and 上, can be determined from the four candidate positions.
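The example above can be written as a small routine: candidate positions are the boundaries between adjacent recognised characters, and the semantic output contributes "do not split" spans that filter them. Representing the semantic units as (start, end) index spans is an assumption made for illustration:

```python
def target_positions(chars, keep_together):
    """Candidate positions are the boundaries between adjacent recognised
    characters; a candidate is kept as a target position only if it does not
    split any semantic unit. `keep_together` lists (start, end) index spans
    that the semantic output result says must stay intact."""
    candidates = range(1, len(chars))   # boundary i splits chars[:i] / chars[i:]
    return [i for i in candidates
            if not any(start < i < end for start, end in keep_together)]

# "今天去上班": units 今天 (chars 0-2) and 上班 (chars 3-5) must not be split.
positions = target_positions(list("今天去上班"), keep_together=[(0, 2), (3, 5)])
assert positions == [2, 3]   # between 天/去 and between 去/上
```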
According to the embodiment of the invention, at least one target clipping position is determined from a plurality of candidate clipping positions according to the sample text semantic output result of the sample text image to be clipped, so that the accuracy of the target clipping position is improved.
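A minimal Python sketch of this two-stage selection is given below; representing the semantic output as a set of inseparable character pairs is a simplifying assumption for illustration.

```python
def select_target_positions(candidates, recognized_text, inseparable_pairs):
    # candidates[i] sits between recognized_text[i] and recognized_text[i + 1].
    # Drop any candidate that would split a semantically inseparable pair.
    return [
        pos for i, pos in enumerate(candidates)
        if (recognized_text[i], recognized_text[i + 1]) not in inseparable_pairs
    ]

# Worked example from the text: "今天去上班" with candidates between adjacent
# characters; "今天" and "上班" must not be split.
candidates = [30, 60, 90, 120]  # illustrative pixel columns
keep = select_target_positions(
    candidates, "今天去上班", {("今", "天"), ("上", "班")}
)
print(keep)  # -> [60, 90]
```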
According to an embodiment of the present invention, operation S230 may include the following operations.
And cutting the sample text image set to be cut based on the target cutting position set to obtain a first cutting sample text image subset and a second cutting sample text image subset.
According to an embodiment of the invention, the first cut sample text image subset may comprise at least one first cut sample text image. The second cropped sample text image subset may comprise at least one second cropped sample text image. The at least one target clipping position corresponding to the sample text image to be clipped may include a first target clipping position and a second target clipping position.
According to the embodiment of the invention, for the sample text image to be cut in the sample text image set to be cut, cutting can be performed based on the first target cutting position corresponding to the sample text image to be cut, so as to obtain a first cutting sample text image corresponding to the sample text image to be cut. And cutting the sample text image to be cut based on a second target cutting position corresponding to the sample text image to be cut, so as to obtain a second cut sample text image corresponding to the sample text image to be cut.
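For illustration, a minimal Python sketch of such a two-position crop follows. The assumption that the first cut keeps the right-hand part and the second keeps the left-hand part mirrors the direction convention described below; the patent does not fix pixel-level details.

```python
from PIL import Image

def crop_pair(image, first_pos, second_pos):
    # One reading of the scheme: cutting at the first target position keeps
    # the part to its right, cutting at the second keeps the part to its left.
    w, h = image.size
    first_crop = image.crop((first_pos, 0, w, h))    # right of first_pos
    second_crop = image.crop((0, 0, second_pos, h))  # left of second_pos
    return first_crop, second_crop
```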
According to an embodiment of the present invention, operation S240 may include the following operations.
And obtaining a third sample text image subset according to at least one clipping sample text image subset. And obtaining a target sample text image set according to the at least one sample text image subset and the third sample text image subset.
According to an embodiment of the present invention, at least one of the cropped sample text image subsets may be combined to obtain a third sample text image subset. A target sample text image set may be derived from the second sample text image subset and the third sample text image subset.
According to an embodiment of the present invention, the obtaining of the third sample text image subset from at least one clipping sample text image subset may comprise the following operations.
And combining the clipping sample text images in the at least one clipping sample text image subset based on a predetermined combination strategy to obtain a third sample text image subset.
According to an embodiment of the present invention, the predetermined combination policy may refer to a policy for combining the cut sample text images. For example, the predetermined combination policy may include at least one of: random combining strategies and fixed combining strategies. The third sample text image subset may include at least one third sample text image. The third sample text image may be the same as or different from the sample text image in the set of sample text images.
According to an embodiment of the present invention, for a clipping sample text image subset of at least one clipping sample text image subset, for clipping sample text images of the clipping sample text image subset, the clipping sample text image may be combined with clipping sample text images of other clipping sample text image subsets to obtain at least one third sample text image. The other clip sample text image subset may be any one or more of the at least one clip sample text image subset other than the clip sample text image subset.
For example, the at least one cut sample text image subset may include a first cut sample text image subset and a second cut sample text image subset. The first cut sample text image subset may characterize the cut sample text image subset of the first direction, and the second cut sample text image subset may characterize the cut sample text image subset of the second direction. The first direction may refer to the right direction, and the second direction may refer to the left direction. For a first cut sample text image of the first cut sample text image subset, the first cut sample text image may be combined with at least one second cut sample text image of the second cut sample text image subset to obtain at least one third sample text image.
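A minimal Python sketch of such a combination step, under the assumption that crops are pasted side by side on a white canvas, might look as follows.

```python
import random
from PIL import Image

def combine(left, right):
    # Paste two crops side by side on a shared-height canvas.
    h = max(left.height, right.height)
    canvas = Image.new("RGB", (left.width + right.width, h), "white")
    canvas.paste(left, (0, 0))
    canvas.paste(right, (left.width, 0))
    return canvas

def third_subset(first_crops, second_crops, strategy="random"):
    # Random strategy: shuffle the partners; fixed strategy: pair by index.
    partners = list(second_crops)
    if strategy == "random":
        random.shuffle(partners)
    return [combine(a, b) for a, b in zip(first_crops, partners)]
```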
According to the embodiment of the invention, the third sample text image subset is obtained by combining the cut sample text images in the at least one cut sample text image subset based on the predetermined combination strategy, which realizes random combination of the cut sample text images and improves the image background complexity and image diversity of the third sample text images in the third sample text image subset. On this basis, training the deep learning model with the third sample text image subset can improve the generalization performance of the model.
According to an embodiment of the present invention, the above text image generating method may further include the following operations.
And clipping the sample label set of the sample text image set to be clipped based on the target clipping position set to obtain at least one clipping sample label subset. And obtaining a target sample label set according to the sample label subset corresponding to the at least one sample text image subset and the at least one clipping sample label subset.
According to an embodiment of the present invention, obtaining the target sample label set according to the sample label subset corresponding to the at least one sample text image subset and the at least one clipping sample label subset may include the following operations.
And according to at least one clipping sample label subset, obtaining a sample label subset corresponding to the third sample text image subset. And obtaining a target sample label set according to the sample label subset corresponding to the at least one sample text image subset and the sample label subset corresponding to the third sample text image subset.
According to an embodiment of the present invention, obtaining the sample label subset corresponding to the third sample text image subset according to the at least one clipping sample label subset may include the following operations.
And combining the clipping sample labels in the at least one clipping sample label subset based on a preset combination strategy to obtain a sample label subset corresponding to the third sample text image subset.
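For illustration, a minimal Python sketch of the corresponding label handling follows, assuming labels are plain strings cut and concatenated at the same logical positions as their images.

```python
def split_label(label, char_index):
    # Cut the text label at the same logical position as the image cut.
    return label[:char_index], label[char_index:]

def combine_labels(label_a, label_b):
    # Labels are concatenated in the same order as their image crops.
    return label_a + label_b

left, right = split_label("今天去上班", 2)  # -> ("今天", "去上班")
print(combine_labels(left, right))
```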
The text image generating method according to the embodiment of the present invention will be further described with reference to fig. 3A, 3B, 3C, 3D and 3E in conjunction with the specific embodiments.
Fig. 3A schematically illustrates a schematic diagram of a text image generating method according to an embodiment of the present invention.
As shown in fig. 3A, in 300A, a set of sample text images 303 is divided into a first subset of sample text images 303_1 and a second subset of sample text images 303_2 based on a set of sample text output results 301 and a set of sample labels 302 of the set of sample text images. A set of sample text images to be cropped 304 is determined from the first subset of sample text images 303_1.
A set of target clipping positions 306 for the set of sample text images to be cropped 304 is determined from the set of sample text output results 305 of the set of sample text images to be cropped 304. The set of sample text images to be cropped 304 is cropped based on the set of target clipping positions 306 to obtain at least one cropped sample text image subset 307. A target sample text image set 308 is derived from the at least one cropped sample text image subset 307, the first sample text image subset 303_1, and the second sample text image subset 303_2.
Fig. 3B schematically shows an example schematic diagram of a generation process of a third sample text image subset according to an embodiment of the invention.
As shown in fig. 3B, in 300B, the set of sample text images to be cropped 309 may include sample text image to be cropped 309_1 and sample text image to be cropped 309_2.
From the sample text output result of the sample text image to be cut 309_1, the target cutting position determined from the plurality of candidate cutting positions is the position between "婴" and "百". Cutting the sample text image to be cut 309_1 based on the target cutting position results in a cut sample text image 309_1_1 and a cut sample text image 309_1_2. The cut sample text image 309_1_1 is a sample text image corresponding to "母婴" (mother and infant). The cut sample text image 309_1_2 is a sample text image corresponding to "百" (hundred).
From the sample text output result of the sample text image to be cut 309_2, the target cutting position determined from the plurality of candidate cutting positions is the position between "转" and "让". Cutting the sample text image to be cut 309_2 based on the target cutting position results in a cut sample text image 309_2_1 and a cut sample text image 309_2_2. The cut sample text image 309_2_1 is a sample text image corresponding to "转" (transfer). The cut sample text image 309_2_2 is a sample text image corresponding to "让" (let).
Based on a predetermined combination policy, the cut sample text image 309_1_1 and the cut sample text image 309_2_2 are combined, resulting in a third sample text image 310_1 in the third sample text image subset 310, and the cut sample text image 309_1_2 and the cut sample text image 309_2_1 are combined, resulting in a third sample text image 310_2 in the third sample text image subset 310. The third sample text image 310_1 is a sample text image corresponding to "母婴让" ("mother and infant let"). The third sample text image 310_2 is a sample text image corresponding to "转百" ("transfer hundred").
Fig. 3C schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the invention.
As shown in fig. 3C, in 300C, unlike fig. 3B, the third sample text image 311_1 is a sample text image corresponding to "让母婴" ("let mother and infant"). The third sample text image 311_2 is a sample text image corresponding to "百转" ("hundred transfer").
Fig. 3D schematically shows an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the invention.
As shown in fig. 3D, in 300D, unlike fig. 3B, based on a predetermined combination policy, the cut sample text image 309_1_1 and the cut sample text image 309_2_1 are combined to obtain a third sample text image 312_1 in the third sample text image subset 312, and the cut sample text image 309_1_2 and the cut sample text image 309_2_2 are combined to obtain a third sample text image 312_2 in the third sample text image subset 312. The third sample text image 312_1 is a sample text image corresponding to "母婴转" ("mother and infant transfer"). The third sample text image 312_2 is a sample text image corresponding to "百让" ("hundred let").
Fig. 3E schematically shows an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the invention.
As shown in fig. 3E, in 300E, unlike fig. 3D, the third sample text image 313_1 is a sample text image corresponding to "转母婴" ("transfer mother and infant"). The third sample text image 313_2 is a sample text image corresponding to "让百" ("let hundred").
Fig. 4 schematically shows a flow chart of a training method of a deep learning model according to an embodiment of the invention.
As shown in FIG. 4, the method 400 may include operations S410-S420.
In operation S410, a target sample text image set is acquired.
In operation S420, the deep learning model is trained using the target sample text image set to obtain a text image processing model.
According to the embodiment of the invention, the target sample text image set can be obtained by the text image generation method according to the embodiment of the invention.
According to the embodiment of the invention, the target cutting position set of the sample text image set to be cut is determined according to the sample text output result set of the sample text image set to be cut; the sample text image set to be cut is determined according to the first sample text image subset; and the first sample text image subset is determined from the sample text image set according to the sample text output result set and the sample label set of the sample text image set and includes sample text images whose sample text output results are correct. The accuracy of the target cutting positions can therefore be effectively ensured, and character information can be effectively prevented from being damaged. On this basis, a target sample text image set with richer contextual information can be obtained according to the at least one clipping sample text image subset and the at least one sample text image subset. Performing training and optimization of a subsequent model with this target sample text image set reduces the number of model iterations and increases the training speed of the model, thereby reducing the data processing load and resource consumption of the electronic device, improving the internal performance of the electronic device, and enhancing its core competitiveness.
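As an illustration of operations S410-S420, a generic supervised training loop in PyTorch is sketched below; the batch size, optimizer, and task-dependent loss (e.g., CTC for recognition) are assumptions, not details fixed by the patent.

```python
import torch
from torch.utils.data import DataLoader

def train_text_model(model, target_sample_dataset, loss_fn, epochs=10, lr=1e-3):
    # S410: target_sample_dataset yields (image, label) pairs built by the
    # generation method; S420: ordinary supervised training on those pairs.
    loader = DataLoader(target_sample_dataset, batch_size=32, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model  # the resulting text image processing model
```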
Fig. 5 schematically shows a flow chart of a text image processing method according to an embodiment of the invention.
As shown in FIG. 5, the method 500 includes operations S510-S520.
In operation S510, a text image to be processed is acquired.
In operation S520, the text image to be processed is input into the text image processing model, and a text image processing result is obtained.
According to the embodiment of the invention, the text image processing model can be trained by the training method of the deep learning model.
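A minimal inference sketch of operations S510-S520 follows, assuming the model is a PyTorch module and the text image to be processed arrives as a CHW tensor.

```python
import torch

def process_text_image(model, image_tensor):
    # S510/S520: feed one to-be-processed text image through the trained
    # text image processing model and return its output.
    model.eval()
    with torch.no_grad():
        return model(image_tensor.unsqueeze(0))  # add a batch dimension
```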
In the technical scheme of the invention, the collection, storage, use, processing, transmission, provision, and disclosure of the personal information of users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
The above is only an exemplary embodiment and is not limiting; other text image generation methods, training methods of a deep learning model, and text image processing methods known in the art may also be included, as long as the accuracy of the target clipping position can be effectively ensured and a target sample text image set with richer contextual information can be obtained.
Fig. 6 schematically shows a block diagram of a text image generating apparatus according to an embodiment of the present invention.
As shown in fig. 6, the text image generating apparatus 600 may include a dividing module 610, a determining module 620, a first obtaining module 630, and a second obtaining module 640.
A dividing module 610 is configured to divide the sample text image set into at least one sample text image subset according to the sample text output result set and the sample label set of the sample text image set. The at least one sample text image subset includes a first sample text image subset. The first sample text image subset includes sample text images for which the sample text output results are correct.
The determining module 620 is configured to determine a target clipping location set of the sample text image set to be clipped according to the sample text output result set of the sample text image set to be clipped. The set of sample text images to be cropped is determined from the first subset of sample text images.
The first obtaining module 630 is configured to crop the sample text image set to be cropped based on the target cropping location set, so as to obtain at least one cropped sample text image subset.
A second obtaining module 640 is configured to obtain a target sample text image set according to at least one clipping sample text image subset and at least one sample text image subset.
According to an embodiment of the present invention, the dividing module 610 may include a comparing sub-module and a dividing sub-module.
And the comparison sub-module is used for comparing the sample text output result set of the sample text image set with the sample label set to obtain a comparison result.
And the dividing sub-module is used for dividing the sample text image set into at least one sample text image subset according to the comparison result.
According to an embodiment of the invention, the set of sample text images comprises a plurality of sample text images, the at least one subset of sample text images further comprising a second subset of sample text images.
According to an embodiment of the present invention, for a sample text image of the plurality of sample text images, the dividing sub-module may include a first determining unit and a second determining unit.
A first determining unit configured to determine the sample text image as a sample text image in the first sample text image subset, in a case where it is determined that a relation between a sample text output result of the sample text image and the sample label satisfies a predetermined matching condition.
And a second determining unit configured to determine the sample text image as a sample text image in the second sample text image subset, in a case where it is determined that a relation between a sample text output result of the sample text image and the sample label does not satisfy a predetermined matching condition.
According to an embodiment of the present invention, the set of sample text images to be cropped may comprise a plurality of sample text images to be cropped.
According to embodiments of the present invention, the determination module 620 may include a determination sub-module for a sample text image to be cropped in a set of sample text images to be cropped.
And the determining submodule is used for determining at least one target cutting position from a plurality of candidate cutting positions according to a sample text output result of the sample text image to be cut.
According to an embodiment of the present invention, the sample text output result may include at least one of: sample text recognition output results and sample text semantic output results.
According to an embodiment of the invention, the set of sample text images may comprise a plurality of sample text images.
According to the embodiment of the invention, the sample text recognition output result can be obtained by performing sequence decoding on the global sample feature sequence of the sample text image. The global sample feature sequence may be obtained by performing global feature extraction on a first local sample feature map of the sample text image. The first local sample feature map may be obtained by performing first local feature extraction on the sample text image.
According to the embodiment of the invention, the sample text semantic output result can be obtained by carrying out semantic understanding on the second local sample feature map of the sample text image. The second local sample feature map may be obtained by performing second local feature extraction on the sample text image.
According to an embodiment of the present invention, in the case where the sample text output result includes the sample text recognition output result and the sample text semantic output result, the determining sub-module may include a third determining unit and a fourth determining unit.
And the third determining unit is used for determining a plurality of candidate cutting positions according to the sample text recognition output result of the sample text image to be cut.
And the fourth determining unit is used for determining at least one target clipping position from the plurality of candidate clipping positions according to the sample text semantic output result of the sample text image to be clipped.
According to an embodiment of the invention, the first obtaining module 630 may include a first obtaining sub-module.
The first obtaining submodule is used for cutting the sample text image set to be cut based on the target cutting position set to obtain a first cutting sample text image subset and a second cutting sample text image subset.
According to an embodiment of the present invention, the second obtaining module 640 may include a second obtaining sub-module and a third obtaining sub-module.
And the second obtaining submodule is used for obtaining a third sample text image subset according to at least one clipping sample text image subset.
And a third obtaining sub-module, configured to obtain a target sample text image set according to the at least one sample text image subset and the third sample text image subset.
According to an embodiment of the invention, the second obtaining sub-module may comprise an obtaining unit.
And the obtaining unit is used for combining the clipping sample text images in the at least one clipping sample text image subset based on a preset combination strategy to obtain a third sample text image subset.
According to an embodiment of the invention, the first sample text image subset may comprise a plurality of first sample text images.
According to an embodiment of the invention, the set of sample text images to be cropped may be determined by:
for a first sample text image of the plurality of first sample text images,
in the case that the predetermined probability value of the first sample text image is determined to be less than or equal to a predetermined probability threshold, the first sample text image is determined as a sample text image to be cropped in the sample text image set to be cropped.
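For illustration, the per-image probability test might be sketched as follows; modeling the predetermined probability value as a fresh uniform draw per image is an assumption made here for simplicity.

```python
import random

def select_images_to_crop(first_subset, probability_threshold=0.5):
    # Keep a first sample text image for cropping when its drawn value
    # is at or below the predetermined probability threshold.
    return [img for img in first_subset
            if random.random() <= probability_threshold]
```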
According to an embodiment of the present invention, the text image generating apparatus may further include a third obtaining module and a fourth obtaining module.
And the third obtaining module is used for carrying out data enhancement processing on the original sample text image set to obtain an intermediate sample text image set.
And the fourth obtaining module is used for obtaining the sample text image set according to the original sample text image set and the intermediate sample text image set.
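A minimal Python sketch of this enhancement step follows; the specific transforms (slight rotation, auto-contrast) are illustrative assumptions, since the patent does not fix the data enhancement processing.

```python
from PIL import ImageOps

def enhance(original_images):
    # Illustrative augmentations only: each original sample text image
    # yields one intermediate sample text image here.
    intermediate = []
    for img in original_images:
        aug = img.rotate(2, expand=True, fillcolor="white")  # slight rotation
        aug = ImageOps.autocontrast(aug.convert("L"))        # contrast stretch
        intermediate.append(aug)
    return intermediate

def build_sample_set(original_images):
    # Sample text image set = originals plus their enhanced intermediates.
    return list(original_images) + enhance(original_images)
```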
According to an embodiment of the invention, the sample text image set may be a text image set of a text visual task.
Fig. 7 schematically shows a block diagram of a training apparatus of a deep learning model according to an embodiment of the invention.
As shown in fig. 7, the training apparatus 700 of the deep learning model may include a first acquisition module 710 and a fifth acquisition module 720.
A first acquisition module 710 is configured to acquire a target sample text image set.
And a fifth obtaining module 720, configured to train the deep learning model with the target sample text image set, and obtain a text image processing model.
According to the embodiment of the invention, the target sample text image set may be obtained by the text image generating apparatus according to the embodiment of the invention.
Fig. 8 schematically shows a block diagram of a text image processing apparatus according to an embodiment of the present invention.
As shown in fig. 8, the text image processing apparatus 800 may include a second acquisition module 810 and a sixth obtaining module 820.
A second acquisition module 810, configured to acquire a text image to be processed.
A sixth obtaining module 820, configured to input the text image to be processed into the text image processing model, and obtain a text image processing result.
According to an embodiment of the present invention, the text image processing model may be trained using the training apparatus of the deep learning model according to the embodiment of the present invention.
According to embodiments of the present invention, the present invention also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present invention, an electronic apparatus includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods described above.
According to an embodiment of the present invention, a non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform the methods described above.
According to an embodiment of the invention, a computer program product comprises a computer program which, when executed by a processor, implements a method as described above.
Fig. 9 schematically shows a block diagram of an electronic device adapted to implement the text image generating method, the training method of the deep learning model, and the text image processing method according to an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, for example, a text image generating method, a training method of a deep learning model, and a text image processing method. For example, in some embodiments, the text image generation method, the training method of the deep learning model, and the text image processing method may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the text image generating method, the training method of the deep learning model, and the text image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the text image generation method, the training method of the deep learning model, and the text image processing method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, so long as the desired result of the technical solution of the present disclosure is achieved, and the present disclosure is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.
Claims (28)
1. A text image generation method, comprising:
dividing a sample text image set into a plurality of sample text image subsets according to a sample text output result set and a sample label set of the sample text image set, wherein the plurality of sample text image subsets comprise a first sample text image subset;
determining a target cutting position set of a sample text image set to be cut according to a sample text output result set of the sample text image set to be cut, wherein the sample text image set to be cut is determined according to the first sample text image subset, the sample text output result set comprises at least one sample text output result, the sample text output result comprises a sample text recognition output result, the sample text image set comprises a plurality of sample text images, the sample text recognition output result is obtained by performing sequence decoding on a global sample feature sequence of the sample text images, the global sample feature sequence is obtained by performing global feature extraction on a first local sample feature map of the sample text images, and the first local sample feature map is obtained by performing first local feature extraction on the sample text images;
cutting the sample text image set to be cut based on the target cutting position set to obtain a plurality of cutting sample text image subsets; and
obtaining a target sample text image set according to the plurality of clipping sample text image subsets and at least one sample text image subset;
wherein the dividing the sample text image set into a plurality of sample text image subsets according to the sample text output result set and the sample label set of the sample text image set comprises:
comparing the sample text output result set of the sample text image set with the sample label set to obtain a comparison result; and
dividing the sample text image set into a plurality of sample text image subsets according to the comparison result;
wherein the set of sample text images includes a plurality of sample text images, the plurality of sample text image subsets further including a second sample text image subset;
wherein the dividing the sample text image set into the plurality of sample text image subsets according to the comparison result includes:
for a sample text image of the plurality of sample text images,
determining the sample text image as a sample text image in the first subset of sample text images in the case that it is determined that a relation between a sample text output result of the sample text image and a sample label satisfies a predetermined matching condition, wherein the predetermined matching condition includes a difference between the sample text output result and the sample label being less than or equal to a predetermined difference threshold; and
in a case where it is determined that the relation between the sample text output result of the sample text image and the sample label does not satisfy the predetermined matching condition, determining the sample text image as a sample text image in the second sample text image subset.
2. The method of claim 1, wherein the set of sample text images to be cropped comprises a plurality of sample text images to be cropped;
the determining a target cutting position set of the sample text image set to be cut according to the sample text output result set of the sample text image set to be cut comprises the following steps:
for a sample text image to be cropped in the set of sample text images to be cropped,
and determining at least one target cutting position from a plurality of candidate cutting positions according to a sample text output result of the sample text image to be cut.
3. The method of claim 1, wherein the sample text output result further comprises a sample text semantic output result.
4. A method according to claim 3, wherein the sample text semantic output result is obtained by semantic understanding of a second local sample feature map of the sample text image, the second local sample feature map being obtained by second local feature extraction of the sample text image.
5. A method according to claim 3, wherein, in case the sample text output result comprises the sample text recognition output result and the sample text semantic output result, determining at least one target clipping position from a plurality of candidate clipping positions according to the sample text output result of the sample text image to be clipped comprises:
determining the plurality of candidate cutting positions according to a sample text recognition output result of the sample text image to be cut; and
and determining at least one target clipping position from the plurality of candidate clipping positions according to a sample text semantic output result of the sample text image to be clipped.
6. The method of any one of claims 1-3, wherein the cropping the set of sample text images to be cropped based on the set of target cropping locations to obtain a plurality of cropped sample text image subsets, comprising:
and cutting the sample text image set to be cut based on the target cutting position set to obtain a first cutting sample text image subset and a second cutting sample text image subset.
7. A method according to any one of claims 1-3, wherein said obtaining a target sample text image set from said plurality of cropped sample text image subsets and at least one of said sample text image subsets comprises:
obtaining a third sample text image subset according to the plurality of clipping sample text image subsets; and
and obtaining the target sample text image set according to at least one sample text image subset and the third sample text image subset.
8. The method of claim 7, wherein the deriving a third sample text image subset from the plurality of cropped sample text image subsets comprises:
and combining the clipping sample text images in the plurality of clipping sample text image subsets based on a predetermined combination strategy to obtain the third sample text image subset.
9. The method of any of claims 1-3, wherein the first subset of sample images comprises a plurality of first sample images;
wherein the set of sample text images to be cropped is determined by:
for a first sample image of the plurality of first sample images,
in the case that the predetermined probability value of the first text sample image is less than or equal to the predetermined probability threshold, the first text sample image is determined as a sample text image to be cut in the sample text image set to be cut.
10. The method of any one of claims 1-3, further comprising:
performing data enhancement processing on the original sample text image set to obtain an intermediate sample text image set; and
and obtaining the sample text image set according to the original sample text image set and the intermediate sample text image set.
11. The method of any of claims 1-3, wherein the sample text image set is a text image set of a text visual task.
12. A training method of a deep learning model, comprising:
acquiring a target sample text image set; and
training the deep learning model by using the target sample text image set to obtain a text image processing model,
wherein the set of target sample text images is obtained using the method according to any one of claims 1-11.
13. A text image processing method, comprising:
acquiring a text image to be processed; and
inputting the text image to be processed into a text image processing model to obtain a text image processing result,
wherein the text image processing model is trained using the method of claim 12.
14. A text image generating apparatus comprising:
a dividing module for dividing a sample text image set into a plurality of sample text image subsets according to a sample text output result set and a sample label set of the sample text image set, wherein the plurality of sample text image subsets comprise a first sample text image subset;
a determining module for determining a target clipping position set of a sample text image set to be clipped according to a sample text output result set of the sample text image set to be clipped, wherein the sample text image set to be clipped is determined according to the first sample text image subset, the sample text output result set comprises at least one sample text output result, the sample text output result comprises a sample text recognition output result, the sample text image set comprises a plurality of sample text images, the sample text recognition output result is obtained by performing sequence decoding on a global sample feature sequence of the sample text images, the global sample feature sequence is obtained by performing global feature extraction on a first local sample feature map of the sample text images, and the first local sample feature map is obtained by performing first local feature extraction on the sample text images;
a first obtaining module for clipping the sample text image set to be clipped based on the target clipping position set to obtain a plurality of clipping sample text image subsets; and
a second obtaining module for obtaining a target sample text image set according to the plurality of clipping sample text image subsets and at least one sample text image subset;
wherein the dividing module includes:
a comparing sub-module for comparing the sample text output result set of the sample text image set with the sample label set to obtain a comparison result; and
a dividing sub-module for dividing the sample text image set into the plurality of sample text image subsets according to the comparison result;
wherein the set of sample text images includes a plurality of sample text images, the plurality of sample text image subsets further including a second sample text image subset;
wherein, for a sample text image of the plurality of sample text images, the dividing sub-module includes:
a first determining unit configured to determine the sample text image as a sample text image in the first subset of sample text images, in a case where it is determined that a relation between a sample text output result of the sample text image and a sample label satisfies a predetermined matching condition, wherein the predetermined matching condition includes a difference between the sample text output result and the sample label being less than or equal to a predetermined difference threshold; and
A second determining unit configured to determine the sample text image as a sample text image in the second sample text image subset, in a case where it is determined that a relationship between a sample text output result of the sample text image and a sample label does not satisfy the predetermined matching condition.
15. The apparatus of claim 14, wherein the set of sample text images to be cropped comprises a plurality of sample text images to be cropped;
wherein, for the sample text image to be cut in the sample text image set to be cut, the determining module includes:
and the determining submodule is used for determining at least one target cutting position from a plurality of candidate cutting positions according to a sample text output result of the sample text image to be cut.
16. The apparatus of claim 14, wherein the sample text output result further comprises a sample text semantic output result.
17. The apparatus of claim 16, wherein the sample text semantic output results are semantically understood from a second local sample feature map of the sample text image, the second local sample feature map being derived from a second local feature extraction of the sample text image.
18. The apparatus of claim 16, wherein, in a case where the sample text output result includes the sample text recognition output result and the sample text semantic output result, the determining sub-module comprises:
the third determining unit is used for determining a plurality of candidate cutting positions according to the sample text recognition output result of the sample text image to be cut; and
and a fourth determining unit, configured to determine at least one target clipping position from the plurality of candidate clipping positions according to a sample text semantic output result of the sample text image to be clipped.
19. The apparatus of any one of claims 14-16, wherein the first obtaining module comprises:
and the first obtaining submodule is used for cutting the sample text image set to be cut based on the target cutting position set to obtain a first cutting sample text image subset and a second cutting sample text image subset.
20. The apparatus of any of claims 14-16, wherein the second obtaining module comprises:
the second obtaining submodule is used for obtaining a third sample text image subset according to the plurality of clipping sample text image subsets; and
And a third obtaining sub-module, configured to obtain the target sample text image set according to at least one sample text image subset and the third sample text image subset.
21. The apparatus of claim 20, wherein the second obtaining sub-module comprises:
and the obtaining unit is used for combining the clipping sample text images in the plurality of clipping sample text image subsets based on a preset combination strategy to obtain the third sample text image subset.
22. The apparatus of any of claims 14-16, wherein the first sample text image subset comprises a plurality of first sample text images;
wherein the set of sample text images to be cropped is determined by:
for a first sample text image of the plurality of first sample text images,
in the case that the predetermined probability value of the first sample text image is less than or equal to a predetermined probability threshold, determining the first sample text image as a sample text image to be cut in the sample text image set to be cut.
23. The apparatus according to any one of claims 14-16, further comprising:
the third obtaining module is used for carrying out data enhancement processing on the original sample text image set to obtain an intermediate sample text image set; and
And a fourth obtaining module, configured to obtain the sample text image set according to the original sample text image set and the intermediate sample text image set.
24. The apparatus of any of claims 14-16, wherein the sample text image set is a text image set of a text visual task.
25. A training device for a deep learning model, comprising:
the first acquisition module is used for acquiring a target sample text image set; and
a fifth obtaining module, configured to train the deep learning model by using the target sample text image set to obtain a text image processing model,
wherein the set of target sample text images is obtained with the apparatus according to any one of claims 14 to 24.
26. A text image processing apparatus comprising:
the second acquisition module is used for acquiring a text image to be processed; and
a sixth obtaining module, configured to input the text image to be processed into a text image processing model to obtain a text image processing result,
wherein the text image processing model is trained using the apparatus of claim 25.
27. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-13.
28. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-13.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211015424.6A CN115082598B (en) | 2022-08-24 | 2022-08-24 | Text image generation, training, text image processing method and electronic equipment |
PCT/CN2023/074125 WO2024040870A1 (en) | 2022-08-24 | 2023-02-01 | Text image generation, training, and processing methods, and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211015424.6A CN115082598B (en) | 2022-08-24 | 2022-08-24 | Text image generation, training, text image processing method and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115082598A CN115082598A (en) | 2022-09-20 |
CN115082598B true CN115082598B (en) | 2023-07-11 |
Family
ID=83244124
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211015424.6A Active CN115082598B (en) | 2022-08-24 | 2022-08-24 | Text image generation, training, text image processing method and electronic equipment |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115082598B (en) |
WO (1) | WO2024040870A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115082598B (en) * | 2022-08-24 | 2023-07-11 | 北京百度网讯科技有限公司 | Text image generation, training, text image processing method and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5594809A (en) * | 1995-04-28 | 1997-01-14 | Xerox Corporation | Automatic training of character templates using a text line image, a text line transcription and a line image source model |
CN113920296A (en) * | 2021-11-23 | 2022-01-11 | 厦门市美亚柏科信息股份有限公司 | Text recognition method and system based on comparative learning |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5680479A (en) * | 1992-04-24 | 1997-10-21 | Canon Kabushiki Kaisha | Method and apparatus for character recognition |
US10489682B1 (en) * | 2017-12-21 | 2019-11-26 | Automation Anywhere, Inc. | Optical character recognition employing deep learning with machine generated training data |
RU2695489C1 (en) * | 2018-03-23 | 2019-07-23 | Общество с ограниченной ответственностью "Аби Продакшн" | Identification of fields on an image using artificial intelligence |
CN111695385B (en) * | 2019-03-15 | 2023-09-26 | 杭州海康威视数字技术股份有限公司 | Text recognition method, device and equipment |
CN109978044B (en) * | 2019-03-20 | 2021-03-19 | 广州云测信息技术有限公司 | Training data generation method and device, and model training method and device |
CN110796031B (en) * | 2019-10-11 | 2024-08-02 | 腾讯科技(深圳)有限公司 | Table identification method and device based on artificial intelligence and electronic equipment |
US11295155B2 (en) * | 2020-04-08 | 2022-04-05 | Konica Minolta Business Solutions U.S.A., Inc. | Online training data generation for optical character recognition |
WO2022154787A1 (en) * | 2021-01-13 | 2022-07-21 | Hewlett-Packard Development Company, L.P. | Image region of interest defect detection |
CN112766418B (en) * | 2021-03-02 | 2024-08-02 | 阳光财产保险股份有限公司 | Image text direction classification method, device, equipment and storage medium |
CN113657370B (en) * | 2021-08-26 | 2024-04-23 | 北京有竹居网络技术有限公司 | Character recognition method and related equipment thereof |
CN114529909A (en) * | 2022-02-17 | 2022-05-24 | 北京百度网讯科技有限公司 | Sample data set generation method and device and electronic equipment |
CN114818708B (en) * | 2022-04-20 | 2023-04-18 | 北京百度网讯科技有限公司 | Key information extraction method, model training method, related device and electronic equipment |
CN114639064B (en) * | 2022-05-18 | 2022-09-02 | 智洋创新科技股份有限公司 | Water level identification method and device |
CN115082598B (en) * | 2022-08-24 | 2023-07-11 | 北京百度网讯科技有限公司 | Text image generation, training, text image processing method and electronic equipment |
- 2022-08-24: CN application CN202211015424.6A filed; granted as CN115082598B (active)
- 2023-02-01: PCT application PCT/CN2023/074125 filed; published as WO2024040870A1 (status unknown)
Also Published As
Publication number | Publication date |
---|---|
WO2024040870A1 (en) | 2024-02-29 |
CN115082598A (en) | 2022-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112966522B (en) | Image classification method and device, electronic equipment and storage medium | |
CN113033537B (en) | Method, apparatus, device, medium and program product for training a model | |
WO2022033095A1 (en) | Text region positioning method and apparatus | |
US12183056B2 (en) | Adversarially robust visual fingerprinting and image provenance models | |
CN113642583B (en) | Deep learning model training method for text detection and text detection method | |
CN113627439A (en) | Text structuring method, processing device, electronic device and storage medium | |
CN114429633B (en) | Text recognition method, training method and device of model, electronic equipment and medium | |
CN115861462B (en) | Training method and device for image generation model, electronic equipment and storage medium | |
CN114612743A (en) | Deep learning model training method, target object identification method and device | |
CN115130581A (en) | Sample generation method, training method, data processing method and electronic device | |
CN116188917B (en) | Defect data generation model training method, defect data generation method and device | |
CN115082598B (en) | Text image generation, training, text image processing method and electronic equipment | |
CN113343979B (en) | Method, apparatus, device, medium and program product for training a model | |
CN113239215B (en) | Classification method and device for multimedia resources, electronic equipment and storage medium | |
CN116824609B (en) | Document format detection method and device and electronic equipment | |
CN115880506B (en) | Image generation method, model training method and device and electronic equipment | |
CN114926322B (en) | Image generation method, device, electronic equipment and storage medium | |
CN114329475B (en) | Training method, device and equipment for malicious code detection model | |
CN115797833A (en) | Lens segmentation method, visual task processing method, device, electronic equipment and medium | |
CN115392389A (en) | Cross-modal information matching and processing method and device, electronic equipment and storage medium | |
CN111597368A (en) | Data processing method and device | |
CN115497113B (en) | Information generation method, device, electronic equipment and storage medium | |
US20250069356A1 (en) | Object detection and visual grounding at the edge | |
CN115719488B (en) | Text recognition method, text recognition device, electronic equipment and storage medium | |
CN115937347A (en) | Image generation method, image generation device, electronic device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||