
CN118522019B - Text recognition method, electronic device and storage medium - Google Patents


Info

Publication number
CN118522019B
Authority
CN
China
Prior art keywords
image
feature
representation
feature representation
representations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410996958.4A
Other languages
Chinese (zh)
Other versions
CN118522019A
Inventor
沈孔怀
王科洋
吕翠文
邵明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202410996958.4A
Publication of CN118522019A
Application granted
Publication of CN118522019B

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G06V30/18019 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections by matching or filtering
    • G06V30/18038 Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters
    • G06V30/18048 Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters with interaction between the responses of different filters, e.g. cortical complex cells
    • G06V30/18057 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a text recognition method, an electronic device and a storage medium. The text recognition method comprises: obtaining an image to be recognized; performing a first downsampling on the feature representations of a plurality of image blocks obtained by dividing the image to be recognized, to obtain an initial feature representation, wherein the combination of the feature representations of the image blocks is the feature representation of the image to be recognized and the initial feature representation carries at least part of the feature representations of the image blocks; fusing the initial feature representation with the feature representations of the image blocks to obtain a target feature representation, wherein the number of feature representations of image blocks contained in the target feature representation is the same as the number contained in the initial feature representation; and performing text recognition on the target feature representation to obtain a text recognition result for the image to be recognized. By using the target feature representation for text recognition, the scheme improves the efficiency of text recognition.

Description

Text recognition method, electronic device and storage medium
Technical Field
The present application relates to the field of image processing, and in particular, to a text recognition method, an electronic device, and a storage medium.
Background
Text recognition technology has advanced tremendously over the last decades, but it remains challenging. Text recognition in natural scenes, in particular, is affected by many factors; here, natural-scene text recognition refers to recognizing text in an image captured in a natural scene. Low definition of the image to be recognized itself degrades the recognition result. In the prior art, a high-resolution image to be recognized therefore needs to be used so that small target texts in the image are clearer, and the clearer small target texts improve the recognition result. However, a high-resolution image to be recognized involves a large amount of computation during text recognition, so that a general text recognition model recognizes text slowly or cannot complete recognition at all; recognition becomes possible only after the structure of the model is expanded and the model is retrained, and the whole process is cumbersome.
In view of these shortcomings of the prior art, how to provide an effective text recognition scheme is a technical problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
The application provides at least a text recognition method, an electronic device and a storage medium.
The application provides a text recognition method, which comprises: obtaining an image to be recognized; performing a first downsampling on the feature representations of a plurality of image blocks obtained by dividing the image to be recognized, to obtain an initial feature representation, wherein the combination of the feature representations of the image blocks is the feature representation of the image to be recognized and the initial feature representation carries at least part of the feature representations of the image blocks; fusing the initial feature representation with the feature representations of the image blocks to obtain a target feature representation, wherein the number of feature representations of image blocks contained in the target feature representation is the same as the number contained in the initial feature representation; and performing text recognition on the target feature representation to obtain a text recognition result for the image to be recognized.
The application provides a text recognition device, which comprises an acquisition module, a sampling module, a fusion module and a recognition module. The acquisition module is used for acquiring an image to be recognized. The sampling module is used for performing a first downsampling on the feature representations of a plurality of image blocks obtained by dividing the image to be recognized, to obtain an initial feature representation, wherein the combination of the feature representations of the image blocks is the feature representation of the image to be recognized and the initial feature representation carries at least part of the feature representations of the image blocks. The fusion module is used for fusing the initial feature representation with the feature representations of the image blocks to obtain a target feature representation, wherein the number of feature representations of image blocks contained in the target feature representation is the same as the number contained in the initial feature representation. The recognition module is used for performing text recognition on the target feature representation to obtain a text recognition result for the image to be recognized.
The application provides an electronic device, which comprises a memory and a processor, wherein the processor is used for executing program instructions stored in the memory so as to implement the above text recognition method.
The present application provides a computer readable storage medium having stored thereon program instructions which, when executed by a processor, implement the above-described text recognition method.
According to the above scheme, the image to be identified is obtained, and the feature representations of the image blocks obtained by dividing the image to be identified are first downsampled to obtain an initial feature representation. The initial feature representation, which contains a smaller number of feature representations, is then fused with the feature representations of the image blocks to obtain a target feature representation. Because the number of feature representations contained in the target feature representation is smaller than the number of feature representations contained in the image to be identified, performing text recognition on the target feature representation rather than directly on the image to be identified reduces the number of feature representations processed during recognition and thereby improves the efficiency of text recognition.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart of a text recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an image to be recognized in an embodiment of a text recognition method according to the present application;
FIG. 3 is a second flow chart of an embodiment of a text recognition method according to the present application;
FIG. 4 is a flowchart illustrating a third embodiment of a text recognition method according to the present application;
FIG. 5 is a flowchart illustrating a text recognition method according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating the structure of an embodiment of a text recognition device according to the present application;
FIG. 7 is a schematic diagram of an embodiment of an electronic device of the present application;
FIG. 8 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
The following describes embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean that a exists alone, while a and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
The application provides a text recognition method and a text recognition apparatus. Application scenarios of the text recognition method include, but are not limited to, text recognition on images containing text in natural scenes. The execution subject of the text recognition method may be any device having a text recognition function, for example, a text recognition apparatus. The text recognition apparatus may be provided in a terminal device, a server or another processing device, where the terminal device may be a device for text recognition, a user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, etc. In some possible implementations, the text recognition method may be implemented by a processor invoking computer-readable instructions stored in a memory.
Referring to fig. 1, fig. 1 is a flowchart illustrating an embodiment of a text recognition method according to the present application. Specifically, the text recognition method may include the steps of:
Step S11: acquiring an image to be identified.
The image to be identified may be an image containing text to be recognized. Specifically, the image to be identified may be a natural scene picture, a scanned document, a picture of handwritten text, a screenshot, a meme or sticker containing text information, a design work, medical image data, and the like. A natural scene picture may be a picture with text information such as a street signboard, a billboard, a shop sign, a road sign, a menu or a product package. A scanned document may be a digital scan of a contract, file, report, invoice, business card or similar document from which text information needs to be extracted. A picture of handwritten text may show handwriting of different styles and qualities, such as handwritten notes, letters, envelopes, business cards or calligraphic works. A screenshot may be a screen capture containing text information, such as an application interface, a web page screenshot or a mobile phone page screenshot. A design work may be artwork or text information in a poster, an advertising design draft, a book cover, a painting and the like. Medical image data may be pictures from the medical field, such as medical reports, imaging diagnosis reports, or annotation information in medical images. The above is merely illustrative and does not limit the type of the image to be identified.
The image to be identified may be acquired in real time by an image acquisition device, read from a database, or captured as a frame from a video file.
Step S12: performing a first downsampling on the feature representations of a plurality of image blocks obtained by dividing the image to be identified, to obtain an initial feature representation.
The combination of the feature representations of the plurality of image blocks is a feature representation of the image to be identified, the initial feature representation carrying at least part of the feature representations of the image blocks.
First, the image to be identified is divided into image blocks, and feature representations of the plurality of image blocks are obtained. Then, the feature representations of the plurality of image blocks are first downsampled to obtain an initial feature representation, where the combination of the feature representations of the plurality of image blocks constitutes the feature representation of the image to be identified. It can be understood that the feature representations of the plurality of image blocks may refer to the feature representations of all image blocks contained in the entire image to be identified; performing the first downsampling on the feature representations of the plurality of image blocks is therefore equivalent to performing the first downsampling on the feature representation of the image to be identified to obtain the initial feature representation.
The image to be identified is divided into image blocks to obtain the feature representations of the plurality of image blocks. In some application scenarios, the image to be identified is first divided according to a preset size into a plurality of image blocks, where each image block obtained by the division can be regarded as a patch. Then each image block is vector-mapped, and the feature representation of the image block is obtained from this vector mapping; the feature representation of an image block may be referred to as a token.
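As a minimal sketch of this step (assuming a PyTorch-style implementation; the patch size, embedding dimension and class name are illustrative choices, not taken from the patent), the division into patches and the vector mapping to tokens can be realized with a strided convolution:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Divide an image into patches and vector-map each patch to a token."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=256):
        super().__init__()
        # A conv with kernel = stride = patch_size maps every patch to one vector.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, image):            # image: (B, C, H, W)
        tokens = self.proj(image)        # (B, embed_dim, H/ps, W/ps)
        b, d, h, w = tokens.shape
        # Flatten the patch grid: one token (feature representation) per image block.
        return tokens.flatten(2).transpose(1, 2), (h, w)   # (B, h*w, embed_dim)
```

The combination of all returned tokens then plays the role of the feature representation of the whole image to be identified.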
The first downsampling may be a filtering of the feature representations of the plurality of image blocks according to a preset criterion, yielding the feature representations of at least part of the image blocks. It will be appreciated that obtaining the initial feature representation through the first downsampling reduces the number of feature representations of image blocks. The preset criterion may be a sampling criterion associated with at least one preset direction in the image to be identified. During the first downsampling, noise can be filtered out, or the features related to the text to be identified can be enhanced.
Step S13: fusing the initial feature representation with the feature representations of the plurality of image blocks to obtain a target feature representation.
The number of feature representations of the image blocks contained in the target feature representation is the same as the number of feature representations of the image blocks contained in the initial feature representation.
The initial feature representation carries the feature representations of at least part of the image blocks; it can be understood that the feature representations of at least part of the image blocks are composed or assembled into the initial feature representation. The initial feature representation carries noise-reduced local feature information, while the feature representations of the plurality of image blocks carry global feature information for the entire image to be identified. In some application scenarios, fusing the initial feature representation with the feature representations of the plurality of image blocks therefore combines the noise-reduced local feature information with the global feature information of the whole image, so that the resulting target feature representation can balance the importance of global and local feature information and effectively capture the important information in both inputs. In some application scenarios, the fusion may simply combine the initial feature representation with the feature representations of the plurality of image blocks to obtain the target feature representation. In other application scenarios, an attention mechanism may be used to dynamically adjust the weights between the initial feature representation and the feature representations of the plurality of image blocks to obtain the target feature representation. In still other application scenarios, several parallel or cascaded network structures may be designed, each processing the initial feature representation or the feature representations of the plurality of image blocks, and the outputs of the networks are fused or jointly trained to obtain the target feature representation, so that local and global feature information at different levels can be fully utilized. A sketch of the attention-based variant is given below.
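The attention-based variant could look roughly like the following sketch (an assumption about one possible realization, not the patent's exact network; dimensions and the class name are illustrative). The initial feature representation queries the full set of patch tokens, so the output keeps the same, reduced token count:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Fuse the downsampled (initial) tokens with all patch tokens via cross-attention."""
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, initial_tokens, patch_tokens):
        # initial_tokens: (B, M, D) - reduced set, M < N (local information)
        # patch_tokens:   (B, N, D) - all image blocks (global information)
        fused, _ = self.attn(query=initial_tokens,
                             key=patch_tokens, value=patch_tokens)
        # The residual keeps the local information carried by the initial tokens;
        # the target representation has the same number of tokens as the initial one.
        return self.norm(initial_tokens + fused)
```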
Step S14: performing text recognition on the target feature representation to obtain a text recognition result about the image to be recognized.
The text recognition result regarding the image to be recognized may be a prediction result of the text to be recognized in the image to be recognized.
Referring to fig. 2, fig. 2 is a schematic diagram of an image to be recognized in an embodiment of a text recognition method according to the present application.
The text to be recognized is made up of at least one word. For example, the text to be recognized may be an article, a paragraph, a sentence, or the like; its specific form is not limited here. The number of texts to be recognized may be one or more. A word may be a single Chinese character, a word composed of Chinese characters, or a word composed of one or more English letters, etc.; the specific form of a word is not limited here.
As shown in FIG. 2, the image to be identified may be an image of a commercial sign on a street carrying text information. The building in FIG. 2 carries the address information "XX Road, No. 4" and the store information "public abcd store". For example, the word of the text to be recognized at a certain image position in the image to be recognized may be "public", or it may be the number "4". The text recognition result for that image position may then be the word "public", some other word, or a blank. First, the text recognition result may indicate whether the image position contains a word: if it does, text recognition at that position has succeeded. Further, the text recognition result may indicate whether the word at that position is "public"; specifically, the result may be that the word at the position is "public" or some other word. If the image position contains no word, text recognition at that position has failed. In some application scenarios, the text recognition result may be the address information "XX Road, No. 4" or the store information "public abcd store" in the text information.
Text recognition on the target feature representation may be performed by inputting the target feature representation into a preset encoder for feature extraction, and then inputting the extracted features into a preset decoder for recognition processing or text detection processing, thereby obtaining the text recognition result about the image to be recognized.
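A hedged sketch of such an encoder-decoder recognition head (the Transformer encoder, the linear classifier standing in for the preset decoder, the vocabulary size and the greedy per-token decoding are illustrative assumptions, not details given in the patent):

```python
import torch
import torch.nn as nn

class RecognitionHead(nn.Module):
    """Encode the target feature representation and decode it into characters."""
    def __init__(self, embed_dim=256, num_classes=7000, num_layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.classifier = nn.Linear(embed_dim, num_classes)  # character classes + blank

    def forward(self, target_tokens):            # (B, M, D) target feature representation
        encoded = self.encoder(target_tokens)    # preset encoder: feature extraction
        logits = self.classifier(encoded)        # stand-in for the preset decoder
        return logits.argmax(dim=-1)             # greedy character prediction per token
```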
According to the above scheme, the image to be identified is obtained, and the feature representations of the image blocks obtained by dividing the image to be identified are first downsampled to obtain an initial feature representation. The initial feature representation, which contains a smaller number of feature representations, is then fused with the feature representations of the image blocks to obtain a target feature representation. Because the number of feature representations contained in the target feature representation is smaller than the number of feature representations contained in the image to be identified, performing text recognition on the target feature representation rather than directly on the image to be identified reduces the number of feature representations processed during recognition and thereby improves the efficiency of text recognition.
In some embodiments, step S12 may include: first, extracting features of the image to be identified and determining sampling weights of the image blocks in the image to be identified in at least one preset direction; then, sampling the feature representations of the image blocks along the at least one preset direction according to the sampling weights to obtain the initial feature representation.
And extracting the characteristics of the image to be identified, and determining the sampling weight of each image block in the image to be identified in at least one preset direction.
The sampling weight represents the importance or attention given to each image block in the image to be identified in at least one preset direction. Specifically, the sampling weight may be a value derived from the feature extraction result of each image block in the image to be identified. In some application scenarios, the sampling weight may be a mapping value obtained by applying normalization or an activation function to the feature extraction result of each image block; the mapping value ranges from 0 to 1, and the larger the mapping value, the higher the importance indicated by the sampling weight, while the smaller the mapping value, the lower the importance. In other application scenarios, when the image blocks in the image to be identified are arranged in a plurality of rows and columns, the sampling weight may be a table of row values for all rows or a table of column values for all columns. For example, if the image blocks in the image to be identified are arranged as 7×8, i.e. 7 rows and 8 columns, the sampling weight may be a table of 7 row values, where each value is the importance of the image blocks at the corresponding position in the row direction, or a table of 8 column values, where each value is the importance of the image blocks at the corresponding position in the column direction. It will be appreciated that, for a given image block of the image to be identified, its mapping value in the row direction and its mapping value in the column direction may differ; in other words, the importance of the image block in the row direction and in the column direction may be different.
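To make the mapping-value idea concrete, here is a hedged illustration (the 7×8 grid, the random scores, the sigmoid activation and the row/column averaging are assumptions for the sketch, not the patent's exact computation):

```python
import torch

# One raw feature-extraction score per image block in a 7x8 grid.
scores = torch.randn(7, 8)

block_weights = torch.sigmoid(scores)       # mapping values in the range 0..1

# Row/column weight tables: one importance value per row and per column.
row_weights = block_weights.mean(dim=1)     # shape (7,), row direction
col_weights = block_weights.mean(dim=0)     # shape (8,), column direction
print(row_weights.shape, col_weights.shape)  # torch.Size([7]) torch.Size([8])
```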
In some application scenarios, the process of extracting features of an image to be identified and determining the sampling weight of each image block in the image to be identified in at least one preset direction may be that firstly, the image to be identified is divided according to a preset size to obtain a plurality of image blocks. The image to be identified comprises a plurality of image blocks, and sampling feature extraction is respectively carried out on each image block in the image to be identified by taking the image block as a unit to obtain feature extraction results of each image block. And arranging or combining the feature extraction results of the image blocks according to at least one preset direction to obtain a feature extraction result arrangement set in the at least one preset direction. The image blocks included in the image to be identified are illustratively arranged in a plurality of rows and/or columns. The feature extraction result arrangement set in one preset direction comprises feature extraction result arrangement subsets obtained by arranging or combining feature extraction results of each image block in each preset direction. The feature extraction result arrangement set in one preset direction includes all feature extraction result arrangement subsets in the preset direction. For example, the feature extraction result arrangement set in the row direction includes feature extraction result arrangement subsets of all rows. For a row of image blocks in the image to be identified, the feature extraction result arrangement subset of the row includes feature extraction results of all image blocks in the row.
In other application scenes, the process of extracting features of the image to be identified and determining the sampling weight of each image block in the image to be identified in at least one preset direction may be that firstly, each image block in the image to be identified is divided in at least one preset direction according to a preset scale to obtain an image block set in at least one preset direction. The set of tiles in at least one preset direction may comprise a plurality of subsets of tiles in the same preset direction. The image blocks in the image to be identified are divided into image blocks according to a preset scale, so that a plurality of image blocks are obtained. The plurality of image blocks may be to divide the image to be identified into grids according to a preset scale, where each grid corresponds to one image block. The preset scale is used to represent the size of each image block. The image blocks in one preset direction are taken as a subset of the image blocks in one preset direction. The subset of image blocks in one preset direction comprises a plurality of image blocks in the same preset direction. The preset direction includes a row direction or a column direction. The preset direction includes a transverse direction or a longitudinal direction. In some application scenarios, the preset direction is a row direction corresponding to the preset direction being a lateral direction. In some application scenarios, the preset direction is a column direction and corresponds to the preset direction being a longitudinal direction. Illustratively, a plurality of tiles corresponding to a row are taken as a subset of tiles in the row direction. The image to be identified comprises a plurality of rows or columns. And a plurality of image blocks corresponding to all rows are used as an image block set in the row direction. The set of tiles in the row direction comprises a subset of tiles in each row direction. Then, under the condition that at least one image block set in the preset direction is obtained, taking the image block set in the at least one preset direction as a unit, sampling feature extraction is carried out on each image block in the image block set in the at least one preset direction, and a feature extraction result corresponding to the image block set in the at least one preset direction is obtained. By way of example, sampling feature extraction is performed on each image block in the image block set in the row direction by taking at least the image block set in the row preset direction as a unit, so as to obtain a feature extraction result corresponding to the image block set in the row direction. Specifically, sampling feature extraction is performed on a subset of image blocks in the row direction in the image block set in the row direction, so as to obtain feature extraction results corresponding to the subset of image blocks in each row direction.
In some application scenarios, each image block in the image block set in at least one preset direction is taken as the sampling feature extraction input, and the feature extraction result corresponding to the image block set in that preset direction is taken as the sampling feature extraction output. Alternatively, each image block in the image to be identified is taken as the sampling feature extraction input, and the feature extraction result of each image block is taken as the sampling feature extraction output. In some application scenarios, the sampling feature extraction may be performed by feeding the sampling feature extraction input into a preset feature extraction network layer and taking the output of that network layer as the sampling feature extraction output. The preset feature extraction network may be a convolutional neural network (CNN), or a deep learning model such as a recurrent neural network (RNN) or a Transformer.
In some application scenarios, the step of dividing the image to be identified into image blocks to obtain the feature representations of the plurality of image blocks and the step of extracting features of the image to be identified to determine the sampling weights of the image blocks in at least one preset direction may be performed in parallel, and may be executed in either order. Both steps only need to be completed before the step of sampling the feature representations of the image blocks along the at least one preset direction according to the sampling weights to obtain the initial feature representation.
It can be seen that decomposing the feature representations of the plurality of image blocks in the image to be identified along at least one preset direction for importance analysis yields more accurate sampling weights, so that the initial feature representation determined from them is small in data volume yet high in importance. Text recognition based on such an initial feature representation is therefore more accurate.
In some embodiments, the step of extracting features of the image to be identified and determining the sampling weight of each image block in the image to be identified in at least one preset direction may include the steps of firstly extracting gradient features of the image to be identified and performing a second downsampling process to obtain gradient information in each preset direction and downsampled images. And then, respectively carrying out fusion processing on each gradient information and the downsampled image to obtain sampling weights corresponding to each preset direction.
Gradient feature extraction and the second downsampling are performed on the image to be identified to obtain the gradient information in each preset direction and the downsampled image. The gradient feature extraction and the second downsampling may be performed serially or in parallel. The gradient feature extraction extracts gradient information of the image to be identified in the different preset directions. The second downsampling may reduce the size of the image to be identified; specifically, it may halve the size of the image to be identified. In some application scenarios, when the gradient feature extraction and the second downsampling are performed in parallel, gradient feature extraction is performed on the image to be identified to obtain initial gradient information in each preset direction, and the second downsampling is performed on the image to be identified synchronously to obtain an initial downsampled image; each piece of initial gradient information is then combined with the initial downsampled image to obtain the gradient information in each preset direction, and the initial downsampled image is taken as the downsampled image. In other application scenarios, when the two operations are performed serially, the second downsampling may be performed on the image to be identified first to obtain the downsampled image, and gradient feature extraction is then performed on the downsampled image to obtain the gradient information in each preset direction. Alternatively, gradient feature extraction may be performed on the image to be identified first to obtain the gradient information in each preset direction, and the second downsampling is then performed on the gradient information in each preset direction to obtain the downsampled image.
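A minimal sketch of the parallel variant, assuming Sobel filters for the gradient extraction and a simple half-size resize for the second downsampling (both are illustrative choices, not specified by the patent):

```python
import cv2
import numpy as np

def gradients_and_downsample(gray_image: np.ndarray):
    """Extract horizontal/vertical gradient maps and a half-size downsampled image."""
    # Initial gradient information in each preset direction (Sobel is an assumption).
    grad_x = cv2.Sobel(gray_image, cv2.CV_32F, 1, 0, ksize=3)   # lateral direction
    grad_y = cv2.Sobel(gray_image, cv2.CV_32F, 0, 1, ksize=3)   # longitudinal direction

    # Second downsampling: halve the image size.
    h, w = gray_image.shape
    downsampled = cv2.resize(gray_image, (w // 2, h // 2),
                             interpolation=cv2.INTER_AREA)

    # Resize the gradient maps to match the downsampled image so they can be fused later.
    grad_x = cv2.resize(grad_x, (w // 2, h // 2), interpolation=cv2.INTER_AREA)
    grad_y = cv2.resize(grad_y, (w // 2, h // 2), interpolation=cv2.INTER_AREA)
    return grad_x, grad_y, downsampled
```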
In other application scenarios, before the steps of extracting gradient features and performing second downsampling processing on the image to be identified to obtain gradient information in each preset direction and downsampling images, the text recognition method may include the steps of preprocessing the image to be identified to obtain a preprocessed image. Specifically, the preprocessing of the image to be recognized may be graying processing and smoothing processing of the image to be recognized. The graying processing and smoothing processing of the image to be recognized may be performed serially or in parallel. In some application scenarios, when the graying processing and the smoothing processing are executed in parallel, the graying processing is performed on the image to be identified, so as to obtain an initial graying image. And synchronously smoothing the image to be identified to obtain a smoothed image. And combining the initial gray level image with the smoothed image to obtain the preprocessing image. In other application scenes, when the graying processing and the smoothing processing are performed in series, the graying processing is performed on the image to be identified, so as to obtain a graying image. And then carrying out smoothing treatment on the gray-scale image to obtain a smooth image. The smoothed image is taken as a preprocessed image. In other application scenes, when the graying processing and the smoothing processing are performed in series, the smoothing processing is performed on the image to be recognized, so as to obtain a smoothed image. And then carrying out graying treatment on the smooth image to obtain a graying image. The graying image is taken as a preprocessing image.
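For example, the serial grayscale-then-smooth variant of the preprocessing might be sketched as follows (a Gaussian blur with a 3×3 kernel is an assumed choice of smoothing, not mandated by the patent):

```python
import cv2

def preprocess(image_bgr):
    """Serial variant: graying first, then smoothing."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)   # graying processing
    smoothed = cv2.GaussianBlur(gray, (3, 3), 0)         # smoothing processing
    return smoothed                                      # preprocessed image
```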
The step of performing gradient feature extraction and second downsampling processing on the image to be identified to obtain gradient information and downsampled images in each preset direction may include the step of performing gradient feature extraction and second downsampling processing on the preprocessed image to obtain gradient information and downsampled images in each preset direction. Gradient feature extraction and second downsampling of the preprocessed image may be performed serially or in parallel. Illustratively, the gradient feature extraction and the second downsampling process on the preprocessed image are performed serially. The serial execution sequence may be to perform a second downsampling process on the preprocessed image to obtain a downsampled image. And extracting gradient characteristics of the downsampled image to obtain gradient information in each preset direction. It will be appreciated that the gradient feature extraction and the second downsampling may be performed in series or in parallel with reference to the above-mentioned image to be identified, and will not be described herein.
In other application scenes, after the steps of extracting gradient features and performing second downsampling processing on the image to be identified to obtain gradient information and downsampled images in each preset direction, the text recognition method can further comprise the steps of respectively performing normalization processing on the gradient information and the downsampled images in each preset direction to obtain normalized gradient information and normalized downsampled images in each preset direction. Specifically, the gradient information in each preset direction is normalized respectively, so as to obtain normalized gradient information in each preset direction. And carrying out normalization processing on the downsampled image to obtain a normalized downsampled image. The step of respectively performing fusion processing on the gradient information and the downsampled image to obtain sampling weights corresponding to the preset directions may further include the step of respectively performing fusion processing on the normalized gradient information and the normalized downsampled image to obtain sampling weights corresponding to the preset directions.
The at least one preset direction includes a transverse direction and a longitudinal direction, and each gradient information includes first gradient information in the transverse direction and second gradient information in the longitudinal direction. The step of performing gradient feature extraction and second downsampling on the image to be identified to obtain gradient information in each preset direction and downsampling image may include performing gradient feature extraction and second downsampling on the image to be identified to obtain downsampled image, first gradient information in a transverse direction and second gradient information in a longitudinal direction. The step of respectively performing fusion processing on each gradient information and the downsampled image to obtain sampling weights corresponding to each preset direction may include performing fusion processing on the first gradient information and the downsampled image to obtain a first sampling weight in a transverse direction, and performing fusion processing on the second gradient information and the downsampled image to obtain a second sampling weight in a longitudinal direction. In other application scenarios, the step of respectively performing fusion processing on the normalized gradient information and the normalized downsampled image to obtain sampling weights corresponding to the preset directions may include the steps of performing fusion processing on the normalized first gradient information and the normalized downsampled image to obtain a first transverse sampling weight, and performing fusion processing on the normalized second gradient information and the normalized downsampled image to obtain a second longitudinal sampling weight.
The gradient information in each preset direction can be used for representing the decomposition information of signals corresponding to the image to be identified in different frequencies and directions. The first gradient information is used to represent decomposition information of the image to be recognized in the lateral direction. The second gradient information is used to represent decomposition information of the image to be recognized in the longitudinal direction. The downsampled image includes semantic information representing an image to be identified obtained by a second downsampling process. The downsampled image may be, for example, an image of the image to be identified that has undergone a change in size. For example, the downsampled image may be an image of the image to be identified after halving the size. The sampling weights corresponding to the preset directions may represent the importance level of each image block in the image to be identified in the preset directions. In particular, the first sampling weight may represent a degree of importance of each image block in the image to be identified in the lateral direction. The second sampling weight may represent a degree of importance of each image block in the image to be identified in the longitudinal direction.
In some embodiments, the at least one preset direction includes a transverse direction and a longitudinal direction, and each piece of gradient information includes first gradient information in the transverse direction and second gradient information in the longitudinal direction. The step of fusing each piece of gradient information with the downsampled image to obtain the sampling weight for each preset direction may include the following steps. First, the first gradient information is taken as target gradient information and the sampling weight corresponding to the transverse direction is taken as the target sampling weight, or the second gradient information is taken as target gradient information and the sampling weight corresponding to the longitudinal direction is taken as the target sampling weight. Then, gradient fusion is performed on the target gradient information and the downsampled image to obtain a first fused image. Next, initial feature extraction is performed on the first fused image to obtain initial image features. A first dimension transformation is then applied to the initial image features to obtain first-dimension features. Finally, the first-dimension features are input into a perceptron module for further feature extraction to obtain the target sampling weight.
In some application scenarios, the process of performing gradient fusion processing on the target gradient information and the downsampled image to obtain the first fused image may be to splice the target gradient information and the downsampled image along the target direction, so as to obtain the first fused image in the target direction. And taking the preset direction corresponding to the target gradient information as a target direction. The gradient fusion process may be to stitch the target gradient information with the downsampled image along the target direction. The first fused image in the target direction may contain gradient information in the target direction and semantic information of the downsampled image.
In some application scenarios, the process of extracting the initial feature of the first fused image to obtain the initial image feature may be that the first fused image in the target direction is input into a preset feature extraction network, and the initial feature of the first fused image in the target direction is extracted to obtain the initial image feature in the target direction. The preset feature extraction network may be the above preset feature extraction network layer. The preset feature extraction network may be, for example, the feature extraction network ConvNet.
In some application scenarios, the process of performing the first dimension transformation on the initial image feature to obtain the first dimension feature may be that the initial image feature in the target direction is input to the first dimension transformation module, and the first dimension transformation is performed on the initial image feature in the target direction to obtain the first dimension feature in the target direction. Specifically, the first dimension transformation processing enables the first dimension characteristics to adapt to the input dimension of the perceptron module, and the perceptron module outputs normally.
In some application scenarios, the first-dimension features are input into the perceptron module for further feature extraction to obtain the target sampling weight. The perceptron module may be an MLP layer. Specifically, the first-dimension features in the target direction are input into the perceptron module for further feature extraction to obtain higher-level features in the target direction, and the higher-level features in the target direction are determined as the target sampling weight. In some application scenarios, determining the higher-level features in the target direction as the target sampling weight may involve performing normalization or activation-function processing on the higher-level features in the target direction to obtain a mapping value of the feature extraction result, and taking this mapping value as the target sampling weight in the target direction.
In some application scenarios, gradient fusion is performed on the first gradient information and the downsampled image to obtain the first fused image in the transverse direction. Initial feature extraction is performed on the first fused image in the transverse direction to obtain initial image features in the transverse direction, and the first dimension transformation is applied to these features to obtain first-dimension features in the transverse direction. The first-dimension features are then input into the perceptron module for further feature extraction to obtain higher-level first-dimension features in the transverse direction, which are determined as the sampling weight corresponding to the transverse direction, i.e. the first sampling weight. In some application scenarios, determining the higher-level first-dimension features in the transverse direction as the sampling weight corresponding to the transverse direction may involve performing normalization or activation-function processing on them to obtain a mapping value of the feature extraction result, and taking this mapping value as the first sampling weight. In other application scenarios, gradient fusion is performed on the second gradient information and the downsampled image to obtain the first fused image in the longitudinal direction. Initial feature extraction is performed on the first fused image in the longitudinal direction to obtain initial image features in the longitudinal direction, the first dimension transformation is applied to obtain first-dimension features in the longitudinal direction, and these are input into the perceptron module for further feature extraction to obtain higher-level first-dimension features in the longitudinal direction, which are determined as the sampling weight corresponding to the longitudinal direction, i.e. the second sampling weight. Likewise, determining the higher-level first-dimension features in the longitudinal direction as the sampling weight corresponding to the longitudinal direction may involve performing normalization or activation-function processing on them to obtain a mapping value of the feature extraction result, and taking this mapping value as the second sampling weight. A combined sketch of this per-direction computation is given below.
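The sketch below strings these steps together for one direction (stacking along the channel dimension as the gradient fusion, a small convolution plus adaptive pooling as the initial feature extraction, a flatten as the first dimension transformation, and a sigmoid to map the MLP output into 0..1 are all assumptions about one possible realization, not the patent's exact modules; the column branch would be analogous with the inputs transposed):

```python
import torch
import torch.nn as nn

class DirectionWeightBranch(nn.Module):
    """Per-direction weight computation: fuse gradient information with the
    downsampled image, extract features, reshape, and map to per-row weights."""
    def __init__(self, num_positions, hidden_dim=32):
        super().__init__()
        self.convnet = nn.Sequential(                    # initial feature extraction
            nn.Conv2d(2, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((num_positions, 1)))    # one spatial slot per row
        self.mlp = nn.Sequential(                        # perceptron module (MLP layer)
            nn.Linear(hidden_dim * num_positions, hidden_dim * num_positions),
            nn.ReLU(),
            nn.Linear(hidden_dim * num_positions, num_positions))

    def forward(self, grad_map, downsampled):
        # Gradient fusion: stack the gradient information with the downsampled image.
        fused = torch.stack([grad_map, downsampled], dim=1)   # (B, 2, H, W), same size maps
        feats = self.convnet(fused)                           # (B, hidden, rows, 1)
        flat = feats.flatten(1)                               # first dimension transformation
        return torch.sigmoid(self.mlp(flat))                  # mapping values in (0, 1)
```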
Because text in natural scenes usually appears on a distinct background plate whose edges differ strongly from the surrounding environment, a frequency adaptation module is introduced. The background plate carries the background information of the area where the text is located; as shown in FIG. 2, it carries the base color of the billboard in the area where the text is located. The frequency adaptation module may be denoted as the Domain Adapter module. The step of extracting the features of the image to be identified and determining the sampling weight of each image block in at least one preset direction is executed by this frequency adaptation module. Introducing the frequency adaptation module therefore reduces interference from the background information of the area where the text is located, enriches the modal information available to the network, makes it easier for the network to attend to the text information present in a high-resolution image, and simplifies training.
In some embodiments, sampling the feature representations of the image blocks along at least one preset direction according to the sampling weights, the obtaining the initial feature representation may comprise sampling the feature representations of the image blocks laterally and/or sampling the feature representations of the image blocks longitudinally. In some application scenarios, when at least one preset direction is a preset direction, the preset direction may be a transverse direction or a longitudinal direction, and the process of obtaining the initial feature representation may be to sample the feature representation of each image block transversely or sample the feature representation of each image block longitudinally along at least one preset direction according to the sampling weight. Under the condition that the preset direction is transverse, the transverse sampling process may be to acquire a first sampling weight, and perform a first screening on the feature representation of each transverse image block according to the first sampling weight to obtain a first selection number of candidate transverse feature representations. The candidate lateral feature representation is taken as the initial feature representation. Under the condition that the preset direction is longitudinal, the longitudinal sampling process may be to acquire a second sampling weight, and according to the second sampling weight, perform a first screening on the feature representation of the image block in each longitudinal direction to obtain a second selection number of candidate longitudinal feature representations. The candidate longitudinal feature representation is taken as an initial feature representation. In other application scenarios, the at least one preset direction is a plurality of preset directions. Illustratively, the present application takes a plurality of preset directions as the transverse direction and the longitudinal direction as an example, and the number of the plurality of preset directions is not limited herein. The process of sampling the feature representation of each image block along at least one preset direction according to the sampling weight to obtain the initial feature representation may be to perform the transverse sampling and the longitudinal sampling in series, or may be to perform the transverse sampling and the longitudinal sampling in parallel. The serial execution of the lateral sampling and the longitudinal sampling may be performed by first performing the lateral sampling of the feature representation of each image block and then performing the longitudinal sampling of the feature representation of each image block. The serial execution of the lateral sampling and the longitudinal sampling may be performed by first longitudinally sampling the feature representation of each image block and then laterally sampling the feature representation of each image block. It will be appreciated that the specific process of lateral sampling and longitudinal sampling may be referred to above, and will not be described here.
In some application scenarios, the initial feature representation carries at least part of the feature representations of the image blocks. The above-mentioned sampling of the feature representations of the image blocks may be to sort the feature representations of the image blocks based on the target sampling weight in a target direction, obtaining an importance ranking of the feature representations of the image blocks, and then to screen the feature representations of the image blocks in each target direction based on the importance ranking, obtaining candidate feature representations. When there is a single preset direction, it may be the transverse or the longitudinal direction, and the candidate feature representations are used directly as the initial feature representation. When there are several preset directions, the target direction may be the transverse or the longitudinal direction, and the initial feature representation is determined based on the candidate feature representations of each direction.
In some embodiments, the step of sampling the feature representations of the image blocks along at least one preset direction according to the sampling weights to obtain an initial feature representation may include the following steps. First, a first selection number of image blocks of the image to be identified in the transverse direction and a second selection number of image blocks of the image to be identified in the longitudinal direction are obtained. Then, according to the first sampling weight, a first screening is performed on the feature representations of the image blocks in the transverse direction to obtain the first selection number of candidate transverse feature representations; and according to the second sampling weight, a first screening is performed on the feature representations of the image blocks in the longitudinal direction to obtain the second selection number of candidate longitudinal feature representations. Next, a second screening is performed on the candidate transverse feature representations and the candidate longitudinal feature representations to obtain a preset number of selection feature representations, where the preset number equals the product of the first selection number and the second selection number. The initial feature representation is then determined based on the selection feature representations.
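As a minimal sketch of the first screening under one reading of the above (keep the top-scoring rows and columns, consistent with the later description of the Token Resample module), assuming the feature representations are stored as a (rows, cols, dim) tensor; the function name, variable names and shapes are illustrative assumptions, not taken from the patent.

```python
import torch

def first_screening(patch_feats, row_weights, col_weights, m1, m2):
    # patch_feats: (rows, cols, dim) feature representations of the image blocks
    # row_weights: (rows,) first sampling weights (one mapping value per image-block row)
    # col_weights: (cols,) second sampling weights (one mapping value per image-block column)
    top_rows = torch.topk(row_weights, k=m1).indices      # importance ranking, keep top m1 rows
    top_cols = torch.topk(col_weights, k=m2).indices      # importance ranking, keep top m2 columns
    candidate_transverse = patch_feats[top_rows]          # candidate transverse feature representations
    candidate_longitudinal = patch_feats[:, top_cols]     # candidate longitudinal feature representations
    return candidate_transverse, candidate_longitudinal, top_rows, top_cols
```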
The first selection number and the second selection number may be preset, and they may be the same or different. In some application scenarios, the first selection number and the second selection number are determined based on the input size of the image encoder. Specifically, the input size of the image encoder may be a first preset number of image blocks in height and a second preset number of image blocks in width; that is, the height of the input size consists of the heights of the first preset number of image blocks, and the width consists of the widths of the second preset number of image blocks. The transverse direction corresponds to the row direction and the longitudinal direction corresponds to the column direction. When the target sampling weight is the first sampling weight, the importance ranking may be a ranking of the mapping values of the feature representations of the image blocks for each row. In some application scenarios, the first screening may, for each row, screen out the feature representations ranked at the top of the mapping value ranking and take them as candidate transverse feature representations. The number of candidate transverse feature representations is the first preset number; the top-ranked feature representations may be those of the top M1 image blocks in the mapping value ranking of each row, where M1 equals the first preset number. When the target sampling weight is the second sampling weight, the mapping values of the feature representations of the image blocks are ranked for each column. In other application scenarios, the first screening may, for each column, screen out the feature representations ranked at the top of the mapping value ranking and take them as candidate longitudinal feature representations. The number of candidate longitudinal feature representations is the second preset number; the top-ranked feature representations may be those of the top M2 image blocks in the mapping value ranking of each column, where M2 equals the second preset number. It will be appreciated that the candidate transverse feature representations and the candidate longitudinal feature representations may be feature representations of the same image blocks or of different image blocks.
In some application scenarios, the second screening may compare whether the candidate transverse feature representations and the candidate longitudinal feature representations are feature representations of the same image blocks. The feature representations belonging to the same image blocks in both the candidate transverse and the candidate longitudinal feature representations are determined as the selection feature representations. The selection feature representations comprise the feature representations of the preset number of image blocks.
In other application scenarios, the second screening may instead overlap the candidate transverse feature representations with the candidate longitudinal feature representations to obtain candidate target feature representations. Because the candidate transverse feature representations and the candidate longitudinal feature representations may be feature representations of at least partly different image blocks, the number of image blocks corresponding to the candidate target feature representations is larger than the number corresponding to the selection feature representations, that is, larger than the preset number. The feature representations of a preset number of image blocks are then selected from the candidate target feature representations by keeping those whose mapping values rank highest, and these are determined as the selection feature representations. The selection feature representations are the feature representations of the preset number of image blocks and may contain the image information of those image blocks. In some application scenarios, the step of determining the initial feature representation based on the selection feature representations may directly combine the selection feature representations to obtain the initial feature representation, so that the initial feature representation comprises the preset number of selection feature representations.
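A hedged sketch of the second screening in its "same image block" variant, continuing the previous sketch: blocks lying in both a selected row and a selected column are kept, which yields exactly the product of the two selection numbers. The union-plus-ranking variant described above would differ; names and shapes remain illustrative.

```python
import torch

def second_screening(patch_feats, top_rows, top_cols):
    # Keep the image blocks that lie in both a selected row and a selected column,
    # giving m1 * m2 selection feature representations (the preset number).
    grid_r, grid_c = torch.meshgrid(top_rows, top_cols, indexing="ij")
    selected = patch_feats[grid_r, grid_c]                 # (m1, m2, dim)
    return selected.reshape(-1, selected.shape[-1]), grid_r.reshape(-1), grid_c.reshape(-1)
```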
It can be considered that the initial feature representation at this time is a feature representation of higher importance in the image to be recognized.
In some embodiments, the step of determining the initial feature representation based on the selection feature representations may include the steps of first obtaining location information corresponding to the selection feature representations, the location information being used to represent a location relationship of the selection feature representations to image blocks of the image to be identified.
And then, carrying out position fusion processing on each selected feature representation and each position information to obtain an initial feature representation.
The positional relationship may be the position information of the image block corresponding to each selection feature representation in the image to be identified. In some application scenarios, the positional relationship may be obtained as follows. First, a blank image of the same size as the image to be identified is acquired. The blank image is then divided into blank image blocks, and each blank image block is position-encoded to obtain its position information; the position information of a blank image block is a trainable two-dimensional position code. The image block corresponding to each selection feature representation is compared with the blank image blocks, and when the position of the image block corresponding to a selection feature representation in the image to be identified is the same as the position of a blank image block in the blank image, the position information of that blank image block is taken as the position information corresponding to the selection feature representation. In some application scenarios, the position fusion processing of the selection feature representations and the position information to obtain the initial feature representation may be a splicing of each selection feature representation with its position information. Specifically, for each selection feature representation, the corresponding position information is added to it as a feature tag, yielding an added selection feature representation, and the added selection feature representations are combined into the initial feature representation. The initial feature representation may comprise the preset number of added selection feature representations.
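A minimal sketch of the position fusion, assuming a grid of trainable two-dimensional position codes indexed by the block position of each selection feature representation; the class name, initialization and the choice of addition (rather than concatenation, the splicing variant mentioned above) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PositionFusion(nn.Module):
    # Trainable two-dimensional position codes, one per block position of a blank
    # image whose block grid matches that of the image to be identified.
    def __init__(self, rows, cols, dim):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(rows, cols, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, selection_feats, sel_rows, sel_cols):
        # selection_feats: (n, dim); sel_rows / sel_cols: (n,) block positions in the image.
        # The position code is attached to each selection feature representation here by
        # addition; concatenation would correspond to the splicing variant above.
        return selection_feats + self.pos_embed[sel_rows, sel_cols]
```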
It is considered that the initial feature representation at this time is a feature representation having a high importance level and positional information in the image to be recognized.
In some embodiments, step S13 described above may include the following steps. First, the initial feature representation and the feature representations of the plurality of image blocks are input into a cross-attention layer to obtain the feature representation output by the cross-attention layer, where the initial feature representation serves as the query vector of the cross-attention layer and the feature representations of the plurality of image blocks serve as its key vector and value vector. Then, feature fusion processing is performed on the initial feature representation and the feature representation output by the cross-attention layer to obtain a progressive feature representation. Finally, normalization processing is performed on the progressive feature representation to obtain the target feature representation.
The feature representations of the plurality of image blocks are the feature representations of all image blocks in the image to be identified, and the inputs to the cross-attention layer are the initial feature representation and these feature representations. In particular, the cross-attention layer may be a Cross Attention module; illustratively, it may use Scaled Dot-Product Attention.
The cross-attention layer takes the input initial feature representation as its query vector and the feature representations of the image blocks as its key vector and value vector, fusing the information of the initial feature representation with that of the feature representations of the image blocks to obtain the feature representation output by the cross-attention layer, so that image information is also drawn from the feature representations of image blocks other than those corresponding to the initial feature representation. The initial feature representation can be understood as the sampled tokens, and the feature representations of the plurality of image blocks as the originally input image tokens. With the sampled tokens as the Query vector and the originally input image tokens as the Key vector and Value vector, the information fusion may be carried out through attention computation. Performing feature fusion processing on the initial feature representation and the feature representation output by the cross-attention layer to obtain the progressive feature representation may be an element-wise addition of the two. Performing normalization processing on the progressive feature representation to obtain the target feature representation may be a layer normalization of the progressive feature representation. The target feature representation corresponds to the finally sampled image tokens.
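A compact sketch of this resampling step, assuming single-head scaled dot-product attention with a residual addition and layer normalization; projection layers, dimensions and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionResample(nn.Module):
    # Sampled tokens are the Query; the originally input image tokens are Key and Value.
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.scale = dim ** -0.5

    def forward(self, sampled_tokens, image_tokens):
        # sampled_tokens: (n_q, dim) initial feature representation
        # image_tokens:   (n_kv, dim) feature representations of all image blocks
        q = self.q_proj(sampled_tokens)
        k = self.k_proj(image_tokens)
        v = self.v_proj(image_tokens)
        attn = torch.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        progressive = sampled_tokens + attn @ v      # feature fusion by residual addition
        return self.norm(progressive)                # target feature representation
```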
The target feature representation can therefore be considered to carry image information extracted from the feature representations of image blocks other than those corresponding to the initial feature representation. Since the number of tokens contained in the Query vector is in general much smaller than the number of originally input image tokens, the attention computation can be greatly reduced. Taking the typical input size 224×224 of the image encoder CLIP-ViT as an example, and assuming that the image to be identified is scaled to a fixed size of 1600×2048 with an image block size (patch_size) of 16×16, the number of tokens in the Query drops from the original (1600/16)×(2048/16) = 12800 to (224/16)×(224/16) = 196, so the whole network can keep high-resolution input under the scene text recognition task. In addition, the target feature representation directly utilizes the pre-trained image encoder CLIP-ViT model, and feature alignment between the feature representations of the image blocks in the target feature representation can be omitted, so that visual instruction fine-tuning can be carried out.
In some embodiments, step S14 described above may include the following steps. First, feature extraction is performed on the target feature representation by using the image encoder to obtain a target encoded representation corresponding to the target feature representation. Then, second dimension transformation processing is performed on the target encoded representation to obtain second dimension features. Next, fusion processing is performed on the second dimension features and a preset prompt word to obtain a target fusion representation, and the text recognition result is determined based on the target fusion representation.
The image encoder is the feature-aligned ViT network in CLIP and is denoted as the CLIP-ViT model. Performing feature extraction on the target feature representation with the image encoder to obtain the corresponding target encoded representation is a process of aligning the features among the feature representations of the image blocks in the target feature representation. In some application scenarios, the feature representation of each image block in the target feature representation may comprise an image vector and a text vector; each image vector and each text vector are mapped into the same modal space, and similarity computation is used to obtain the feature-aligned target encoded representation. The second dimension transformation processing makes the dimension of the second dimension features meet the dimension requirement of the text recognition process, so that the output text recognition result has high accuracy. The preset prompt word may be a prompt, for example "Recognize the text in the image and give the corresponding center coordinates". The fusion processing of the second dimension features and the preset prompt word to obtain the target fusion representation may be a splicing of the second dimension features with the preset prompt word. Determining the text recognition result based on the target fusion representation may be done by inputting the target fusion representation into a pre-trained large language model to obtain the text recognition result. In particular, the large language model may be an LLM; illustratively, open-source pre-trained large language models such as Llama3, OPT or T5 can be employed.
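The tail of step S14 can be sketched as below, with the image encoder, projector and language model treated as stand-in callables rather than a specific library API; the frozen-encoder assumption follows the training description later in the document, and all names are illustrative.

```python
import torch

def recognize_text(target_feats, image_encoder, projector, prompt_embeds, llm):
    # target_feats: target feature representation; prompt_embeds: embedded preset prompt word.
    with torch.no_grad():                                          # CLIP-ViT is pre-trained and frozen
        target_encoding = image_encoder(target_feats)              # target encoded representation
    second_dim_feats = projector(target_encoding)                  # second dimension features
    fusion = torch.cat([second_dim_feats, prompt_embeds], dim=0)   # target fusion representation
    return llm(fusion)                                             # text recognition result from the LLM
```

The projector here plays the role of the second dimension transformation module described below and could be a multi-layer perceptron or a single linear layer.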
In some application scenarios, the acquired image to be identified is scaled, and its size may be expressed as H×W, where H is the height of the image to be identified and W is its width. It will be appreciated that the image to be identified is a scaled image, and the image before scaling may be of high resolution. The size of the image to be identified is a preset size; to keep small-scale object information in the scene from being lost, a high-resolution preset size is used, for example an input size of 1600×2048.
In some application scenarios, the blocking module may be denoted as a Patch Embedding module, and the step of dividing the image to be identified into image blocks to obtain the feature representations of the plurality of image blocks is executed by the blocking module. The image block size (patch size) of the plurality of image blocks output by the blocking module is the same as the image block size in the target feature representation input to the image encoder CLIP-ViT model. The image block size may be expressed as h_p×w_p, where h_p is the height of an image block and w_p is its width. The target feature representation input to the image encoder corresponds to a fixed size, which may be expressed as H_t×W_t, where H_t is the height and W_t is the width of the image corresponding to the target feature representation; the fixed size may be, for example, 224×224. The size of each image block in the image corresponding to the target feature representation may likewise be expressed as h_p×w_p, so the number of feature representations of image blocks in the target feature representation input to the image encoder is (H_t/h_p)×(W_t/w_p).
It is therefore considered that feature alignment between the feature representations of the image blocks in the target feature representation does not need to be performed again, visual instruction fine-tuning is achieved, and the amount of computation is reduced. On the basis that the image to be recognized is a large-scale image input, the accuracy of the text recognition result can also be improved.
In some embodiments, the step of extracting the features of the image to be identified and determining the sampling weight of each image block in the image to be identified in at least one preset direction is performed by a frequency adaptation module, and the text recognition method may further include the steps of firstly, acquiring the label distribution information of the sample image in at least one preset direction. And secondly, carrying out feature extraction on the sample image by utilizing a frequency adaptation module, and determining sample sampling weights of all image blocks in the sample image in at least one preset direction. Then, gaussian distribution processing is carried out on the sample sampling weight, and prediction distribution information is determined. Next, a target loss is determined based on the difference between the labeling distribution information and the prediction distribution information. Then, according to the target loss, parameters of the frequency adaptation module are adjusted.
The sample images contain text information and may come from public scene text datasets, for example the CTW dataset or the ShopSign dataset. The original labeling information in the text dataset is formatted into a sample style, and the sample style has an association relationship with the preset prompt word: one field of the sample style represents the text information and another field represents the center position of the text. When a sample image contains several text targets, a semicolon is used as the separator during formatting. The preset prompt word may be set with reference to the above. The text center coordinates in the sample image at the image scale are projected onto the feature map scale: assuming a text center coordinate of (x, y) at the image scale, its projection coordinates are (x/w_p, y/h_p), where w_p is the width and h_p is the height of the divided image blocks. Taking this projected point as the center, a Gaussian distribution is expanded along the row direction and along the column direction respectively: the expansion over the range of column indices gives a first target distribution value, and the expansion over the range of row indices gives a second target distribution value. When several text targets exist in a single image, the first target distribution values are superimposed and the second target distribution values are superimposed respectively, giving a fused distribution value in at least one preset direction, and the fused distribution value in the at least one preset direction is taken as the labeling distribution information in that direction. The fused distribution values may include a first fused distribution value and a second fused distribution value, corresponding to different preset directions. In some application scenarios, the first target distribution values are superimposed to obtain the first fused distribution value in a first direction, which may be transverse or longitudinal; in other application scenarios, the second target distribution values are superimposed to obtain the second fused distribution value in a second direction, which may be transverse or longitudinal.
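A hedged sketch of building the labeling distribution information from the text centers, assuming an isotropic Gaussian with an arbitrary sigma; the function name, the sigma value and the exact normalization are illustrative assumptions.

```python
import numpy as np

def labeling_distribution(text_centers, img_w, img_h, patch_w, patch_h, sigma=1.0):
    # Project each text center onto the feature-map scale and expand a Gaussian
    # around it along the column and row index ranges; superimpose over all targets.
    cols, rows = img_w // patch_w, img_h // patch_h
    col_dist, row_dist = np.zeros(cols), np.zeros(rows)
    for cx, cy in text_centers:                      # centers at image scale
        px, py = cx / patch_w, cy / patch_h          # projection coordinates on the feature map
        col_dist += np.exp(-((np.arange(cols) - px) ** 2) / (2 * sigma ** 2))  # first target distribution
        row_dist += np.exp(-((np.arange(rows) - py) ** 2) / (2 * sigma ** 2))  # second target distribution
    return col_dist, row_dist                        # fused distribution values per direction
```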
It may be appreciated that the step of extracting features of the sample image by using the frequency adaptation module and determining the sample sampling weight of each image block in the sample image in at least one preset direction may refer to the step of extracting features of the image to be identified and determining the sampling weight of each image block in the image to be identified in at least one preset direction. The sample image may be the same as or similar to the image to be identified; like the image to be identified, the sample image contains text information.
In some application scenarios, the sampling weights and the labeling distribution information in the same direction are directly compared, the target loss is determined based on the differences between them in each direction, and the parameters of the frequency adaptation module are adjusted according to the target loss. In other application scenarios, Gaussian distribution processing is performed on the sample sampling weights in at least one preset direction to determine the prediction distribution information in that direction; the target loss is then determined based on the difference between the labeling distribution information and the prediction distribution information in each same direction, and the parameters of the frequency adaptation module are adjusted according to the target loss. The target loss may be a KL divergence loss.
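A minimal sketch of the KL divergence target loss for one preset direction, assuming the predicted sampling weights are turned into a distribution with a softmax and the labeling distribution is normalized to sum to one; these normalization choices are assumptions.

```python
import torch
import torch.nn.functional as F

def target_loss(sample_weights, labeling_dist):
    # sample_weights: predicted sampling weights in one preset direction (1-D tensor)
    # labeling_dist:  labeling distribution information in the same direction (1-D tensor)
    log_pred = F.log_softmax(sample_weights, dim=-1)           # prediction distribution (log probabilities)
    label = labeling_dist / labeling_dist.sum()                # normalized labeling distribution
    return F.kl_div(log_pred, label, reduction="batchmean")    # KL divergence target loss
```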
In some embodiments, the step of dividing the image to be identified into image blocks to obtain feature representations of a plurality of image blocks is performed by a block dividing module, the step of extracting features of the image to be identified, determining a sampling weight of each image block in the image to be identified in at least one preset direction is performed by a frequency adapting module, the step of sampling the feature representations of each image block along at least one preset direction according to the sampling weight to obtain an initial feature representation is performed by the sampling module, and the text identifying method may further include a training step of the block dividing module, the frequency adapting module and the sampling module, and the training step may include the steps of firstly acquiring a sample image and text labeling information corresponding to the sample image. And secondly, carrying out image block division on the sample image by utilizing a block division module to obtain sample characteristic representations of a plurality of sample image blocks. And then, carrying out feature extraction on the sample image by utilizing a frequency adaptation module, and determining sample sampling weights of each sample image block in the sample image in at least one preset direction. And then, sampling the sample characteristic representation of each image block along at least one preset direction by utilizing a sampling module according to the sample sampling weight to obtain a sample initial characteristic representation. And fusing the sample initial characteristic representation with the sample characteristic representations of all the image blocks by utilizing a sampling module to obtain a sample target characteristic representation. And carrying out text recognition on the sample target characteristic representation to obtain a sample text recognition result about the sample image. Based on the difference between the sample text recognition result and the text labeling information, parameters of the blocking module, the frequency adapting module and the sampling module are adjusted. It can be understood that each step in the training process may refer to the step of performing image block division on the image to be identified in the application process to obtain feature representations of a plurality of image blocks, the step of performing feature extraction on the image to be identified, and determining a sampling weight of each image block in the image to be identified in at least one preset direction, and the step of sampling the feature representations of each image block along at least one preset direction according to the sampling weight to obtain initial feature representations, and the step S13 and the step S14 are not repeated herein.
Wherein a text recognition loss is determined based on the difference between the sample text recognition result and the text labeling information. In some application scenarios, the parameters of the blocking module, the frequency adaptation module and the sampling module are adjusted directly using the text recognition loss. In other application scenarios, the text recognition loss and the target loss are weighted and summed to obtain a final loss, and the parameters of the blocking module, the frequency adaptation module and the sampling module are then adjusted based on the final loss.
In some embodiments, the step of performing text recognition on the sample target feature representation to obtain a sample text recognition result about the sample image may be sequentially performed by the image encoder, the second dimension transformation module and the decoder. Specifically, the step of performing feature extraction on the target feature representation by using the image encoder to obtain the target encoded representation is performed by the image encoder; the step of performing second dimension transformation processing on the target encoded representation to obtain the second dimension features is performed by the second dimension transformation module; and after the step of performing fusion processing on the second dimension features and the preset prompt word to obtain the target fusion representation, the step of determining the text recognition result based on the target fusion representation is performed by the decoder. When the modules are trained, the parameters of the image encoder and the decoder can be frozen in advance and only the remaining modules are adjusted; the remaining modules may be the blocking module, the frequency adaptation module, the sampling module and the second dimension transformation module. Illustratively, in the network training phase, the parameters of the decoder LLM large language model and of the image encoder CLIP-ViT are frozen, and the input sample image is scaled to the preset size. The frequency adaptation module may be denoted as the Domain Adapter module and the sampling module as the Token Resample module. Determining the text recognition loss based on the difference between the sample text recognition result and the text labeling information may be done by comparing the sample text recognition result predicted by the decoder with the text labeling information; specifically, the predicted text token ids in the sample text recognition result are compared with the token ids of the real text in the text labeling information to obtain the text recognition loss. The parameters of the trainable part of the model are updated by gradient back propagation of the text recognition loss until the loss reaches a set threshold, completing the training of the trainable part of the model; the trainable part may be the remaining modules described above.
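A short sketch of the freezing scheme just described; the attribute names (image_encoder, decoder) are illustrative assumptions about how such a model might be organized.

```python
def trainable_parameters(model):
    # Freeze the pre-trained image encoder (CLIP-ViT) and decoder (LLM); the blocking,
    # frequency adaptation, sampling and second dimension transformation modules stay trainable.
    for p in model.image_encoder.parameters():
        p.requires_grad = False
    for p in model.decoder.parameters():
        p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]   # hand these to the optimizer
```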
In some application scenarios, the text recognition loss and the target loss are weighted and summed to obtain a final loss. Specifically, the process of obtaining the final loss may refer to formula (1):
L_final = L_text + λ · L_target        Formula (1);
Wherein L_final denotes the final loss, L_target denotes the target loss, and L_text denotes the text recognition loss; L_text is the loss corresponding to the LLM model and is usually determined with a cross entropy loss function. λ denotes the loss weight and is a preset value, which may be set to 0.1. In other application scenarios, the loss weight may also be set dynamically, for example according to the ratio between the text recognition loss and the target loss: the text recognition loss and the target loss are summed to obtain a loss sum, and the loss weight may be equal to the ratio between the target loss and the loss sum.
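A small sketch of this combination, covering both the fixed weight and the dynamic weight described above; the function signature is illustrative.

```python
def final_loss(text_loss, kl_loss, weight=0.1, dynamic=False):
    # Formula (1): final loss = text recognition loss + loss weight * target loss.
    if dynamic:
        # dynamic variant: the weight equals the target loss's share of the loss sum
        weight = float(kl_loss) / (float(text_loss) + float(kl_loss))
    return text_loss + weight * kl_loss
```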
According to the scheme, the image to be identified is obtained, the feature representations of the image blocks obtained by dividing the image to be identified are subjected to first downsampling to obtain the initial feature representations, the initial feature representations with the smaller feature representation number are fused with the feature representations of the image blocks to obtain the target feature representations, the number of the feature representations contained in the target feature representations is smaller than that of the feature representations contained in the image to be identified, text identification is performed on the target feature representations instead of the image to be identified directly, the number of the feature representations in the text identification process can be reduced, and therefore the text identification efficiency is improved.
Referring to fig. 3, fig. 3 is a flow chart illustrating a text recognition method according to an embodiment of the application.
Step S301 is performed to acquire an image to be recognized. After step S301, the above-mentioned step of performing image block division on the image to be identified to obtain feature representations of a plurality of image blocks may also be performed.
And S302, extracting features of the image to be identified by using a frequency adaptation module, and determining the sampling weight of each image block in the image to be identified in at least one preset direction. It will be appreciated that the above-mentioned steps of dividing the image to be identified into image blocks to obtain the feature representations of the plurality of image blocks and step S302 may be performed in parallel. After the completion of step S302, step S303, step S304, step S305, step S306, step S307, and step S308 may be sequentially performed. And step S303, sampling the characteristic representation of each image block along at least one preset direction by utilizing a sampling module according to the sampling weight to obtain an initial characteristic representation. And step S304, fusing the initial characteristic representation with the characteristic representations of the image blocks by utilizing a sampling module to obtain a target characteristic representation. And step S305, carrying out feature extraction on the target feature representation by using an image encoder to obtain a target coding representation corresponding to the target feature representation. And step S306, performing second dimension transformation processing on the target coding representation by using a second dimension transformation module to obtain second dimension characteristics. And step S307, carrying out fusion processing on the second dimension characteristic and the preset prompt word to obtain a target fusion representation.
And step 308, determining a text recognition result based on the target fusion representation by using the pre-trained large language model. It is to be understood that, the steps S301 to S308 may refer to the above, and are not described herein.
In some application scenarios, the second dimension transformation module may be denoted as a Projector module. The second dimension transformation module performs dimension mapping on the image feature representations (image tokens) in the target encoded representation output by the image encoder so as to adapt them to the input feature dimension required by the large language model LLM; it can adopt a multi-layer perceptron or a single linear transformation as the connecting structure. The second dimension transformation module splices the mapped image tokens in the second dimension features with the text vectors (text tokens) corresponding to the preset prompt word, and the result is used as the input of the large language model LLM, thereby realizing visual instruction fine-tuning. It will be appreciated that the image tokens in the target encoded representation and the image tokens in the second dimension features both contain the image information and text information after feature alignment by the image encoder.
It can be considered that identical or similar text segments may exist in a natural scene, and with the preset prompt word the center coordinates of the text can also be learned, which increases the interpretability of the text recognition result.
Referring to fig. 4, fig. 4 is a flowchart illustrating a text recognition method according to an embodiment of the application.
Steps S401 to S408 are sequentially performed. Step S401, preprocessing the image to be identified to obtain a preprocessed image. Step S402, performing gradient feature extraction and second downsampling processing on the preprocessed image to obtain gradient information in each preset direction and a downsampled image. Step S403, respectively performing normalization processing on the gradient information in each preset direction and on the downsampled image to obtain normalized gradient information in each preset direction and a normalized downsampled image. Step S404, respectively performing fusion processing on the normalized gradient information and the normalized downsampled image to obtain sampling weights corresponding to the preset directions. Step S405, performing gradient fusion processing on the target gradient information and the downsampled image to obtain a first fused image. Step S406, performing initial feature extraction on the first fused image to obtain initial image features. Step S407, performing first dimension transformation processing on the initial image features to obtain first dimension features. Step S408, inputting the first dimension features into the perceptron module for further feature extraction to obtain the target sampling weight. It is to be understood that steps S401 to S408 may refer to the above and are not repeated here. In some application scenarios, the frequency adaptation module may sequentially perform steps S401 to S408 to obtain the target sampling weight, which is the sampling weight in the at least one preset direction.
For example, step S401 may perform grayscale conversion and smoothing on the image to be identified to obtain the preprocessed image; mean filtering or Gaussian filtering may be used for smoothing to reduce noise interference.
Steps S402 and S403 may use wavelet transform processing followed by normalization to obtain the normalized downsampled image and the normalized gradient information. The wavelet transform corresponds to the discrete wavelet transform (dwt) and may process the preprocessed image with a Biorthogonal wavelet function.
The wavelet transform processing includes the gradient feature extraction and the second downsampling processing. The normalized gradient information includes normalized first gradient information and normalized second gradient information, where the normalized first gradient information can be expressed as a horizontal-direction spectrogram and the normalized second gradient information as a vertical-direction spectrogram. After processing with the Biorthogonal wavelet function and normalization, the size of the normalized downsampled image becomes roughly half of the original in each dimension. In some application scenarios, the process from step S404 to step S408 may be to fuse the normalized first gradient information with the normalized downsampled image to obtain the first sampling weight in the transverse direction, and to fuse the normalized second gradient information with the normalized downsampled image to obtain the second sampling weight in the longitudinal direction. For the transverse preset direction, the normalized downsampled image and the normalized first gradient information are spliced along the channel dimension to obtain the first fused image in the transverse direction; for the longitudinal preset direction, the normalized downsampled image and the normalized second gradient information are spliced along the channel dimension to obtain the first fused image in the longitudinal direction. The first fused image in the transverse direction and the first fused image in the longitudinal direction are respectively fed into a feature extraction network ConvNet for initial feature extraction and then pass respectively through two perceptron (MLP) branches, giving the advanced first dimension feature in the transverse direction output by one perceptron module and the advanced first dimension feature in the longitudinal direction output by the other. The advanced first dimension feature in the transverse direction may be a feature representation in the horizontal direction, and the advanced first dimension feature in the longitudinal direction may be a feature representation in the vertical direction; the number of features in each corresponds to the number of image blocks along that direction. The convolution operations in ConvNet during the initial feature extraction all use a large kernel_size and stride to expand the receptive field.
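A hedged sketch of preparing the two fused inputs with a one-level 2-D discrete wavelet transform; the particular biorthogonal wavelet name, the min-max normalization, and which detail band pairs with which direction are assumptions for illustration.

```python
import numpy as np
import pywt
import torch

def frequency_inputs(gray_image):
    # One-level 2-D discrete wavelet transform with a biorthogonal wavelet:
    # cA is the downsampled image; cH / cV serve as the horizontal / vertical spectrograms.
    cA, (cH, cV, cD) = pywt.dwt2(gray_image, "bior2.2")

    def norm(x):
        x = np.abs(x)
        return (x - x.min()) / (x.max() - x.min() + 1e-6)

    down, horiz, vert = norm(cA), norm(cH), norm(cV)
    transverse = torch.from_numpy(np.stack([down, horiz])).float()     # first fused image, transverse
    longitudinal = torch.from_numpy(np.stack([down, vert])).float()    # first fused image, longitudinal
    return transverse, longitudinal   # each fed to its own ConvNet + MLP branch
```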
It will be appreciated that the wavelet transform process analyzes the signal by using wavelet functions of different scales and directions, the result of which includes coefficients in each scale and direction. These coefficients can be used to represent the variation of the signal in different scales and directions, thereby enabling the initial feature representation to have rich features and improving the accuracy of the sampling weights. The sampling weight with high accuracy further improves the accuracy of the target feature representation which is screened out later.
It can be considered that, when natural scene text recognition is performed with a visual multi-modal architecture under high-resolution input, resampling the feature representation tokens of the image blocks to obtain the target feature representation allows text recognition to be carried out directly with the parameters of the pre-trained CLIP-ViT and LLM models, so that within an affordable computation budget small target texts can be recognized and located more accurately and the text recognition result is more accurate.
It can be considered that, although the resolution of the image to be recognized differs from the resolution the image encoder can accept, after feature sampling the target feature representation matches the input resolution of the image encoder, so input adaptation is realized and repeated image-text feature alignment can be omitted. In addition, the parameters of the pre-trained CLIP-ViT and LLM models are frozen during training, so only the remaining modules are trained and the amount of training data required is reduced.
It can be considered that the generalization capability of the LLM can be fully utilized through visual instruction fine-tuning, improving the accuracy of text localization and recognition in natural scenes.
It can be considered that, when the image to be identified is a high-resolution input, small target texts can provide more effective information but much interference information is also introduced; the frequency domain adaptation module is therefore used to explicitly add frequency information of the image in at least one preset direction, specifically frequency information in the row direction and the column direction. This alleviates the background interference problem under high-resolution input and lets the network focus further on the region of interest. In addition, performing text recognition on the target feature representation instead of directly on the image to be recognized reduces the number of feature representations in the text recognition process and thereby improves the text recognition efficiency.
Using the target loss also stabilizes the training of the network and reduces the difficulty of model training.
It can be considered that the image to be identified is decomposed into the row and column directions for importance analysis, and resampling is combined to obtain the corresponding selection feature representations, which may be denoted as Query tokens; this ensures that the computation of the whole network does not increase significantly under high-resolution input. Meanwhile, a trainable two-dimensional position code obtained from a blank image is added to the sampled selection feature representations; the position information of a blank image block in the blank image may be represented as 2D Position Embedding. Position fusion processing is performed on each selection feature representation and its position information to obtain the initial feature representation. The initial feature representation and the feature representations of the image blocks are then input into the cross-attention layer to obtain the feature representation output by the cross-attention layer for further feature fusion, so that the target feature representation carries the feature representations of higher importance in the image to be identified together with their position information.
Referring to fig. 5, fig. 5 is a flowchart illustrating a text recognition method according to an embodiment of the application.
Steps S501 to S504 are sequentially performed. Step S501, sampling the feature representations of the image blocks along at least one preset direction according to the sampling weights to obtain an initial feature representation. Step S502, inputting the initial feature representation and the feature representations of the image blocks into a cross-attention layer to obtain the feature representation output by the cross-attention layer. Step S503, performing feature fusion processing on the initial feature representation and the feature representation output by the cross-attention layer to obtain a progressive feature representation. Step S504, performing normalization processing on the progressive feature representation to obtain a target feature representation. The initial feature representation in step S501 is a feature representation of higher importance in the image to be identified that carries position information. In some application scenarios, the sampling module may sequentially perform steps S501 to S504 to obtain the target feature representation.
In some application scenarios, the sampling module may be denoted as the Token Resample module. The sampling module samples the feature representations (image tokens) of the plurality of image blocks produced by the blocking module, taking the feature representations in the horizontal and vertical directions obtained by the frequency adaptation module, namely the advanced first dimension feature in the transverse direction and the advanced first dimension feature in the longitudinal direction, as the screening basis. Normalization processing or activation function processing is performed respectively on the advanced first dimension feature in the transverse direction and on the advanced first dimension feature in the longitudinal direction to obtain the mapping values of the feature extraction results, which include a mapping value in the transverse direction and a mapping value in the longitudinal direction; the mapping value in the transverse direction is taken as the first sampling weight and the mapping value in the longitudinal direction as the second sampling weight. The mapping values may be obtained by applying a Sigmoid operation to the advanced first dimension features in the transverse and longitudinal directions to get confidence scores, the confidence scores being the mapping values of the feature extraction results. Step S501 may then keep the M1 rows with the highest confidence in the row direction and the M2 columns with the highest confidence in the column direction, finally obtaining the preset number of selection feature representations; M1 and M2 may be set equal to the number of image-block rows and columns, respectively, in the input size of the image encoder. The step of determining the initial feature representation based on the selection feature representations may apply the first screening and the second screening to the trainable two-dimensional position codes as well, obtaining the position information of each selected image block, and splice each selection feature representation with its position information to obtain the initial feature representation.
According to the scheme, the image to be identified is obtained, the feature representations of the image blocks obtained by dividing the image to be identified are subjected to first downsampling to obtain the initial feature representations, the initial feature representations with the smaller feature representation number are fused with the feature representations of the image blocks to obtain the target feature representations, the number of the feature representations contained in the target feature representations is smaller than that of the feature representations contained in the image to be identified, text identification is performed on the target feature representations instead of the image to be identified directly, the number of the feature representations in the text identification process can be reduced, and therefore the text identification efficiency is improved.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a text recognition device according to an embodiment of the application. The text recognition device 60 includes an acquisition module 61, a sampling module 62, a fusion module 63 and a recognition module 64. The acquisition module 61 is used for acquiring an image to be recognized; the sampling module 62 is used for performing first downsampling on the feature representations of a plurality of image blocks obtained by dividing the image to be recognized to obtain an initial feature representation, where the combination of the feature representations of the plurality of image blocks is the feature representation of the image to be recognized and the initial feature representation carries at least part of the feature representations of the image blocks; the fusion module 63 is used for fusing the initial feature representation with the feature representations of the image blocks to obtain a target feature representation, where the number of feature representations of image blocks contained in the target feature representation is the same as the number contained in the initial feature representation; and the recognition module 64 is used for performing text recognition on the target feature representation to obtain a text recognition result about the image to be recognized.
According to the scheme, the image to be identified is obtained, the feature representations of the image blocks obtained by dividing the image to be identified are subjected to first downsampling to obtain the initial feature representations, the initial feature representations with the smaller feature representation number are fused with the feature representations of the image blocks to obtain the target feature representations, the number of the feature representations contained in the target feature representations is smaller than that of the feature representations contained in the image to be identified, text identification is performed on the target feature representations instead of the image to be identified directly, the number of the feature representations in the text identification process can be reduced, and therefore the text identification efficiency is improved.
For the functions executed by each module, please refer to the text recognition method described above, which is not repeated here.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the application. The electronic device 70 comprises a memory 71 and a processor 72, the processor 72 being arranged to execute program instructions stored in the memory 71 for carrying out the steps of the above described embodiments of the text recognition method. In one specific implementation scenario, electronic device 70 may include, but is not limited to, a multi-camera device, a microcomputer, a server, and further, electronic device 70 may also include a mobile device such as a notebook computer, a tablet computer, etc., without limitation.
In particular, the processor 72 is adapted to control itself and the memory 71 to implement the steps in the text recognition method embodiments described above. The processor 72 may also be referred to as a CPU (Central Processing Unit). The processor 72 may be an integrated circuit chip having signal processing capabilities. The processor 72 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 72 may be commonly implemented by an integrated circuit chip.
According to the scheme, the image to be identified is obtained, the feature representations of the image blocks obtained by dividing the image to be identified are subjected to first downsampling to obtain the initial feature representations, the initial feature representations with the smaller feature representation number are fused with the feature representations of the image blocks to obtain the target feature representations, the number of the feature representations contained in the target feature representations is smaller than that of the feature representations contained in the image to be identified, text identification is performed on the target feature representations instead of the image to be identified directly, the number of the feature representations in the text identification process can be reduced, and therefore the text identification efficiency is improved.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an embodiment of a computer readable storage medium according to the present application. A computer readable storage medium 80 having stored thereon program instructions 801, which program instructions 801 when executed by a processor implement the steps of any of the text recognition method embodiments described above.
According to the scheme, the image to be identified is obtained, the feature representations of the image blocks obtained by dividing the image to be identified are subjected to first downsampling to obtain the initial feature representations, the initial feature representations with the smaller feature representation number are fused with the feature representations of the image blocks to obtain the target feature representations, the number of the feature representations contained in the target feature representations is smaller than that of the feature representations contained in the image to be identified, text identification is performed on the target feature representations instead of the image to be identified directly, the number of the feature representations in the text identification process can be reduced, and therefore the text identification efficiency is improved.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present application may be used to perform the methods described in the foregoing method embodiments; for their specific implementations, reference may be made to the descriptions of the foregoing method embodiments, which are not repeated here for brevity.
The foregoing description of the various embodiments tends to emphasize the differences between the embodiments; for the parts that are the same or similar, reference may be made to one another, and they are not repeated here for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of modules or units is only a logical functional division, and there may be other divisions in actual implementation; for instance, units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical, or other forms.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

Claims (9)

1. A text recognition method, characterized in that the method comprises:
acquiring an image to be identified;
performing first downsampling on feature representations of a plurality of image blocks obtained by dividing the image to be identified, to obtain an initial feature representation, wherein a combination of the feature representations of the plurality of image blocks constitutes a feature representation of the image to be identified, and the initial feature representation carries the feature representations of at least part of the image blocks;
fusing the initial feature representation with the feature representations of the plurality of image blocks to obtain a target feature representation, wherein the number of image block feature representations contained in the target feature representation is the same as the number of image block feature representations contained in the initial feature representation;
performing text recognition on the target feature representation to obtain a text recognition result for the image to be identified;
wherein performing the first downsampling on the feature representations of the plurality of image blocks obtained by dividing the image to be identified, to obtain the initial feature representation, comprises:
performing feature extraction on the image to be identified, and determining a sampling weight of each image block of the image to be identified in at least one preset direction, wherein the sampling weight is a mapping value obtained by normalizing or applying an activation function to the feature extraction result corresponding to each image block of the image to be identified;
sampling the feature representations of the image blocks along the at least one preset direction according to the sampling weights to obtain the initial feature representation, which comprises: sorting the feature representations of the image blocks by sampling weight to obtain an importance ranking of the feature representations of the image blocks; screening the feature representations of the image blocks in each preset direction based on the importance ranking to obtain candidate feature representations; and determining the initial feature representation based on the candidate feature representations.

2. The method according to claim 1, characterized in that the at least one preset direction is a plurality of preset directions, and performing feature extraction on the image to be identified and determining the sampling weight of each image block of the image to be identified in the at least one preset direction comprises:
performing gradient feature extraction and second downsampling on the image to be identified to obtain gradient information in each preset direction and a downsampled image;
fusing each piece of gradient information with the downsampled image to obtain the sampling weight corresponding to each preset direction.

3. The method according to claim 2, characterized in that the at least one preset direction includes a horizontal direction and a vertical direction, each piece of gradient information includes first gradient information in the horizontal direction and second gradient information in the vertical direction, and fusing each piece of gradient information with the downsampled image to obtain the sampling weight corresponding to each preset direction comprises:
taking the first gradient information as target gradient information and the sampling weight corresponding to the horizontal direction as a target sampling weight, or taking the second gradient information as target gradient information and the sampling weight corresponding to the vertical direction as the target sampling weight;
performing gradient fusion on the target gradient information and the downsampled image to obtain a first fused image;
performing initial feature extraction on the first fused image to obtain initial image features;
performing a first dimensional transformation on the initial image features to obtain first dimensional features;
inputting the first dimensional features into a perceptron module for advanced feature extraction to obtain the target sampling weight.

4. The method according to claim 2, characterized in that the at least one preset direction includes a horizontal direction and a vertical direction, the sampling weight corresponding to the horizontal direction is taken as a first sampling weight, the sampling weight corresponding to the vertical direction is taken as a second sampling weight, and sampling the feature representations of the image blocks along the at least one preset direction according to the sampling weights to obtain the initial feature representation comprises:
acquiring a first selection number of image blocks of the image to be identified in the horizontal direction and a second selection number of image blocks of the image to be identified in the vertical direction;
performing a first screening on the feature representations of the image blocks in each horizontal direction according to the first sampling weight to obtain the first selection number of candidate horizontal feature representations; and performing a first screening on the feature representations of the image blocks in each vertical direction according to the second sampling weight to obtain the second selection number of candidate vertical feature representations;
performing a second screening on the candidate horizontal feature representations and the candidate vertical feature representations to obtain a preset number of selected feature representations, the preset number being equal to the product of the first selection number and the second selection number;
determining the initial feature representation based on the selected feature representations.

5. The method according to claim 4, characterized in that determining the initial feature representation based on the selected feature representations comprises:
acquiring position information corresponding to each selected feature representation, the position information indicating the positional relationship between the selected feature representation and the image blocks of the image to be identified;
performing position fusion on the selected feature representations and the position information to obtain the initial feature representation.

6. The method according to any one of claims 1 to 5, characterized in that the step of performing feature extraction on the image to be identified and determining the sampling weight of each image block of the image to be identified in the at least one preset direction is performed by a frequency adaptation module, and the method further comprises:
acquiring a sample image and annotated distribution information of the sample image in the at least one preset direction;
performing feature extraction on the sample image using the frequency adaptation module, and determining a sample sampling weight of each image block of the sample image in the at least one preset direction;
performing Gaussian distribution processing on the sample sampling weights to determine predicted distribution information;
determining a target loss based on the difference between the annotated distribution information and the predicted distribution information;
adjusting parameters of the frequency adaptation module according to the target loss.

7. The method according to any one of claims 1 to 5, characterized in that fusing the initial feature representation with the feature representations of the plurality of image blocks to obtain the target feature representation comprises:
inputting the initial feature representation and the feature representations of the plurality of image blocks into a cross-attention layer to obtain a feature representation output by the cross-attention layer, wherein the initial feature representation serves as the query vector of the cross-attention layer, and the feature representations of the plurality of image blocks serve as the key vector and the value vector of the cross-attention layer;
performing feature fusion on the initial feature representation and the feature representation output by the cross-attention layer to obtain an advanced feature representation;
normalizing the advanced feature representation to obtain the target feature representation.

8. An electronic device, characterized by comprising a memory and a processor, wherein the memory stores program instructions, and the processor retrieves the program instructions from the memory to execute the method according to any one of claims 1 to 7.

9. A computer readable storage medium, characterized in that it stores a program file, and the program file, when executed by a processor, is used to implement the method according to any one of claims 1 to 7.
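As an informal illustration (not part of the claims), the training objective sketched in claim 6 — turning predicted sampling weights into a distribution via Gaussian processing and comparing it against annotated distribution information — might look as follows. The choice of a Gaussian smoothing kernel, KL divergence as the measure of difference, and all names and shapes are assumptions made purely for explanation.

```python
# Hedged sketch of a claim-6-style objective; the actual loss and smoothing used
# by the patent are not specified here and are assumed for illustration.
import torch
import torch.nn.functional as F

def gaussian_smooth(weights: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Spread raw per-block sampling weights into a smooth probability
    # distribution along one preset direction (e.g. the horizontal axis).
    pos = torch.arange(weights.size(-1), dtype=weights.dtype)
    kernel = torch.exp(-0.5 * ((pos[None, :] - pos[:, None]) / sigma) ** 2)
    kernel = kernel / kernel.sum(dim=-1, keepdim=True)
    smoothed = weights @ kernel
    return smoothed / smoothed.sum(dim=-1, keepdim=True)

def distribution_loss(sample_weights: torch.Tensor,
                      annotated_dist: torch.Tensor) -> torch.Tensor:
    # Target loss: difference between the predicted and annotated distributions,
    # expressed here as a KL divergence.
    predicted = gaussian_smooth(sample_weights)
    return F.kl_div(predicted.log(), annotated_dist, reduction="batchmean")

# Usage: 32 positions along one direction, batch of 2 sample images.
pred = torch.rand(2, 32, requires_grad=True)
target = torch.softmax(torch.randn(2, 32), dim=-1)
distribution_loss(pred, target).backward()   # gradients flow back to the sampling weights
```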
CN202410996958.4A 2024-07-24 2024-07-24 Text recognition method, electronic device and storage medium Active CN118522019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410996958.4A CN118522019B (en) 2024-07-24 2024-07-24 Text recognition method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410996958.4A CN118522019B (en) 2024-07-24 2024-07-24 Text recognition method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN118522019A CN118522019A (en) 2024-08-20
CN118522019B true CN118522019B (en) 2024-11-29

Family

ID=92279824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410996958.4A Active CN118522019B (en) 2024-07-24 2024-07-24 Text recognition method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN118522019B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117011533A (en) * 2023-08-30 2023-11-07 腾讯科技(深圳)有限公司 Noise image recognition method and related equipment
CN117893859A (en) * 2023-12-12 2024-04-16 微民保险代理有限公司 Multi-mode text image classification method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117499671A (en) * 2018-08-27 2024-02-02 华为技术有限公司 Video image prediction method and device
CN116092096A (en) * 2021-11-05 2023-05-09 同方威视技术股份有限公司 Method, system, equipment and medium for verifying the authenticity of declared information
CN117746047A (en) * 2022-09-21 2024-03-22 华为技术有限公司 Image processing method and related equipment thereof
CN116343235A (en) * 2023-02-13 2023-06-27 科大讯飞股份有限公司 Text recognition method, device, equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117011533A (en) * 2023-08-30 2023-11-07 腾讯科技(深圳)有限公司 Noise image recognition method and related equipment
CN117893859A (en) * 2023-12-12 2024-04-16 微民保险代理有限公司 Multi-mode text image classification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN118522019A (en) 2024-08-20

Similar Documents

Publication Publication Date Title
US11138423B2 (en) Region proposal networks for automated bounding box detection and text segmentation
RU2661750C1 (en) Symbols recognition with the use of artificial intelligence
CN112052839B (en) Image data processing method, apparatus, device and medium
US10354168B2 (en) Systems and methods for recognizing characters in digitized documents
US10032072B1 (en) Text recognition and localization with deep learning
JP4142463B2 (en) System and method for facilitating pattern recognition
WO2019089578A1 (en) Font identification from imagery
CN112036395A (en) Text classification identification method and device based on target detection
CN110555433A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
US20220398399A1 (en) Optical character recognition systems and methods for personal data extraction
CN111104941B (en) Image direction correction method and device and electronic equipment
CN108846385B (en) Image identification and correction method and device based on convolution-deconvolution neural network
CN114092938B (en) Image recognition processing method and device, electronic equipment and storage medium
CN112464798A (en) Text recognition method and device, electronic equipment and storage medium
CN111414888A (en) Low-resolution face recognition method, system, device and storage medium
CN109299663A (en) Hand-written script recognition methods, system and terminal device
CN112926564B (en) Picture analysis method, system, computer device and computer readable storage medium
CN114821620A (en) Text content extraction and identification method based on longitudinal combination of line text boxes
CN114049646B (en) Bank card identification method and device, computer equipment and storage medium
CN112749576B (en) Image recognition method and device, computing equipment and computer storage medium
CN111738248B (en) Character recognition method, training method of character decoding model and electronic equipment
CN118522019B (en) Text recognition method, electronic device and storage medium
CN114782853A (en) Video data processing method, device, computer equipment and storage medium
CN117391201A (en) Question and answer methods, devices and electronic equipment
CN116978030A (en) Text information recognition method and training method of text information recognition model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant