CN118262364A - Image recognition method, apparatus, device, medium, and program product - Google Patents
Image recognition method, apparatus, device, medium, and program product
- Publication number
- CN118262364A (Application No. CN202410353620.7A)
- Authority
- CN
- China
- Prior art keywords
- image
- feature map
- scale
- target
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/15—Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19147—Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/1918—Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Image Analysis (AREA)
Abstract
The present disclosure provides an image recognition method, which may be applied to the fields of artificial intelligence and financial technology. The image recognition method comprises: acquiring an image to be identified; performing feature extraction on the image to be identified using a pre-trained image recognition model to obtain multi-scale feature maps, wherein a deformable convolution kernel is arranged in the feature extraction network of the image recognition model, the feature map of each scale undergoes a primary feature fusion with the pooled feature map, and the multi-scale feature maps undergo multiple secondary feature fusions across different scales; determining at least one candidate target frame within the multi-scale feature maps; and segmenting according to the candidate target frame to obtain the target in the image to be identified. The present disclosure also provides an image recognition apparatus, device, storage medium, and program product.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence and financial technology, and more particularly to the field of image recognition, and more particularly to an image recognition method, apparatus, device, medium, and program product.
Background
Image recognition is an important area of artificial intelligence; it refers to techniques that use computers to process, analyze, and understand images in order to identify targets and objects of various patterns. It is a specific application of pattern recognition technology in the image field, and its mathematical nature is a mapping problem from pattern space to category space. Image recognition technology has developed through the stages of character recognition, digital image processing and recognition, object recognition, and so on.
In banking, there is a need to identify various images, such as check images. To ensure payment accuracy and compliance, processing the various pieces of information on a handwritten check image often requires significant time and labor, so automatic processing of such information is critical to improving efficiency, reducing errors, and detecting fraud.
However, existing algorithms are often unstable when confronted with diverse check layouts, blurred images, or undersized images, suffering from false detections, missed detections, and difficulty in handling small check regions. Erroneous results are frequently output, especially when handwritten checks are processed or image quality is degraded.
Disclosure of Invention
In view of the foregoing, embodiments of the present disclosure provide image recognition methods, apparatuses, devices, media, and program products that improve image recognition accuracy.
According to a first aspect of the present disclosure, there is provided an image recognition method including: acquiring an image to be identified; performing feature extraction on the image to be identified using a pre-trained image recognition model to obtain multi-scale feature maps, wherein a deformable convolution kernel is arranged in the feature extraction network of the image recognition model, the feature map of each scale undergoes a primary feature fusion with the pooled feature map, and the multi-scale feature maps undergo multiple secondary feature fusions across different scales; determining at least one candidate target frame within the multi-scale feature maps; and segmenting according to the candidate target frame to obtain the target in the image to be identified.
According to an embodiment of the present disclosure, performing feature extraction on the image to be identified using the pre-trained image recognition model to obtain the multi-scale feature maps includes: acquiring an initial feature map of the image to be identified; performing a pooling operation on the initial feature map to obtain a pooled feature map; performing feature extraction on the initial feature map using a plurality of deformable convolution kernels to obtain multi-scale convolution feature maps; performing feature fusion between the pooled feature map and each of the multi-scale convolution feature maps to obtain multi-scale first fused feature maps; performing a scale transformation on the first fused feature map of each scale to obtain a first transformed-scale feature map; re-fusing the first transformed-scale feature map with the first fused feature map of the same scale; and repeating the scale transformation and re-fusion steps multiple times to obtain the multi-scale feature maps.
According to an embodiment of the present disclosure, at least two of the scale transformations are of different types.
In accordance with an embodiment of the present disclosure, determining at least one candidate target frame within the multi-scale feature maps comprises: determining a plurality of annotation boxes of the multi-scale feature maps; performing target selection on the multi-scale feature maps using a sliding window to obtain a plurality of anchor boxes; and determining at least one candidate target frame according to the offsets between the anchor boxes and the annotation boxes, wherein the candidate target frame characterizes the classification probability and the position offset of the target.
According to an embodiment of the present disclosure, segmenting the target within the image to be identified according to the candidate target frame includes: performing target classification on the multi-scale feature maps to obtain multiple groups of target types; determining, according to the target types, an initial bounding box for each type of target within the candidate target frames; and performing mask segmentation on the initial bounding boxes to determine the target in the image to be identified.
According to an embodiment of the present disclosure, before the target classification of the multi-scale feature maps, the image recognition method further includes: unifying the sizes of the multi-scale feature maps to obtain a target feature map, and performing the target classification on the target feature map to obtain the multiple groups of target types.
According to an embodiment of the present disclosure, the multi-scale feature maps include multi-scale feature maps in which candidate target frames have been identified and multi-scale feature maps in which no candidate target frame has been identified.
According to an embodiment of the present disclosure, before the feature extraction of the image to be identified using the pre-trained image recognition model, the image recognition method further includes: preprocessing the image to be identified, wherein the preprocessing comprises any one or more of wavelet transformation, affine transformation and histogram equalization.
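For illustration only, a minimal sketch of such a preprocessing step using OpenCV and PyWavelets is given below; the rotation angle, wavelet basis, and order of operations are assumptions rather than limitations of this embodiment.

```python
# Minimal preprocessing sketch (assumes an 8-bit grayscale input).
import cv2
import numpy as np
import pywt

def preprocess(gray: np.ndarray) -> np.ndarray:
    # Histogram equalization to normalize contrast.
    eq = cv2.equalizeHist(gray)

    # Affine transformation, e.g. a small rotation correction (angle is illustrative).
    h, w = eq.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), 2.0, 1.0)
    warped = cv2.warpAffine(eq, M, (w, h))

    # Single-level 2D wavelet decomposition; keep only the low-frequency band to suppress noise.
    cA, (cH, cV, cD) = pywt.dwt2(warped.astype(np.float32), "haar")
    denoised = pywt.idwt2((cA, (None, None, None)), "haar")

    return np.clip(denoised, 0, 255).astype(np.uint8)[:h, :w]
```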
According to an embodiment of the present disclosure, training of an image recognition model includes: and carrying out iterative optimization on the image recognition model by adopting a joint loss function, wherein the joint loss function comprises target classification loss, bounding box regression loss and mask segmentation loss.
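As a non-limiting illustration, a joint loss of this kind can be assembled as follows (PyTorch); the particular loss functions and the equal weighting shown here are assumptions, not the exact loss of the embodiment.

```python
import torch
import torch.nn.functional as F

def joint_loss(cls_logits, cls_targets,
               box_preds, box_targets,
               mask_logits, mask_targets,
               w_cls=1.0, w_box=1.0, w_mask=1.0):
    # Target classification loss (cross-entropy over predicted classes).
    loss_cls = F.cross_entropy(cls_logits, cls_targets)
    # Bounding-box regression loss (smooth L1 on box offsets).
    loss_box = F.smooth_l1_loss(box_preds, box_targets)
    # Mask segmentation loss (per-pixel binary cross-entropy).
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    return w_cls * loss_cls + w_box * loss_box + w_mask * loss_mask
```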
According to an embodiment of the present disclosure, the image recognition method is applied to recognition of check images.
According to an embodiment of the present disclosure, the image to be identified includes personal information of the user, and the image recognition method further includes: obtaining the user's authorization to acquire the user's personal information, and acquiring the image to be identified only after that authorization has been obtained.
A second aspect of the present disclosure provides an image recognition apparatus, comprising: an acquisition module for acquiring the image to be identified; an extraction module for performing feature extraction on the image to be identified using a pre-trained image recognition model to obtain multi-scale feature maps, wherein a deformable convolution kernel is arranged in the feature extraction network of the image recognition model, the feature map of each scale undergoes a primary feature fusion with the pooled feature map, and the multi-scale feature maps undergo multiple secondary feature fusions across different scales; a determination module for determining at least one candidate target frame within the multi-scale feature maps; and a segmentation module for segmenting the target in the image to be identified according to the candidate target frame.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the image recognition method described above.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described image recognition method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described image recognition method.
Compared with the prior art, the image recognition method, the device, the electronic equipment, the storage medium and the program product provided by the disclosure have at least the following beneficial effects:
(1) The image recognition method combines deformable convolution with multiple rounds of feature fusion, applied both to feature maps of the same scale and to feature maps of different scales. This greatly improves the accuracy of feature extraction and makes the method suitable for recognizing hard-to-distinguish content such as blurred images, small-size images, and handwriting.
(2) In the image recognition method, deformable convolution kernels are used to extract features from the initial feature map, so that information about complex structures and shapes in the image can be captured more flexibly, improving recognition accuracy. Meanwhile, a pyramid network fuses feature maps of different scales multiple times, compensating for the information loss across scales and further improving recognition accuracy.
(3) According to the image recognition method, the target contour in the candidate target frame is accurately segmented through the mask segmentation method, so that effective target information in the image to be recognized is accurately obtained.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1A schematically illustrates an application scenario diagram of an image recognition method, apparatus, device, medium, and program product according to an embodiment of the present disclosure; FIG. 1B schematically illustrates an image to be identified according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of an image recognition method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a method of feature extraction of an image to be identified according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a flow chart of a method of determining candidate target boxes according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a method flow diagram for segmenting a target within an image to be identified in accordance with an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of an image recognition method according to another embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart of an image recognition method according to yet another embodiment of the present disclosure;
FIG. 8A schematically illustrates a block diagram of an image recognition model according to an embodiment of the present disclosure; FIG. 8B schematically illustrates a flow diagram of image preprocessing according to an embodiment of the present disclosure; fig. 8C schematically illustrates a block diagram of a residual block according to an embodiment of the present disclosure; FIG. 8D schematically illustrates a block diagram of a feature pyramid network in accordance with an embodiment of the present disclosure;
FIG. 9 schematically illustrates a flow chart of a method of training an image recognition model in accordance with an embodiment of the present disclosure;
FIG. 10 schematically illustrates a flow chart of an image recognition method according to yet another embodiment of the present disclosure;
fig. 11 schematically illustrates a block diagram of a structure of an image recognition apparatus according to an embodiment of the present disclosure; and
Fig. 12 schematically illustrates a block diagram of an electronic device adapted to implement an image recognition method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a convention should be interpreted in accordance with the meaning of one of skill in the art having generally understood the convention (e.g., "a system having at least one of A, B and C" would include, but not be limited to, systems having a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).
Embodiments of the present disclosure provide an image recognition method, apparatus, device, medium, and program product, which may be used in the financial field or other fields. It should be noted that the image recognition method, apparatus, device, medium and program product of the present disclosure may be used in the financial field, and may also be used in any field other than the financial field, and the application fields of the image recognition method, apparatus, device, medium and program product of the present disclosure are not limited.
In the technical solution of the present disclosure, the related user information (including, but not limited to, user personal information, user image information, and user equipment information such as location information) and data (including, but not limited to, data for analysis, stored data, and displayed data) are information and data authorized by the user or sufficiently authorized by each party. The related data is collected, stored, used, processed, transmitted, provided, disclosed, and applied in compliance with relevant laws, regulations, and standards; necessary security measures are taken; public order and good morals are not violated; and corresponding operation entries are provided for the user to choose to authorize or refuse.
In the scenario of using personal information to make an automated decision, the method, the device and the system provided by the embodiment of the disclosure provide corresponding operation inlets for users, so that the users can choose to agree or reject the automated decision result; if the user selects refusal, the expert decision flow is entered. The expression "automated decision" here refers to an activity of automatically analyzing, assessing the behavioral habits, hobbies or economic, health, credit status of an individual, etc. by means of a computer program, and making a decision. The expression "expert decision" here refers to an activity of making a decision by a person who is specializing in a certain field of work, has specialized experience, knowledge and skills and reaches a certain level of expertise.
The embodiment of the disclosure provides an image recognition method, which comprises: acquiring an image to be identified; performing feature extraction on the image to be identified using a pre-trained image recognition model to obtain multi-scale feature maps, wherein a deformable convolution kernel is arranged in the feature extraction network of the image recognition model, the feature map of each scale undergoes a primary feature fusion with the pooled feature map, and the multi-scale feature maps undergo multiple secondary feature fusions across different scales; determining at least one candidate target frame within the multi-scale feature maps; and segmenting according to the candidate target frame to obtain the target in the image to be identified. The method can greatly improve the accuracy of feature extraction and is suitable for recognizing hard-to-distinguish content such as blurred images, small-size images, and handwriting.
Fig. 1A schematically illustrates an application scenario diagram of an image recognition method, apparatus, device, medium and program product according to an embodiment of the present disclosure. Fig. 1B schematically illustrates an image to be identified according to an embodiment of the present disclosure.
As shown in fig. 1A, the application scenario 100 according to this embodiment may include an image 110 to be recognized, an image recognition model 120, and an image recognition result 130. The image recognition model 120 includes, for example, a feature extraction network 121, a region generation network 122, and a feature classification and segmentation network 123. The feature extraction network 121 is used to extract target features in the image 110 to be identified and output a feature map. The region generation network 122 is configured to determine a target region (e.g., a target box) of the extracted feature map. The feature classification and segmentation network 123 is used for classifying and segmenting the feature map with the determined target frame, and finally outputting the image recognition result 130.
The image 110 to be identified may be various types of images, such as a binary image: the two-dimensional matrix of such an image consists of only two values, 0 and 1, where "0" represents black and "1" represents white. Since the value of each pixel is only two possible, the data type of the binary image in the computer is usually 1 binary bit. It is mainly used for the scanning recognition (OCR) of characters and line drawings and the storage of mask images.
Gray scale image: the matrix elements of the gray scale image typically range from 0 to 255. Where "0" represents pure black and "255" represents pure white, with the middle number representing a different level of gray, forming a black to white transition. In some software, the gray scale image may also be represented using a double precision data type, with a pixel value range of 0 to 1, where 0 represents black, 1 represents white, and a decimal between 0 and 1 represents a different gray scale.
Index image: the file structure of the index image is relatively complex, and there is a two-dimensional array called a color index matrix MAP, in addition to the two-dimensional matrix containing the image.
True color RGB image: RGB images are a way of representing color images, which are different from index images, but are also used to present color information.
The feature extraction network 121 may include one or more of the following networks, depending on the characteristics of the image to be processed, such as convolutional neural networks (Convolutional Neural Networks, CNN). CNNs are among the most widely used feature extraction networks, particularly in image recognition; hierarchical features are extracted from the image through structures such as convolution layers and pooling layers. CNN architectures include, for example, AlexNet, VGG, ResNet, Inception, and EfficientNet.
A feature pyramid network (Feature Pyramid Network, FPN). The FPN is a feature extractor that aims to improve accuracy and speed by combining bottom-up and top-down pathways to generate a pyramid of feature maps. The FPN is not only used for target detection but can also be combined with other network structures, such as the RPN, Fast R-CNN, or Faster R-CNN, to improve target detection performance.
A recurrent neural network (Recurrent Neural Networks, RNN) and variants thereof. RNNs are suitable for processing sequence data, such as text or time series, and extract features by capturing time-dependent relationships in the sequence. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) are two RNN variants that better address the long-term dependency problem.
Generative Adversarial Networks (GANs). A GAN consists of a generator and a discriminator that extract and generate data features by competing against each other. GANs excel at tasks such as image generation and super-resolution reconstruction and can also be used for feature extraction.
A self-encoder (autoencoder). A self-encoder is a neural network for unsupervised learning that learns a compressed representation of the data through an encoder-decoder structure. Variants such as convolutional autoencoders and sparse autoencoders may be applied to feature extraction of images or specific types of data.
A Transformer, which captures global dependencies in the input data through a self-attention mechanism, thereby extracting useful features.
A deep belief network (Deep Belief Networks, DBN). A DBN is a generative model built by stacking multiple restricted Boltzmann machines (RBMs) and can be used for feature extraction and classification.
It will be appreciated that other specifically designed feature extraction networks are possible, such as private networks for use in the fields of face recognition, medical image analysis, remote sensing image analysis, etc., depending on the requirements of a particular task or dataset. Each network has its unique advantages and applicable scenarios, and the choice of which network depends on the specific application requirements and data characteristics. In practical applications, these network structures may also be combined or modified to accommodate different task requirements.
The region generation network 122 may be, for example, an RPN (Region Proposal Network), one of the key components for target detection in deep learning. Its main function is to generate candidate target regions for target detection in subsequent networks. The RPN may be a convolutional neural network (CNN) model whose input is a convolutional feature map, typically an intermediate-layer output of a convolutional neural network over the entire image. The output of the RPN includes two parts: coordinate correction information for the candidate frames and, for each candidate frame, a score indicating whether it contains a target.
The feature classification and segmentation network 123 may also include one or more of the following networks, according to the processing requirements of the feature map. For example, the feature classification networks include: VGG networks, which come in several configurations that differ mainly in the number of layers or in the structure of layers at the same depth; VGG networks are trained on large data sets to extract features for classification.
ResNet networks: ResNet introduces the concept of residual learning and alleviates the vanishing- and exploding-gradient problems of deep neural network training by constructing deep residual networks. ResNet networks perform well in image classification tasks and are often used for other computer vision tasks.
The feature segmentation networks include: FCN (Fully Convolutional Network), originally used for semantic segmentation tasks. By replacing the conventional fully connected layers with convolutional layers, it allows input images of arbitrary size. The FCN restores the low-resolution feature map to the resolution of the input image by an upsampling operation, generating a dense segmentation result.
U-Net: an encoder-decoder network with skip connections, used in fields such as medical image segmentation. The structure is symmetric: the encoder extracts features, and the decoder gradually restores image details.
SegNet: also an encoder-decoder network, but unlike U-Net, SegNet uses the indices of its max-pooling layers for upsampling to restore resolution.
DeepLab: uses dilated (atrous) convolution to enlarge the receptive field and introduces the ASPP (Atrous Spatial Pyramid Pooling) module to capture multi-scale information simultaneously. This enables DeepLab to handle objects of different sizes in image segmentation tasks.
Mask segmentation, also known as instance segmentation, is a computer vision task that aims to identify different objects in an image and to generate an accurate pixel level Mask for each object. This technique combines object detection and semantic segmentation, not only to identify objects in the image, but also to distinguish the exact boundaries of each object.
In addition, there are networks that combine classification and segmentation functions, such as RefineNet, which uses an encoder-decoder structure and introduces RefineNet modules in the decoding stage for residual stacking and feature fusion, achieving more accurate classification and segmentation.
It should be noted that these network types are not independent of each other, and they can be mutually referenced and combined to adapt to different tasks and data sets. In practical applications, it is important to select a suitable network structure and adjust and optimize the network structure according to specific requirements.
It is understood that the image recognition model 120 may be provided with other functions such as preprocessing, judging matching, etc., in addition to the feature extraction network 121, the region generation network 122, and the feature classification and segmentation network 123. The preprocessing mainly refers to denoising, smoothing, transformation and other operations in image processing, so that important characteristics of the image are enhanced.
As shown in fig. 1B, the image recognition method of the embodiment of the present disclosure is used, for example, to recognize a check image. The check image is identified by the image identification model 120, and an image identification result 130, such as date, payee, amount, payment account number and corresponding detailed information thereof, seal, etc., can be obtained.
First, the related terms of the embodiments of the present disclosure are explained as follows:
Data augmentation: the invention adopts wavelet transformation, histogram equalization, affine transformation and other modes to preprocess check image training data. The method can enable the model to more reliably identify important information areas in check images, is not affected by angles, resolution and noise, and improves accuracy of the model under the condition of processing different check images.
Deformable convolution: the deformable convolution is a convolution neural network variant which allows the convolution kernel shape to change with the position, and the convolution kernel shape can be adaptively adjusted to better capture key features on check images. The method is very useful in processing irregular shapes in check images, and can improve the accuracy of the convolutional neural network in check image region detection tasks.
Multi-scale feature transformation: multi-scale feature transformation is an important technique in check image area detection for handling features of different sizes and shapes. By constructing an image pyramid, extracting multi-scale features and finally fusing the multi-scale features, the loss of features with various sizes and shapes on check images can be effectively reduced, and the performance and the robustness of the detection system are further improved.
The image recognition method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 10 based on the scene described in fig. 1.
Fig. 2 schematically illustrates a flowchart of an image recognition method according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 2, the image recognition method of this embodiment includes, for example, operations S210 to S240, and the image recognition method may be executed by a computer program on corresponding computer hardware.
In operation S210, an image to be recognized is acquired.
In operation S220, feature extraction is performed on the image to be identified by using a pre-trained image recognition model to obtain a multi-scale feature map, a deformable convolution kernel is arranged in a feature extraction network of the image recognition model, the feature map of each scale and the pooled feature map undergo primary feature fusion, and the multi-scale feature map undergoes secondary feature fusion of different scales for multiple times.
At operation S230, at least one candidate target frame within the multi-scale feature map is determined.
In operation S240, the object in the image to be identified is obtained by segmentation according to the candidate object frame.
In some embodiments, the image to be identified is first acquired. The image to be identified may be a color photograph containing a plurality of objects, such as a street scene graph containing cars, pedestrians, and buildings.
And then extracting the characteristics of the image to be identified by adopting a pre-trained image identification model. The image recognition model is a deep neural network, such as ResNet or VGG network. The model internally uses deformable convolution kernels that can adaptively adjust their convolution regions according to the shape and size of objects in the image.
The image recognition model extracts feature images of multiple scales through convolution layers with different depths. These feature maps capture information of different sizes and degrees of abstraction in the image.
The feature map of each scale is fused with the feature map after pooling operation. The pooling operation may be maximum pooling or average pooling, aimed at reducing the resolution of the feature map and retaining the most important information. The fusion operation can be simple addition or splicing, or weighted fusion can be performed by the weights obtained through learning.
In addition, the multi-scale feature map can be subjected to secondary feature fusion of different scales for a plurality of times. This means that feature maps of different scales are fused at multiple levels to make full use of information of different scales.
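A minimal sketch of the primary fusion step is shown below (PyTorch); it assumes the pooled map and the convolution map already share the same channel count, and uses element-wise addition, one of the fusion options mentioned above.

```python
import torch
import torch.nn.functional as F

def primary_fuse(conv_feat: torch.Tensor, pooled_feat: torch.Tensor) -> torch.Tensor:
    # Resize the pooled map to the convolution map's spatial size,
    # then combine by element-wise addition (concatenation or learned
    # weighting are alternatives); channel counts are assumed to match.
    pooled_up = F.interpolate(pooled_feat, size=conv_feat.shape[-2:], mode="nearest")
    return conv_feat + pooled_up
```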
Candidate target frames are generated through RPN (Region Proposal Network) on the fused multi-scale feature map. These target boxes are areas that may contain targets.
Based on the candidate target frames, performing target segmentation by using models such as Mask R-CNN and the like. By predicting whether each pixel belongs to a target, an accurate target mask is obtained. For example, in a street scene, the exact contours of cars, pedestrians, and buildings may be segmented.
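For a concrete reference point, the pipeline described here (multi-scale features, RPN proposals, per-box classification, and mask prediction) is analogous to an off-the-shelf Mask R-CNN with an FPN backbone, which torchvision provides. The snippet below runs that generic model (torchvision ≥ 0.13) purely as an illustrative baseline; it is not the model of the present embodiment.

```python
import torch
import torchvision

# Generic Mask R-CNN with an FPN backbone, used only as an analogous baseline.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 800, 800)      # placeholder for a preprocessed check image
with torch.no_grad():
    outputs = model([image])          # list of dicts, one per input image

boxes = outputs[0]["boxes"]           # candidate target boxes (N, 4)
labels = outputs[0]["labels"]         # predicted target types
scores = outputs[0]["scores"]         # classification probabilities
masks = outputs[0]["masks"]           # per-instance masks (N, 1, H, W)
```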
By implementing the image recognition method, different target objects can be accurately recognized and segmented from the image to be recognized. The method has wide application prospect in the fields of automatic driving, safety monitoring, medical image analysis and the like.
In operation S210, an image to be recognized is acquired, possibly involving acquisition of user information.
In embodiments of the present disclosure, the user's consent or authorization may be obtained prior to obtaining the user's information. For example, before operation S210, a request to acquire user information may be issued to the user. In case the user agrees or authorizes that the user information can be acquired, the operation S210 is performed.
In operation S220, a decision of performing a subsequent action is made using the user information, so as to implement an operation of the terminal or acquisition of information, and the like.
In the embodiment of the disclosure, a corresponding operation entry can be provided for the user to choose to accept or reject the automated decision result. That is, before a decision on subsequent action execution is made based on the user information, the user's instruction to grant or deny that decision may be obtained through the corresponding operation entry. If the user agrees, the decision on subsequent action execution is made based on the user information, i.e., step S220 is performed. If the user refuses, an expert decision flow is entered.
Fig. 3 schematically illustrates a flow chart of a method of feature extraction of an image to be identified according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, as shown in fig. 3, for example, a multi-scale feature map is obtained by performing feature extraction on an image to be identified using a pre-trained image recognition model through operations S321 to S327.
In operation S321, an initial feature map of an image to be recognized is acquired.
In operation S322, the initial feature map is subjected to a pooling operation, resulting in a pooled feature map.
In operation S323, feature extraction is performed on the initial feature map using a plurality of deformable convolution kernels, resulting in a multi-scale convolution feature map.
In operation S324, feature fusion is performed on the pooled feature map and the multi-scale convolution feature map, respectively, to obtain a multi-scale first fusion feature map.
In operation S325, the first fused feature map of each scale is scaled to obtain a first scaled feature map.
In operation S326, the first transformed scale feature map is re-fused with the first fused feature map having the same scale.
In operation S327, the scaling and re-fusion steps are repeated multiple times to obtain a multi-scale feature map.
In some embodiments, first, the image to be identified is input into a pre-trained image identification model. This model may have been trained on a large number of data sets to learn how to effectively extract features from images. At the first layer or layers of the model, an initial feature map of the image is typically obtained, which captures basic information such as edges, colors, textures, etc. of the image.
Next, a pooling operation is performed on the initial feature map. Pooling is a downsampling technique that reduces the size of feature maps while preserving important feature information. Common pooling operations include maximum pooling and average pooling. By the pooling operation, pooled feature maps are obtained that are smaller in size than the original feature map, but retain key feature information.
The initial feature map is then convolved using a plurality of deformable convolution kernels. The deformable convolution kernels differ from the normal ones in that they can adaptively adjust the convolution region according to the shape and size of the object in the image, thereby extracting features more accurately. Through a plurality of deformable convolution kernels with different scales, a plurality of convolution feature images with different scales can be obtained, and feature information with different sizes and abstract degrees in the images is captured by the feature images.
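Deformable convolution is available, for example, as torchvision.ops.DeformConv2d; the sketch below shows the usual pattern in which an ordinary convolution predicts the sampling offsets. The channel sizes are illustrative assumptions, not those of the embodiment.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """Sketch of a deformable convolution: a plain conv predicts per-location
    sampling offsets, which DeformConv2d uses to deform its 3x3 kernel."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # 2 offsets (dx, dy) per kernel position: 2 * 3 * 3 = 18 channels.
        self.offset_conv = nn.Conv2d(in_ch, 18, kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)
        return self.deform_conv(x, offsets)
```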
After the multi-scale convolution feature maps are obtained, they are fused with the pooling feature maps. The fusion mode can be simple addition or splicing, or weighting fusion can be carried out by the weight obtained through learning. And obtaining a first multi-scale fusion feature map through fusion operation, wherein the feature map comprises detail information of an original image and abstract feature information of multiple scales.
Next, the first fused feature map of each scale is scaled. The scaling may be achieved by upsampling or downsampling operations in order to transform feature maps of different scales to the same size for subsequent feature fusion operations. And obtaining a first transformation scale characteristic diagram through scale transformation.
After the first transformation scale feature images are obtained, the first transformation scale feature images and the first fusion feature images with the same scale are recombined. The step can further integrate the feature information with different scales, and the feature richness and the feature accuracy are improved.
And finally, repeating the steps of scale transformation and re-fusion for a plurality of times until a final multi-scale characteristic diagram is obtained. The process can be performed in a plurality of iterations, and feature information of different scales can be further fused in each iteration, so that the representation capability of the features is improved. And through multiple iterations, a final multi-scale feature map is obtained, the feature map contains rich and comprehensive feature information in the image, and powerful support is provided for subsequent tasks such as target detection and segmentation.
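One such iteration, in the spirit of a feature-pyramid top-down pass, might be sketched as follows; the use of nearest-neighbor upsampling and element-wise addition is an assumption, and the channel counts are taken to be already aligned.

```python
import torch
import torch.nn.functional as F
from typing import List

def top_down_fuse(fused_maps: List[torch.Tensor]) -> List[torch.Tensor]:
    """One round of scale transformation + re-fusion over the first-fusion maps,
    ordered from the largest (fine) to the smallest (coarse) resolution.
    Assumes every map already has the same channel count."""
    out = [fused_maps[-1]]                      # start from the coarsest map
    for feat in reversed(fused_maps[:-1]):
        upsampled = F.interpolate(out[0], size=feat.shape[-2:], mode="nearest")
        out.insert(0, feat + upsampled)         # re-fuse at the matching scale
    return out

# Repeating the round several times yields the final multi-scale feature maps:
# maps = top_down_fuse(top_down_fuse(first_fusion_maps))
```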
Through the steps, the image recognition model trained in advance can be combined, and the multi-scale, rich and accurate characteristic information can be extracted from the image to be recognized by adopting various characteristic extraction methods. Such feature information is of great importance for subsequent image processing and analysis tasks.
According to an embodiment of the present disclosure, at least two of the scale transformations are of different types.
In some embodiments, in a previous step, an initial feature map, a pooled feature map, and multi-scale convolution feature maps have been obtained, and a first feature fusion has been performed. Next, multiple scale transformations and re-fusions are performed, and at least two of the scale transformations are of different types.
The first scale transformation and re-fusion includes, for example:
Scale transformation: first, a downsampling operation is selected for the first fused feature map of a certain scale. Downsampling may be achieved by max pooling, average pooling, or convolution with a stride greater than 1, reducing the size of the feature map while retaining critical information.
Re-fusion: the downsampled feature map is fused with a first fused feature map of the same or a similar scale. This may be done by simple addition, by concatenation, or by weighted fusion using learned weights.
The second scale transformation and re-fusion (using a different type of transformation) includes, for example:
Scale transformation: unlike the first time, an upsampling operation is performed on the feature map of another scale. Upsampling may be achieved by bilinear interpolation, nearest-neighbor interpolation, or transposed convolution, increasing the size of a feature map so that it matches another feature map.
Re-fusion: the upsampled feature map is fused with the corresponding first fused feature map or with a previously fused feature map. Likewise, the fusion may be addition, concatenation, or weighted fusion.
Subsequent scale transformations and re-fusions may alternate between upsampling and downsampling, or introduce other types of scale transformation, such as feature integration between different levels of a pyramid structure. After each transformation, the transformed feature map is fused with the corresponding feature map to enrich the feature representation.
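The two transformation types can be sketched as follows; the choice of max pooling, bilinear interpolation, and fusion by addition is illustrative rather than prescriptive.

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 256, 64, 64)   # an illustrative first-fusion feature map

# Scale transformation type 1: downsampling (max pooling halves the resolution).
down = F.max_pool2d(x, kernel_size=2, stride=2)                        # -> (1, 256, 32, 32)

# Scale transformation type 2: upsampling (bilinear interpolation doubles it);
# a transposed convolution would be a learnable alternative.
up = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

# Re-fusion with a feature map of the matching scale (element-wise addition).
other = torch.rand(1, 256, 32, 32)
refused = down + other
```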
Through the multi-scale transformation and re-fusion process, a series of multi-scale feature images fused with different scale information can be obtained. The feature images not only contain detail information of the original images, but also integrate features of different abstract levels, and powerful support is provided for subsequent tasks such as target detection and segmentation.
It should be noted that in practical applications, the specific scaling and fusion may need to be adjusted and optimized according to the specific task and data set to achieve the best performance and effect.
Fig. 4 schematically illustrates a flow chart of a method of determining candidate target boxes according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, at least one candidate target frame within the multi-scale feature map is determined, for example, by operations S431-S433, as shown in fig. 4.
In operation S431, a plurality of label boxes of the multi-scale feature map are determined.
In operation S432, a sliding window is used to perform target selection on the multi-scale feature map, so as to obtain a plurality of anchor frames.
In operation S433, at least one candidate target frame is determined according to the offset of the anchor frame and the annotation frame, and the candidate target frame characterizes the classification probability and the positional offset of the target.
In some embodiments, during training, the dataset is annotated: an accurate position is marked in the image for each target object, forming an annotation box (e.g., a rectangular box). These annotation boxes serve as supervisory information that guides the model in learning how to detect the target. In the feature extraction stage, the annotation boxes are mapped onto the corresponding multi-scale feature maps to form the annotation boxes of the multi-scale feature maps.
Sliding windows are a method of object selection. Sliding windows of different sizes and aspect ratios are set on the multi-scale feature map, each position of the feature map is traversed, and each window position corresponds to a potential target area, namely an anchor frame. The number of anchor boxes is typically large, covering possible target locations on the feature map.
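As an illustration, sliding-window anchor generation over a feature-map grid can be written as below; the anchor sizes, aspect ratios, and stride are assumed values, not parameters fixed by the embodiment.

```python
import torch

def make_anchors(feat_h: int, feat_w: int, stride: int,
                 sizes=(32, 64, 128), ratios=(0.5, 1.0, 2.0)) -> torch.Tensor:
    """Enumerate anchors (x1, y1, x2, y2) at every feature-map cell: the sliding
    window visits each cell and places one anchor per size/ratio combination."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in sizes:
                for r in ratios:
                    w, h = s * (r ** 0.5), s / (r ** 0.5)   # illustrative ratio convention
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(anchors)
```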
For each anchor box, the offset from the label box is calculated, including the offset of the center point, the offset of the width, the offset of the height, and so on. These offsets reflect the position and size differences between the anchor frame and the real target. The anchor frame is then adjusted according to these offsets to more closely approximate the actual annotation frame.
Meanwhile, the model predicts the classification probability corresponding to each anchor frame, namely whether the anchor frame contains the target and the class of the target. By combining the classification probability and the position offset, anchor frames with higher classification probability and more accurate positions can be screened out and used as candidate target frames. The candidate target frame not only characterizes the classification probability of the target, but also contains the accurate position of the target in the image (obtained by calculating the position offset). These candidate target boxes will serve as the basis for subsequent target detection or segmentation tasks.
For example, there is a multi-scale feature map that contains feature information of different scales. A series of anchor boxes are first generated on the feature map using a sliding window. These anchor boxes are then compared to the label boxes to calculate the offset between them.
For example, for a certain anchor frame, its center point is offset from that of the annotation frame by (dx, dy), its width offset is dw, and its height offset is dh. The model learns how to adjust the position and size of the anchor frame based on these offsets so that it comes closer to the annotation frame. At the same time, the model predicts the classification probability of the anchor frame, such as whether it contains a car target.
A series of candidate target frames can be obtained by traversing all anchor frames and calculating the offset and classification probability of the anchor frames and the annotation frames. These candidate object boxes contain not only the classification information of the object, but also precisely indicate the position of the object in the image. Subsequent object detection or segmentation tasks will be performed based on these candidate object boxes.
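The offsets (dx, dy, dw, dh) referred to above are commonly parameterized as in the conventional R-CNN encoding sketched below; the exact parameterization used by the embodiment may differ.

```python
import torch

def encode_offsets(anchor: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Conventional R-CNN offset encoding between an anchor box and an
    annotation box, both given as (x1, y1, x2, y2) tensors."""
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    acx, acy = anchor[0] + aw / 2, anchor[1] + ah / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    gcx, gcy = gt[0] + gw / 2, gt[1] + gh / 2

    dx, dy = (gcx - acx) / aw, (gcy - acy) / ah      # center-point offsets
    dw, dh = torch.log(gw / aw), torch.log(gh / ah)  # width / height offsets
    return torch.stack([dx, dy, dw, dh])
```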
Fig. 5 schematically illustrates a method flow diagram of segmenting an object within an image to be identified according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, as shown in fig. 5, a target within an image to be identified is segmented according to a candidate target frame, for example, through operations S541 to S543.
In operation S541, object classification is performed on the multi-scale feature map, so as to obtain multiple groups of object types.
In operation S542, initial bounding boxes for each type of object are determined within the candidate object boxes, respectively, according to the object type. And
In operation S543, mask segmentation is performed on the initial bounding box to determine a target within the image to be identified.
In some embodiments, after the candidate target boxes are determined, the multi-scale feature map first needs to be subjected to target classification. This may be achieved by a classifier that predicts the type of object that may be contained within each candidate object box based on information in the feature map. The classifier may be based on a deep learning model, such as a Convolutional Neural Network (CNN). Through training, the model can learn the characteristic representations of different target types and output the category probability of each candidate target frame.
For example, in an image containing cars, pedestrians, and buildings, the classifier may assign one or more class labels, such as "car," "pedestrian," or "building," to each candidate target frame.
After the object type is obtained, the accurate boundary of the object needs to be further determined in each candidate object frame. This may be accomplished by a regression model that predicts the boundary offsets for each candidate target frame for adjusting the position and size of the candidate target frame to more accurately fit the actual boundary of the target.
The regression model may also be based on deep learning, which learns the mapping from feature maps to boundary offsets. By training, the model can predict the accurate boundary of the corresponding target of each candidate target frame.
Taking an automobile as an example, it is assumed that a candidate target frame initially covers most of the area of the automobile, but the boundary is not accurate enough. The regression model predicts the offset of the boundary based on the feature map information and then adjusts the position and size of the candidate target frame to more closely surround the car.
After the initial bounding box is obtained, the target is finely segmented, for example, by using a mask segmentation technology. Mask segmentation is a pixel-level segmentation method that is capable of outputting mask images of the same size as the input image, where each pixel is marked as belonging to a certain object or background.
During mask segmentation, a deep learning model (e.g., a full convolutional network FCN) is used to classify the regions within the initial bounding box pixel by pixel. The model predicts the class label of each pixel according to the information of the feature map and generates a corresponding mask image. Each pixel in the mask image indicates whether the corresponding position in the original image belongs to the target area.
And finally, fusing the mask image with the original image to obtain a segmented target image. In the fusion process, pixels belonging to the target region may be left, while pixels of the background region are set to be transparent or a specific color, thereby clearly displaying the outline and shape of the target.
For example, there is a street view image containing cars and pedestrians, and candidate target frames and their target types are obtained through the previous steps. For one of the candidate target frames labeled "car", its initial bounding box is further determined. Then, the region within the initial bounding box is classified pixel by pixel using the mask segmentation model to generate a mask image of the same size as the initial bounding box. Each pixel in this mask image is labeled "car" or "background". And finally, fusing the mask image with the original image to obtain the segmented automobile target image. Likewise, similar segmentation operations may be performed on other types of targets.
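A minimal sketch of the fusion step described above, assuming the mask and image are NumPy arrays; the function name fuse_mask and the array shapes are purely illustrative. Pixels marked as target are kept and background pixels are set to a fixed value.

    import numpy as np

    def fuse_mask(image, mask, background=0):
        # Keep pixels where mask == 1 (target) and set background pixels to a fixed value.
        keep = mask.astype(bool)[..., None]              # (H, W, 1) so it broadcasts over the color channels
        return np.where(keep, image, background).astype(image.dtype)

    image = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)  # stand-in for the cropped street view
    mask = np.zeros((64, 64), dtype=np.uint8)
    mask[20:50, 10:55] = 1                                          # pixels predicted as "car"
    segmented = fuse_mask(image, mask)                              # background pixels become black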
Through the steps, the targets in the image to be identified can be accurately segmented according to the candidate target frames, and powerful support is provided for subsequent tasks such as target identification and scene understanding.
Fig. 6 schematically illustrates a flowchart of an image recognition method according to another embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 6, the image recognition method further includes, for example, operations S610 to S620 before classifying the objects of the multi-scale feature map.
In operation S610, the multi-scale feature map is unified in size to obtain a target feature map.
And
In operation S620, the object classification is performed on the object feature map, so as to obtain multiple groups of object types.
In some embodiments, the multi-scale feature maps may have different sizes and resolutions in practical applications, which may present difficulties for subsequent object classification tasks. Therefore, it is desirable to unify the dimensions of the multi-scale feature maps so that effective target features can be uniformly processed and extracted.
There are various methods of size unification, including, for example, interpolation (e.g., bilinear interpolation), max pooling, or average pooling. These methods scale or crop the feature map to the set target size so that the feature maps have a uniform size and resolution.
For example, assuming multiple multi-scale feature maps of different sizes, a fixed target size (e.g., 256 x 256 pixels) may be selected and each feature map scaled to that target size using bilinear interpolation. In this way, a series of uniformly sized target feature maps are obtained.
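A possible sketch of this size-unification step, assuming the multi-scale feature maps are PyTorch tensors of shape (N, C, H, W); the target size of 256×256 follows the example above and is otherwise arbitrary.

    import torch
    import torch.nn.functional as F

    def unify_sizes(feature_maps, target_size=(256, 256)):
        # Bilinearly rescale each (N, C, H, W) feature map to one common spatial size.
        return [F.interpolate(fm, size=target_size, mode="bilinear", align_corners=False)
                for fm in feature_maps]

    maps = [torch.randn(1, 64, 112, 112), torch.randn(1, 64, 56, 56), torch.randn(1, 64, 28, 28)]
    unified = unify_sizes(maps)   # every map is now 1 x 64 x 256 x 256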
After the target feature map is obtained, the target feature map can be classified by using a trained classifier. The selection of the classifier may be determined based on specific tasks and data sets, including, for example, support Vector Machines (SVMs), random forests, deep learning models, and the like.
Taking the deep learning model as an example, a neural network structure comprising a plurality of convolution layers, pooling layers and full-connection layers can be constructed. Through training, the model can learn the characteristic representations of different target types, and output the class probability corresponding to each target characteristic graph.
During the training process, the data set with the labeling information needs to be used to supervise the learning process of the model. The annotation information includes, for example, the location (e.g., bounding box) of each object in the image and a category label. By continuously optimizing parameters of the model, the target feature images can be accurately classified.
Finally, for each target feature map, the classifier outputs a class probability vector, where each element represents the probability that the feature map belongs to a particular class. The most likely category may be selected as the type of target based on a probability threshold or ranking, etc.
For example, there is an image dataset containing three types of objects, automobile, pedestrian, and building. Firstly, extracting multi-scale feature images, and unifying the multi-scale feature images to obtain a series of 256×256-pixel target feature images.
These target feature maps are then classified using a pre-trained deep learning model (e.g., convolutional neural network). The model predicts each feature map based on the learned feature representation and outputs a class probability vector.
For example, for a target feature map, the model may output the following category probability vectors: [0.1,0.8,0.1] this means that the probability of the feature map belonging to a pedestrian is 0.8 and the probability of belonging to an automobile and a building is 0.1, respectively. Based on the probability vector, the target type corresponding to the target feature map can be determined to be a pedestrian.
By performing similar classification operation on all the target feature graphs, multiple groups of target types can be obtained finally, so that a basis is provided for subsequent target detection, identification or segmentation tasks.
According to embodiments of the present disclosure, the multi-scale feature maps include, for example, multi-scale feature maps in which candidate target boxes have been identified and multi-scale feature maps in which candidate target boxes have not been identified.
In some embodiments, for multi-scale feature maps in which candidate target boxes have been identified, information of these candidate target boxes may be utilized to assist in target classification. Specifically, the multi-scale features of the corresponding region may be extracted according to the position and size of the candidate target frame. These features may include a variety of information such as color, texture, shape, etc., which may help to more accurately determine the class of the object.
For example, in an image containing a plurality of pedestrians, several candidate target frames have been identified, for example, by some method. For the candidate target frames, the features of the corresponding areas of the candidate target frames can be extracted from the multi-scale feature map and input into a classifier for classification. The classifier may consider, for example, feature information at different scales to more fully evaluate the class of the object.
For multi-scale feature maps in which candidate target frames are not identified, potential target positions need to be determined in other ways, and the corresponding features are extracted for classification. This involves, for example, the application of object detection algorithms such as sliding windows and Region Proposal Networks (RPNs).
Taking a sliding window as an example, windows of different sizes and aspect ratios may be set on the multi-scale feature map and the entire feature map traversed. And extracting corresponding features from each window position, and inputting the features into a classifier for classification. The classifier judges whether the window contains the target or not according to the characteristics, and outputs corresponding class probability.
Finally, the classification results of the identified and unidentified candidate target frames are combined to obtain a final target classification result. This may be achieved by fusing or weighting the classification results of both. For example, the results of the fusion of the two may be weighted according to the confidence level or classification probability of the candidate target frame to obtain a more accurate classification result.
For example, there is an image containing various objects such as automobiles, pedestrians, and buildings. Candidate target frames are first identified by some method (e.g., a target detection algorithm), and the multi-scale feature maps corresponding to the frames are classified. Meanwhile, the sliding window traversal and classification are also performed on the areas where the candidate target frames are not identified.
For the identified candidate target frames, their corresponding multi-scale features are extracted and input into a deep learning classifier for classification. The classifier outputs the probabilities that each target box belongs to a different class.
And traversing the multi-scale feature map through the sliding window for the unrecognized area, and extracting the features of each window position for classification. Likewise, the classifier outputs the probabilities that each window belongs to a different class.
And finally, fusing the classification results of the identified and unidentified candidate target frames. For example, the classification probabilities of the two may be weighted averaged to obtain a final target classification result. Therefore, the information of the multi-scale feature map can be fully utilized to accurately classify the targets in the image.
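One possible way to realize the weighted fusion described above, sketched with NumPy; the weight of 0.8 and the three-class probability vectors are illustrative values only.

    import numpy as np

    def fuse_predictions(p_identified, p_unidentified, weight):
        # Confidence-weighted average of two class-probability vectors covering the same region.
        fused = weight * p_identified + (1.0 - weight) * p_unidentified
        return fused / fused.sum()                       # renormalize to a probability vector

    p_box = np.array([0.7, 0.2, 0.1])   # from an identified candidate target frame
    p_win = np.array([0.5, 0.4, 0.1])   # from sliding-window classification of the same area
    print(fuse_predictions(p_box, p_win, weight=0.8))    # e.g. [car, pedestrian, building]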
By the method, the target classification can be carried out by combining the multi-scale feature images of the identified and unidentified candidate target frames, so that the accuracy and the reliability of classification are improved.
Fig. 7 schematically illustrates a flowchart of an image recognition method according to a further embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 7, the image recognition method further includes, for example, operation S710 before feature extraction is performed on an image to be recognized using a pre-trained image recognition model.
In operation S710, an image to be identified is preprocessed, the preprocessing including any one or more of wavelet transformation, affine transformation, and histogram equalization.
In some embodiments, preprocessing is a critical step in the image recognition task, which can help improve image quality, reduce noise and distortion, and enhance useful information in the image, thereby improving the accuracy of subsequent feature extraction and recognition. Preprocessing techniques are numerous, including but not limited to wavelet transformation, affine transformation, and histogram equalization. The method of image preprocessing will be exemplified below in conjunction with these techniques.
Wavelet transformation is a multi-scale analysis tool for signal and image processing that works by decomposing a signal or image into a series of wavelet coefficients. For image preprocessing, wavelet transforms are commonly used for image compression and denoising.
For example, there is an image to be identified that contains noise due to the device or environment. Wavelet transforms may be applied to denoise the image. First, the image is decomposed into wavelet coefficients of different frequencies. Then, a threshold is set to filter out a high frequency coefficient corresponding to noise. And finally, reconstructing an image through inverse wavelet transformation to obtain a denoised image.
Affine transformation is a linear transformation that can maintain the "flatness" of an image (i.e., the original straight lines and parallel lines remain straight and parallel after transformation). It is commonly used for geometric corrections of images such as rotation, scaling and tilt corrections.
For example, for an image to be recognized, if the camera is tilted or the image is scaled when photographing, objects may appear tilted or non-parallel. By applying an affine transformation, the direction and scale of the image can be adjusted to fit normal visual perception. For example, if the image is tilted, the tilt angle may be calculated and the corresponding rotation matrix applied to correct the image.
Histogram equalization is a method for improving the contrast of an image that works by stretching the pixel intensity distribution of the image so that the contrast of the image is more uniform.
For example, if the contrast of the image to be identified is low, the target details may not be sufficiently sharp. Through histogram equalization, the contrast of the image can be increased, and the target features are more prominent. This process typically involves computing a histogram of the image and then applying a transform function to redistribute the pixel intensities so that the histogram distribution is more uniform.
It will be appreciated that, in practical applications, these preprocessing techniques may be used alone or in combination to achieve a better preprocessing effect. For example, the geometric distortion of the image may first be corrected by affine transformation, then wavelet transformation may be applied to remove noise, and finally the contrast of the image may be enhanced by histogram equalization. In this way, the preprocessed image will be better suited for subsequent feature extraction and image recognition tasks.
By way of example above, it can be seen that image preprocessing plays an important role in the image recognition process. By selecting and applying a proper preprocessing technology, the image quality can be improved, and the accuracy and reliability of identification can be improved.
Fig. 8A schematically illustrates a block diagram of an image recognition model according to an embodiment of the present disclosure. Fig. 8B schematically illustrates a flow diagram of image preprocessing according to an embodiment of the present disclosure. Fig. 8C schematically illustrates a block diagram of a residual block according to an embodiment of the present disclosure. Fig. 8D schematically illustrates a block diagram of a feature pyramid network in accordance with an embodiment of the present disclosure.
In some embodiments, check image areas are detected, for example, using a Mask R-CNN model, as shown in FIG. 8A. The check image fed into the model for recognition is preprocessed before feature extraction is performed on the image.
In addition, in order to increase the number of training check image samples while enhancing the data characteristics of the check images during model training, various check image augmentation algorithms are used, including wavelet transformation, affine transformation, and histogram equalization, to preprocess the check image and obtain a normalized check image.
To address the drawbacks that check images frequently contain a large amount of noise and that check area characteristics are not obvious, a wavelet augmentation scheme is adopted to improve the quality of the check image, highlight important area features, and reduce image noise, thereby providing better input for subsequent check image area detection and recognition tasks. The wavelet transformation augmentation of check images comprises the following three main steps:
first, an original image is loaded and wavelet decomposition is performed, and a specific formula is as follows:
I=A+H+V+D (1)
Where I is the original check image and A, H, V, D represents the wavelet transformed low frequency component, the horizontal direction high frequency component, the vertical direction high frequency component, and the diagonal direction high frequency component, respectively.
Second, to effectively remove noise in the check image, the dynamic range of the image may be adjusted, the contrast in the check image may be improved, and important details and structures in the check image may be preserved. This is critical in the identification and analysis of check image areas, particularly in the area localization of numbers and text.
By adopting a wavelet threshold denoising method, a threshold is set so that wavelet coefficients smaller than the threshold are set to zero while larger coefficients are retained, thereby removing noise from the check image. The specific formula is as follows:
A′=sgn(A)·(|A|−threshold)+, H′=sgn(H)·(|H|−threshold)+, V′=sgn(V)·(|V|−threshold)+, D′=sgn(D)·(|D|−threshold)+ (2)
Where A, H, V, D are the low-frequency component, the horizontal-direction high-frequency component, the vertical-direction high-frequency component, and the diagonal-direction high-frequency component of the wavelet transform of the original check image; A′, H′, V′, D′ are the denoised wavelet coefficients; sgn(·) is the sign function; |·| denotes the absolute value of a coefficient; threshold is the threshold value; and (x)+ denotes the maximum of x and zero.
Third, by adjusting the weight of the high-frequency components, the quality of the check image is enhanced while the important information in the image is further emphasized. The specific calculation formula is as follows:
I′=A′+α·(H′+V′+D′) (3)
where α is a weight coefficient for adjusting the contribution of the high frequency component to the amplified image. I' is the wavelet transformed augmented check image.
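A sketch of the wavelet augmentation of equations (1) to (3), assuming the PyWavelets library and a grayscale check image; the 'haar' wavelet, the threshold value, and the weight α used below are illustrative assumptions, not values prescribed by the embodiment.

    import numpy as np
    import pywt

    def wavelet_augment(image, threshold=10.0, alpha=1.2, wavelet="haar"):
        # Eq. (1): decompose the image into A (low frequency) and H, V, D (high frequency) components.
        A, (H, V, D) = pywt.dwt2(image.astype(float), wavelet)
        # Eq. (2): soft-threshold the wavelet coefficients to suppress noise.
        soft = lambda c: np.sign(c) * np.maximum(np.abs(c) - threshold, 0.0)
        A2, H2, V2, D2 = soft(A), soft(H), soft(V), soft(D)
        # Eq. (3): re-weight the high-frequency components and reconstruct the augmented image.
        return pywt.idwt2((A2, (alpha * H2, alpha * V2, alpha * D2)), wavelet)

    noisy_check = np.random.rand(128, 256) * 255.0   # stand-in for a grayscale check image
    augmented = wavelet_augment(noisy_check)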
In addition, the check image may be distorted, rotated, or perspective-transformed due to scanning, photographing, and the like. Affine transformation is used to simulate the rotation, warping, and deformation of check images so as to increase the adaptability of the model to checks of different shapes and poses. The specific implementation formula is as follows:
I′=A(θ)·I+b (4)
Where θ is the rotation angle, b is the translation vector, I′ represents the check image sample after affine transformation augmentation, and A(θ) represents the rotation matrix, whose specific calculation formula is as follows:
A(θ)=[[cos θ, −sin θ], [sin θ, cos θ]] (5)
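A sketch of the affine augmentation of equations (4) and (5), assuming OpenCV and NumPy; the rotation angle and translation vector below are illustrative values only.

    import numpy as np
    import cv2

    def affine_augment(image, theta_deg=5.0, b=(3.0, -2.0)):
        # Eq. (5): 2x2 rotation matrix A(theta); Eq. (4): apply A(theta) plus translation b.
        theta = np.deg2rad(theta_deg)
        A = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        M = np.hstack([A, np.array(b, dtype=float).reshape(2, 1)])  # 2x3 affine matrix [A(theta) | b]
        h, w = image.shape[:2]
        return cv2.warpAffine(image, M, (w, h))

    check = (np.random.rand(128, 256) * 255).astype(np.uint8)
    rotated = affine_augment(check)   # simulated rotation and translation of the check image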
In addition, contrast and brightness of check images are important quality indicators in check image preprocessing. Lower contrast and brightness can result in unclear image details, affecting the visual effect and recognition accuracy of the image. The contrast and brightness uniformity of check images are enhanced by adopting a histogram equalization mode, so that details of the images can be clearer and brighter.
Histogram equalization is described as follows. Any pixel x_{i,j} in the original check image denotes the pixel of the image sample with coordinates (i, j). With the gray values of all pixels lying in [0, L−1], the gray-level probability function p(k) of the image can be described by the equation:
p(k)=N_k/N, k=0, 1, …, L−1 (6)
Where N is the total number of pixels in the original check image sample and N_k represents the number of pixels with gray level k. The cumulative distribution function c(k) of the gray levels of the original check image sample can then be expressed as:
c(k)=Σ_{j=0}^{k} p(j) (7)
the gray scale distribution f (I) of the amplified check image samples generated by the histogram equalization is as follows:
f(I)=(L-1)×c(k) (8)
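A sketch of the histogram equalization of equations (6) to (8) for an 8-bit grayscale check image, written with NumPy; the function name equalize is illustrative.

    import numpy as np

    def equalize(image, L=256):
        # Eq. (6): gray-level probability p(k) = N_k / N for an 8-bit image.
        hist = np.bincount(image.ravel(), minlength=L)
        p = hist / image.size
        # Eq. (7): cumulative distribution function c(k).
        c = np.cumsum(p)
        # Eq. (8): mapping f(I) = (L - 1) * c(k), applied to every pixel.
        lut = np.round((L - 1) * c).astype(np.uint8)
        return lut[image]

    check = (np.random.rand(128, 256) * 255).astype(np.uint8)
    equalized = equalize(check)   # contrast-enhanced check image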
As shown in fig. 8B, the original check image may also be subjected to the above three preprocessing steps to obtain augmented check images, namely an affine-transformed check image, a wavelet-transformed check image, and a histogram-equalized check image, which are obtained by applying affine transformation, wavelet transformation, and histogram equalization to the original check image, respectively.
After the check image preprocessing, the preprocessed check image is first resized to 224 x 224 pixel images.
Then, the ResNet50-DCN backbone network processes the image through a convolution layer and a pooling layer followed by four stages of residual blocks, the stages containing 3, 4, 6, and 3 residual blocks, respectively. The improvement of the ResNet50-DCN network over the standard ResNet network lies in replacing the 3×3 convolution kernel in each residual block with a deformable convolution kernel.
The deformable convolution kernel has an adaptive shape, and can capture the information of complex structures and shapes in the image more flexibly, thereby helping to improve the performance of the model, especially in the scene with irregular shapes, such as processing check images. The ResNet-DCN network is more suitable for check image processing tasks, and the sensitivity of the detection model to check image detail and shape change is enhanced.
The specific implementation process of the ResNet50-DCN backbone network is as follows:
The first convolution layer and the pooling layer use convolution kernels of 7×7 and 3×3 size, where the 7×7 kernel is used to capture larger features and the 3×3 kernel is used to capture smaller features. Each convolution kernel generates a feature map; after the convolution operation, a ReLU activation function is applied to the feature map generated by each kernel, introducing nonlinearity and improving the representation capability of the model.
Next, a maximum pooling operation with a 3×3 kernel and a stride of 2 is applied to process the activated feature map, reducing the spatial dimension and the amount of computation while retaining important features. The specific formula for the ReLU activation is as follows:
f(x)=max(0,x) (9)
As shown in FIG. 8C, after the check image passes through the first convolution layer and the pooling layer, the generated feature map first enters the first stage of the ResNet construction, which contains 3 modified residual blocks. Each residual block reduces the number of channels of the input feature map by a 1×1 convolution kernel to reduce the computational cost.
Then, in order to more flexibly sense the non-rigid deformation in the check image, the 3×3 convolution kernel is replaced by a deformable convolution kernel, so that the convolution kernel can dynamically adjust the shape in a local area to adapt to the complex structure and shape change in the image.
The resulting feature map is then passed as input to the next layer, where another 1×1 convolution layer adjusts the number of channels in preparation for the residual connection.
Finally, the input check image feature map is average-pooled through a shortcut connection and added to the convolved output to form the residual structure.
In ResNet-DCN networks, the first stage of construction goes through the process of repeating the residual block three times. The feature map output size at this stage is 112×112.
The generated feature map is then fed into the other three stages of the ResNet: stage 2 includes 4 modified residual blocks, stage 3 includes 6 modified residual blocks, and stage 4 includes 3 modified residual blocks. The output feature map of each stage gradually decreases in size, to 56×56, 28×28, and 14×14, respectively.
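A sketch of a modified residual block with a deformable 3×3 convolution, assuming PyTorch and torchvision.ops.DeformConv2d; the channel numbers are illustrative, the sampling offsets are predicted by an auxiliary 3×3 convolution, and the shortcut here is a plain identity rather than the average-pooled shortcut described above.

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformableBottleneck(nn.Module):
        # Residual block in which the 3x3 convolution is replaced by a deformable convolution.
        def __init__(self, channels, mid_channels):
            super().__init__()
            self.reduce = nn.Conv2d(channels, mid_channels, kernel_size=1)   # 1x1: reduce channels
            self.offset = nn.Conv2d(mid_channels, 2 * 3 * 3, kernel_size=3, padding=1)  # sampling offsets
            self.deform = DeformConv2d(mid_channels, mid_channels, kernel_size=3, padding=1)
            self.expand = nn.Conv2d(mid_channels, channels, kernel_size=1)   # 1x1: restore channels
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            y = self.relu(self.reduce(x))
            y = self.relu(self.deform(y, self.offset(y)))   # deformable 3x3 with learned offsets
            y = self.expand(y)
            return self.relu(y + x)                         # shortcut connection forms the residual

    block = DeformableBottleneck(channels=256, mid_channels=64)
    out = block(torch.randn(1, 256, 56, 56))                # shape preserved: 1 x 256 x 56 x 56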
In check image processing, after feature extraction is performed through the ResNet50-DCN backbone network, the backbone part of the network generates feature maps of four sizes, corresponding to the outputs of the last four stages of the feature extraction network, with output sizes of C1 (112×112), C2 (56×56), C3 (28×28), and C4 (14×14).
Because objects with different sizes exist in the check image, such as characters and seals, in order to better cope with the detection requirement of the multi-scale irregular object, as shown in fig. 8D, a feature pyramid network (PA-FPN) with improved path aggregation structure is adopted, and feature graphs with four sizes are used as input.
Check feature maps of different sizes are better integrated in the PA-FPN network architecture, which uses both top-down and bottom-up paths together with path aggregation.
In the top-down part of the PA-FPN, feature maps of four sizes (P4, P3, P2, P1) are generated. The feature map P4 comes directly from the feature map C4 without additional processing. P3 is generated by upsampling P4 by a factor of 2 and adding it to the feature map C3, realizing a lateral connection; the feature maps P2 and P1 are obtained in the same way.
The bottom-up portion of the PA-FPN also produces feature maps of four sizes (N4, N3, N2, N1). The feature map N1 is generated directly from the feature map P1 without additional processing; the feature map N2 is obtained by first downsampling N1 by a factor of 2 and then adding it to the feature map P2 (followed by averaging), realizing a lateral connection; the feature maps N3 and N4 are obtained in the same way.
After feature extraction through the ResNet50-DCN backbone network and multi-scale feature fusion through the PA-FPN network, the output feature maps are (N1, N2, N3, N4).
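A simplified sketch of the PA-FPN fusion in PyTorch, assuming all input feature maps already share the same channel count; the 1×1 lateral convolutions and smoothing convolutions of a full implementation are omitted, and average pooling stands in for the 2× downsampling.

    import torch
    import torch.nn.functional as F

    def pa_fpn_fuse(C1, C2, C3, C4):
        # Top-down path: upsample and add (lateral connections) to obtain P4..P1.
        P4 = C4
        P3 = C3 + F.interpolate(P4, scale_factor=2, mode="nearest")
        P2 = C2 + F.interpolate(P3, scale_factor=2, mode="nearest")
        P1 = C1 + F.interpolate(P2, scale_factor=2, mode="nearest")
        # Bottom-up path: downsample and add to obtain N1..N4.
        N1 = P1
        N2 = P2 + F.avg_pool2d(N1, kernel_size=2)
        N3 = P3 + F.avg_pool2d(N2, kernel_size=2)
        N4 = P4 + F.avg_pool2d(N3, kernel_size=2)
        return N1, N2, N3, N4

    C1, C2, C3, C4 = (torch.randn(1, 256, s, s) for s in (112, 56, 28, 14))
    N1, N2, N3, N4 = pa_fpn_fuse(C1, C2, C3, C4)   # spatial sizes 112, 56, 28, 14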
As shown in fig. 8A, the feature map output by the feature pyramid network (PA-FPN) is input to the RPN network to determine candidate target frames.
First, a set of anchor boxes is generated using a 3×3 sliding window on the feature map of each size (N1, N2, N3, N4) output from the PA-FPN network. The reference sizes of these anchor boxes are [64×64, 128×128, 256×256, 512×512] and the aspect ratios are [1:1, 2:1, 1:2].
Next, ioU (Intersection over Union, cross-over) between the anchor boxes and the label boxes is calculated, dividing each anchor box into a positive sample (IoU > 0.7, target present) and a negative sample (IoU < 0.3, background or no target). For an anchor box that matches a positive sample, the offset between it and the label box is calculated.
Then, a label is assigned to each anchor box based on the matching IoU, including positive, negative, and ignore samples (IoU between 0.3 and 0.7). Positive sample anchor boxes with higher probabilities are selected and converted to RoIs (candidate target boxes) for subsequent processing.
Finally, RPN outputs RoIs, each RoI has a corresponding classification probability and positional offset.
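A sketch of the IoU-based anchor labeling used by the RPN, written with NumPy; the thresholds 0.7 and 0.3 follow the description above, while the function names are illustrative.

    import numpy as np

    def iou(a, b):
        # Intersection over Union of two (x1, y1, x2, y2) boxes.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union

    def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
        # Assign positive / negative / ignore based on the best IoU with any annotation box.
        best = max(iou(anchor, gt) for gt in gt_boxes)
        if best > pos_thr:
            return "positive"
        if best < neg_thr:
            return "negative"
        return "ignore"

    gts = [np.array([100., 80., 190., 150.])]
    print(label_anchor(np.array([96., 78., 186., 148.]), gts))   # high overlap -> positive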
As shown in fig. 8A, the feature map with candidate target frames and the feature map without candidate target frames output by the feature pyramid network (PA-FPN) are input to the ROI alignment layer, so as to align the feature maps of different scales.
The ROI alignment layer shares the extracted RoIs feature map with ResNet-DCN backbone network, achieving feature alignment for each ROI by:
The feature map generated by the feature pyramid network (PA-FPN) and the interior of the ROI generated by the RPN network are uniformly divided into 4×4 grid cells.
Coordinates of the feature points inside each grid cell are calculated, and floating point coordinates of the RoI are calculated by using bilinear interpolation.
Then, the feature map is sampled at each feature point using bilinear interpolation to obtain a feature value for each grid cell inside the RoI.
Finally, the sampled features inside all 4×4 grid cells are averaged to form the final ROI Align feature map of 7×7 size.
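A sketch of the ROI alignment step using torchvision.ops.roi_align; the spatial scale, sampling ratio, and box coordinates below are illustrative assumptions and do not exactly reproduce the 4×4 grid sampling described above.

    import torch
    from torchvision.ops import roi_align

    feature_map = torch.randn(1, 256, 56, 56)        # one PA-FPN level, stride 4 relative to a 224x224 input
    rois = torch.tensor([[0.0, 32.0, 48.0, 160.0, 176.0]])   # (batch_index, x1, y1, x2, y2) in image coordinates
    pooled = roi_align(feature_map, rois, output_size=(7, 7),
                       spatial_scale=1.0 / 4.0,      # maps image coordinates onto this feature level
                       sampling_ratio=4)             # bilinear samples per output bin, then averaged
    print(pooled.shape)                              # torch.Size([1, 256, 7, 7])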
As shown in fig. 8A, detection classification and instance segmentation are performed on the target on the basis of the ROI alignment, to obtain the final image recognition result.
The 7×7 feature map output by the ROI alignment layer is input to the target classification network, typically through a fully connected layer (RoI head). This fully connected layer applies a softmax activation function to generate a probability distribution over the different target classes for each RoI, thereby completing the target classification of the RoIs.
Next, a fully connected layer is additionally applied on the 7×7 feature map of the ROI alignment output for performing regression of the target bounding box. The output of this fully connected layer represents the coordinate offsets of the bounding box, which are applied to the initial bounding box coordinates of the candidate target box to obtain a more accurate target position.
Finally, a Mask segmentation branch for instance segmentation is introduced. A convolution layer and an up-sampling layer are added on top of the ROI Align output to convert the 7×7 feature map into a high-resolution feature map. For each RoI, a binary cross-entropy loss function is employed to perform pixel-level Mask prediction, generating a binary Mask for the object instance that identifies the specific location of the object in the image.
Fig. 9 schematically illustrates a flow chart of a method of training an image recognition model according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 9, an image recognition model is trained, for example, through operation S921.
In operation S921, the image recognition model is iteratively optimized using a joint loss function including a target classification loss, a bounding box regression loss, and a mask segmentation loss.
In some embodiments, training of the image recognition model is a complex process involving the combination and optimization of multiple loss functions. When the model needs to complete tasks such as target classification, bounding box regression, mask segmentation and the like at the same time, the iterative optimization is an effective method by adopting a joint loss function. The training method of the model will be exemplified below in conjunction with these loss functions.
The joint loss function is, for example, formed by weighted combination of a plurality of sub-loss functions, each corresponding to a task that the model needs to perform. In this example, the joint loss function includes a target classification loss, a bounding box regression loss, and a mask segmentation loss.
The target classification penalty is used to measure the performance of the model on the classification task. The classification loss function includes, for example, cross entropy loss, etc. During the training process, the model receives the tagged image data and outputs a predicted class probability for each object. The target classification loss calculates the difference between the predicted probability and the true label and updates the weights of the model by a back propagation algorithm to reduce this difference.
The bounding box regression penalty is used to optimize the performance of the model in locating the target location. When the model outputs a predicted bounding box of the target, the bounding box regression loss calculates the difference between the predicted bounding box and the true bounding box. The bounding box regression loss function includes, for example, a smoothed L1 loss, etc. By optimizing this loss function, the model is able to learn more about the location of the target.
Mask segmentation loss is used to evaluate the ability of the model to segment the object at the pixel level. In an instance segmentation task, the model needs to output a pixel level mask for each object. Mask segmentation loss measures the difference between the predicted mask and the real mask, for example using a binary cross entropy loss or a Dice loss, etc. By optimizing this loss function, the model can promote the segmentation capability for the target details.
For example, during training, a joint loss function is used for model optimization, including a target classification loss (cross-entropy loss) L_cls, a bounding box regression loss L_box (smoothed L1 loss), and a Mask segmentation loss L_mask (pixel-level binary cross-entropy loss). The specific formula is as follows:
L = L_cls + L_box + L_mask (10)
L_cls = −(1/N_cls) Σ_i Σ_c y_{i,c} · log P_{i,c}
L_box = (1/N_box) Σ_i Σ_j smooth_L1(t_{i,j} − t*_{i,j})
L_mask = −(1/N_mask) Σ_i [ m_i · log m̂_i + (1 − m_i) · log(1 − m̂_i) ]
Where N_cls, N_box, N_mask are the positive sample numbers for object classification, bounding box regression, and Mask segmentation, respectively; y_{i,c} is the true class label and P_{i,c} is the model-predicted class probability; t_{i,j} and t*_{i,j} are the predicted and true bounding box coordinates, respectively; and m_i and m̂_i are the true label and the model-predicted probability for Mask segmentation, respectively.
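A sketch of the joint loss in PyTorch, using the built-in cross-entropy, smooth L1, and binary cross-entropy losses with their default mean reduction as a stand-in for the 1/N normalizations above; the tensor shapes are illustrative.

    import torch
    import torch.nn.functional as F

    def joint_loss(cls_logits, cls_targets, box_preds, box_targets, mask_logits, mask_targets):
        l_cls = F.cross_entropy(cls_logits, cls_targets)                        # L_cls: classification
        l_box = F.smooth_l1_loss(box_preds, box_targets)                        # L_box: bounding box regression
        l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)  # L_mask: per-pixel masks
        return l_cls + l_box + l_mask

    loss = joint_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)),
                      torch.randn(8, 4), torch.randn(8, 4),
                      torch.randn(8, 1, 28, 28), torch.randint(0, 2, (8, 1, 28, 28)).float())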
These loss functions are minimized by gradient descent through back propagation, and the whole training process is iterated continuously so that the model is gradually optimized on the training set while its performance is evaluated on the validation set. In this way, the model can accurately detect and segment targets in check image area detection tasks and generalizes to unseen data.
For example, in an actual training process, the following steps may be used to train the model:
data preparation: a labeled image dataset is collected, including the class of the object, bounding box location, and mask information. The data is divided into a training set and a validation set.
Model initialization: an appropriate image recognition model structure is selected and the weight parameters of the model are initialized. A pre-trained model may be used as a starting point to accelerate the training process and improve performance.
Forward propagation: the images in the training set are input into a model that outputs the predicted class probabilities, bounding box positions, and masks for the target.
Calculating loss: and calculating the value of the joint loss function according to the output of the model and the real label. This includes a weighted sum of the target classification loss, bounding box regression loss, and mask segmentation loss.
Back propagation: the gradient of the loss function to the model parameters is calculated by a back propagation algorithm.
Parameter updating: the weight parameters of the model are updated using an optimization algorithm (e.g., gradient descent, adam, etc.) to reduce the value of the loss function.
Iterative training: the steps of forward propagation, calculation of the loss, backward propagation and parameter updating are repeated until the performance of the model on the validation set reaches a satisfactory level or a preset number of training rounds.
Model evaluation: and evaluating the performance of the model on a verification set, wherein the performance comprises indexes such as classification accuracy, bounding box positioning accuracy, mask segmentation quality and the like. Further adjustments and optimizations may be made to the model based on the evaluation.
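A minimal sketch of the iterative training procedure described in the steps above (PyTorch), reusing the hypothetical joint_loss function sketched earlier; the model is assumed, purely for illustration, to return a dictionary with classification, box, and mask outputs.

    import torch

    def train(model, loader, epochs=10, lr=1e-4):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for epoch in range(epochs):
            for images, targets in loader:            # targets hold classes, boxes and masks
                outputs = model(images)               # forward propagation
                loss = joint_loss(outputs["cls"], targets["cls"],
                                  outputs["box"], targets["box"],
                                  outputs["mask"], targets["mask"])
                optimizer.zero_grad()
                loss.backward()                       # back propagation
                optimizer.step()                      # parameter update
            # a validation pass over a held-out set would normally follow each epoch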
By the method, a high-performance image recognition model can be effectively trained by combining the joint loss function of the target classification loss, the bounding box regression loss and the mask segmentation loss, and tasks such as target classification, positioning and segmentation can be completed at the same time.
According to an embodiment of the present disclosure, the image recognition method is applied to recognition of check images, for example.
In some embodiments, when the image recognition method is applied to the recognition of check images, its purpose is, for example, to automatically extract key information on the check, such as amount, date, signature, payee, payer, and so forth. Check types include, for example, bank checks of various types including transfer checks, cash checks, regular checks, and the like. The following is a specific example showing how the image recognition method can be applied to recognition of check images.
Image preprocessing: including, for example, denoising: noise and background interference in the check image are removed using filters or wavelet transforms. Binarization: the check image is converted to a black and white binary image to better identify the text and numbers. Inclination correction: if the check image is oblique, an affine transformation is used for correction.
Text detection and localization: including, for example, edge detection: edges of the letters and numbers on the check are identified using an edge detection algorithm such as Canny. Region segmentation: the check image is divided into different areas, such as an amount area, a date area, a signature area, and the like, according to the edge information.
Feature extraction and recognition: examples include OCR recognition: the text information on the check, such as the amount of money, date, etc., is extracted using Optical Character Recognition (OCR) technology. And (3) handwriting signature recognition: for signature areas, a deep learning model (e.g., convolutional neural network CNN) may be used to identify and verify handwritten signatures. Template matching: for fields of fixed format, such as bank identification codes, check numbers, etc., a template matching technique may be used for quick identification.
Information integration and verification: including, for example, logic verification: it is checked whether the extracted information accords with the logic rules of checks, such as whether the amount is legal, whether the date is reasonable, etc. Database comparison: and comparing the extracted key information with records in a database, and verifying the validity of the check.
For example, a bank needs to process a large number of check images for quick clearing and recording. By applying the image recognition method, the bank can automatically complete the following tasks:
the system automatically reads the amount on the check and compares the amount with account information in the database to ensure that the transfer amount is correct.
The date on the check is identified by OCR technology to ensure that the check has not expired.
The signature on the check is identified and verified, and is compared with a pre-stored signature sample to ensure the authenticity and legitimacy of the check.
The system can automatically input the identified information into the database to generate an electronic record, so that the subsequent inquiry and management are convenient.
By the method, the bank can greatly improve the efficiency and accuracy of check processing, reduce the errors and cost of manual operation and improve the customer service experience.
It should be noted that recognition of check images may present challenges such as blurriness of writing, background interference, diversity of handwritten signatures, etc. Thus, in practical applications, it may be desirable to combine a variety of image recognition techniques and methods, as well as a large amount of training data, to optimize the performance of the model.
Fig. 10 schematically illustrates a flowchart of an image recognition method according to a further embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 10, the image to be identified includes personal information of the user, and the image identification method of this embodiment further includes operations S1010 to S1020, for example, before acquiring the user data.
In operation S1010, the user's authorization to acquire the user's personal information is acquired.
In operation S1020, after obtaining the user's authorization to acquire the user's personal information, an image to be identified is acquired.
In some embodiments, it is ensured that compliance requirements and user privacy protection are followed when obtaining user data from multiple platforms prior to image recognition. The following is an example of how to legally and properly acquire an image to be recognized, combining "acquiring the user's authorization to acquire the user's personal information" and "acquiring the image to be identified after obtaining the user's authorization".
Acquiring the user's authorization to acquire the user's personal information includes, for example:
explicitly informing the user of personal information usage:
the purpose is as follows: ensuring that the user knows why their data is needed and how that data will be used.
Examples: in the privacy policy or user agreements of an application, it is explicitly stated that user data is intended to be obtained from multiple platforms (e.g., social media, e-commerce platforms, etc.) in order to provide personalized services. At the same time, it is explained how these data will be handled, stored and protected.
Requesting user authorization:
the purpose is as follows: in case of explicit consent of the user, rights to access and acquire their data are acquired.
Examples: an explicit authorization request is provided during registration or setup of an application. The user may choose to accept or reject. If the user chooses to accept, they will be required to log onto the corresponding platform so that the application can access the data.
Verifying user authorization:
the purpose is as follows: it is ensured that the authorization request is issued by the real user and that they do agree to share data.
Examples: upon receiving user authorization, the application may verify the identity and authorization status of the user using an associated authentication protocol. This may ensure that only authorized user data is acquired.
The acquisition of personal information of a user after being authorized by the user includes, for example:
Securely acquiring user data:
The purpose is as follows: ensuring that the best security practices are followed when the user data is acquired to protect the security and privacy of the user data.
Examples: secure API calls and encryption techniques (such as HTTPS) are used to obtain data from a user-authorized platform. At the same time, the server side of the application program is ensured to adopt proper security measures (such as firewall, data encryption and the like).
Integrating and storing user data:
The purpose is as follows: user data acquired from different platforms are integrated into a unified database for subsequent image recognition processing.
Examples: and setting a database on the server of the application program for storing the integrated user data. The data may be indexed according to user ID or other unique identifier for subsequent retrieval and use.
Compliance with data protection and privacy regulations:
The purpose is as follows: ensuring compliance with relevant data protection and privacy regulations when processing user data.
Examples: ensuring that the application complies with the requirements of the relevant regulations. This may include limiting the time of storage of the data, providing rights to access and delete user data, etc.
Through the steps, the user data can be obtained legally and in compliance, and a necessary data basis is provided for subsequent image recognition. At the same time, this also helps to maintain trust and privacy rights of the user.
Based on the image recognition method, the disclosure also provides an image recognition device. The image recognition apparatus will be described in detail below with reference to fig. 11.
Fig. 11 schematically shows a block diagram of the structure of an image recognition apparatus according to an embodiment of the present disclosure.
As shown in fig. 11, the image recognition apparatus 1100 of this embodiment includes, for example: an acquisition module 1110, an extraction module 1120, a determination module 1130, and a segmentation module 1140.
The acquiring module 1110 is configured to acquire an image to be identified. In an embodiment, the acquiring module 1110 may be configured to perform the operation S210 described above, which is not described herein.
The extraction module 1120 is configured to perform feature extraction on an image to be identified by using a pre-trained image recognition model to obtain a multi-scale feature map, a deformable convolution kernel is disposed in a feature extraction network of the image recognition model, the feature map of each scale and the pooled feature map undergo primary feature fusion, and the multi-scale feature map undergo secondary feature fusion of multiple different scales. In an embodiment, the extracting module 1120 may be used to perform the operation S220 described above, which is not described herein.
The determination module 1130 is operable to determine at least one candidate target box within the multi-scale feature map. In an embodiment, the determining module 1130 may be configured to perform the operation S230 described above, which is not described herein.
The segmentation module 1140 is configured to segment the target in the image to be identified according to the candidate target frame. In an embodiment, the segmentation module 1140 may be used to perform the operation S240 described above, which is not described herein.
Any of the acquisition module 1110, the extraction module 1120, the determination module 1130, and the segmentation module 1140 may be combined in one module to be implemented, or any of them may be split into a plurality of modules, according to embodiments of the present disclosure. Or at least some of the functionality of one or more of the modules may be combined with, and implemented in, at least some of the functionality of other modules. According to embodiments of the present disclosure, at least one of the acquisition module 1110, the extraction module 1120, the determination module 1130, and the segmentation module 1140 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable way of integrating or packaging the circuits, or in any one of or a suitable combination of any of the three. Or at least one of the acquisition module 1110, the extraction module 1120, the determination module 1130, and the segmentation module 1140 may be at least partially implemented as computer program modules that, when executed, perform the corresponding functions.
Fig. 12 schematically illustrates a block diagram of an electronic device adapted to implement an image recognition method according to an embodiment of the present disclosure.
As shown in fig. 12, an electronic device 1200 according to an embodiment of the present disclosure includes a processor 1201, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. The processor 1201 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 1201 may also include on-board memory for caching purposes. The processor 1201 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
In the RAM 1203, various programs and data required for the operation of the electronic apparatus 1200 are stored. The processor 1201, the ROM 1202, and the RAM 1203 are connected to each other through a bus 1204. The processor 1201 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1202 and/or RAM 1203. Note that the program may be stored in one or more memories other than the ROM 1202 and the RAM 1203. The processor 1201 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 1200 may also include an input/output (I/O) interface 1205, the input/output (I/O) interface 1205 also being connected to the bus 1204. The electronic device 1200 may also include one or more of the following components connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output portion 1207 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 1208 including a hard disk or the like; and a communication section 1209 including a network interface card such as a LAN card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the internet. The drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 1210 so that a computer program read out therefrom is installed into the storage section 1208 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include the ROM 1202 and/or the RAM 1203 and/or one or more memories other than the ROM 1202 and the RAM 1203 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to perform the methods provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1201. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program can also be transmitted, distributed over a network medium in the form of signals, and downloaded and installed via a communication portion 1209, and/or from a removable medium 1211. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1209, and/or installed from the removable media 1211. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1201. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
According to embodiments of the present disclosure, program code for carrying out computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, "C", or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be provided in a variety of combinations and/or combinations, even if such combinations or combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or combined without departing from the spirit and teachings of the present disclosure. All such combinations and/or combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. These examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.
Claims (15)
1. An image recognition method, the method comprising:
acquiring an image to be identified;
performing feature extraction on the image to be identified using a pre-trained image recognition model to obtain a multi-scale feature map, wherein a deformable convolution kernel is provided in a feature extraction network of the image recognition model, the feature map of each scale is subjected to a first feature fusion with a pooled feature map, and the multi-scale feature map is subjected to a plurality of further feature fusions across different scales;
determining at least one candidate target box within the multi-scale feature map; and
segmenting the target in the image to be identified according to the candidate target box.
2. The method of claim 1, wherein the performing feature extraction on the image to be identified using a pre-trained image recognition model to obtain a multi-scale feature map comprises:
acquiring an initial feature map of the image to be identified;
performing a pooling operation on the initial feature map to obtain a pooled feature map;
performing feature extraction on the initial feature map using a plurality of deformable convolution kernels to obtain a multi-scale convolution feature map;
fusing the pooled feature map with the convolution feature map of each scale, respectively, to obtain a multi-scale first fusion feature map;
performing scale transformation on the first fusion feature map of each scale to obtain a first transformation scale feature map;
re-fusing the first transformation scale feature map with the first fusion feature map of the same scale; and
repeating the scale transformation and re-fusion steps a plurality of times to obtain the multi-scale feature map.
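Claims 2 and 3 leave the concrete network topology open. Purely as an illustration, the following PyTorch sketch shows one way the claimed pipeline could be wired together, assuming torchvision's DeformConv2d for the deformable kernels, average pooling for the pooled feature map, and element-wise addition with bilinear resizing for both fusion steps; all module names, channel counts, strides, and the number of re-fusion rounds are assumptions, not taken from the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d


class DeformBlock(nn.Module):
    """One deformable-convolution branch: predicts offsets, then applies DeformConv2d."""

    def __init__(self, channels, stride):
        super().__init__()
        # 2 * 3 * 3 offset values per output location for a 3x3 kernel
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, stride=stride, padding=1)
        self.deform = DeformConv2d(channels, channels, 3, stride=stride, padding=1)

    def forward(self, x):
        return F.relu(self.deform(x, self.offset(x)))


class MultiScaleExtractor(nn.Module):
    """Sketch of the claimed pipeline: pooled map + deformable multi-scale maps,
    first fusion, then repeated scale transformation and re-fusion."""

    def __init__(self, in_channels=3, channels=64, num_refusions=2):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, channels, 3, padding=1)   # initial feature map
        self.pool = nn.AvgPool2d(2)                                  # pooled feature map
        self.branches = nn.ModuleList(
            [DeformBlock(channels, stride=s) for s in (1, 2, 4)]     # three scales
        )
        self.num_refusions = num_refusions

    def forward(self, image):
        base = F.relu(self.stem(image))
        pooled = self.pool(base)
        # first fusion: resize the pooled map to each scale and add it
        fused = []
        for branch in self.branches:
            conv_map = branch(base)
            pooled_resized = F.interpolate(
                pooled, size=conv_map.shape[-2:], mode="bilinear", align_corners=False
            )
            fused.append(conv_map + pooled_resized)
        # repeated scale transformation + re-fusion with the same-scale map
        for _ in range(self.num_refusions):
            fused = [
                f + F.interpolate(fused[(i + 1) % len(fused)], size=f.shape[-2:],
                                  mode="bilinear", align_corners=False)
                for i, f in enumerate(fused)
            ]
        return fused  # one feature map per scale


if __name__ == "__main__":
    maps = MultiScaleExtractor()(torch.randn(1, 3, 256, 256))
    print([m.shape for m in maps])
```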
3. The method of claim 2, wherein at least two of the scale transformations are of different types.
4. The method of claim 1, wherein the determining at least one candidate target box within the multi-scale feature map comprises:
determining a plurality of annotation boxes of the multi-scale feature map;
performing target selection on the multi-scale feature map using a sliding window to obtain a plurality of anchor boxes; and
determining the at least one candidate target box according to the offsets between the anchor boxes and the annotation boxes, wherein the candidate target box characterizes the classification probability and the position offset of the target.
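Claim 4 reads on a region-proposal-style step. As a non-authoritative sketch, the snippet below generates sliding-window anchor boxes over a feature-map grid and encodes the (dx, dy, dw, dh) offset of each anchor to an already-matched annotation box; the anchor sizes, aspect ratios, and the assumption that anchors are matched one-to-one to annotation boxes are simplifications made for brevity.

```python
import torch


def generate_anchors(feat_h, feat_w, stride, sizes=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Slide a window over the feature-map grid and emit (x1, y1, x2, y2) anchors in image coords."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # window centre
            for s in sizes:
                for r in ratios:
                    w, h = s * (r ** 0.5), s / (r ** 0.5)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(anchors)


def encode_offsets(anchors, gt_boxes):
    """Offset of each anchor to its matched annotation box, as (dx, dy, dw, dh).
    Assumes `anchors` and `gt_boxes` are already matched row-for-row."""
    def to_cxcywh(b):
        w, h = b[:, 2] - b[:, 0], b[:, 3] - b[:, 1]
        return b[:, 0] + w / 2, b[:, 1] + h / 2, w, h

    acx, acy, aw, ah = to_cxcywh(anchors)
    gcx, gcy, gw, gh = to_cxcywh(gt_boxes)
    return torch.stack([(gcx - acx) / aw, (gcy - acy) / ah,
                        torch.log(gw / aw), torch.log(gh / ah)], dim=1)
```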
5. The method according to claim 3, wherein the segmenting the target in the image to be identified according to the candidate target box comprises:
performing target classification on the multi-scale feature map to obtain a plurality of groups of target types;
determining, according to the target types, an initial bounding box for each type of target among the candidate target boxes; and
performing mask segmentation on the initial bounding box to determine the target in the image to be identified.
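The application does not disclose the mask head itself; a minimal Mask R-CNN-style reading of claim 5 is sketched below, where torchvision's roi_align crops each initial bounding box out of the fused feature map and a small convolutional head predicts a per-box foreground mask. The head architecture, crop size, and 0.5 threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class MaskHead(nn.Module):
    """Tiny per-box mask head: two convolutions followed by a 1-channel logit map."""

    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 1),
        )

    def forward(self, feature_map, boxes, out_size=28):
        # boxes: list with one (N, 4) tensor of (x1, y1, x2, y2) boxes per image,
        # expressed in feature-map coordinates (hence spatial_scale=1.0).
        crops = roi_align(feature_map, boxes, output_size=out_size, spatial_scale=1.0)
        return torch.sigmoid(self.net(crops)) > 0.5  # boolean mask per box
```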
6. The method of claim 5, wherein, before the performing target classification on the multi-scale feature map, the method further comprises:
unifying the sizes of the multi-scale feature maps to obtain a target feature map; and
performing target classification on the target feature map to obtain the plurality of groups of target types.
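Claim 6 does not say how the sizes are unified; one plausible, purely illustrative reading is to resize every scale to a common resolution and average the results into a single target feature map:

```python
import torch
import torch.nn.functional as F


def unify_scales(feature_maps, size=(64, 64)):
    """Resize each scale to a common spatial size and average them into one target feature map."""
    resized = [F.interpolate(m, size=size, mode="bilinear", align_corners=False)
               for m in feature_maps]
    return torch.stack(resized).mean(dim=0)
```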
7. The method of claim 6, wherein the multi-scale feature map comprises a multi-scale feature map in which the candidate target box has been identified and a multi-scale feature map in which the candidate target box has not been identified.
8. The method of claim 5, wherein, before the performing feature extraction on the image to be identified using the pre-trained image recognition model, the method further comprises:
preprocessing the image to be identified, wherein the preprocessing comprises any one or more of a wavelet transformation, an affine transformation, and histogram equalization.
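Claim 8 names the three preprocessing operations but not their parameters or order. The sketch below chains them in one hypothetical order using PyWavelets and OpenCV; the Haar wavelet, the detail-coefficient thresholding rule, and the rotation-only affine transform are assumptions, not details from the application.

```python
import cv2
import numpy as np
import pywt


def preprocess(gray: np.ndarray, angle: float = 0.0) -> np.ndarray:
    """Hypothetical chain: wavelet denoising, affine (rotation) correction, histogram equalization.
    `gray` is an 8-bit single-channel image."""
    # 1. Wavelet transformation: zero out small detail coefficients (simple denoising).
    cA, (cH, cV, cD) = pywt.dwt2(gray.astype(np.float32), "haar")
    thr = np.std(cD)
    details = tuple(np.where(np.abs(d) < thr, 0.0, d) for d in (cH, cV, cD))
    denoised = pywt.idwt2((cA, details), "haar")[: gray.shape[0], : gray.shape[1]]

    # 2. Affine transformation: rotate about the image centre (e.g. skew correction).
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(denoised, M, (w, h), flags=cv2.INTER_LINEAR)

    # 3. Histogram equalization (expects 8-bit input).
    return cv2.equalizeHist(np.clip(rotated, 0, 255).astype(np.uint8))
```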
9. The method according to any one of claims 1 to 8, wherein the training of the image recognition model comprises:
performing iterative optimization on the image recognition model using a joint loss function, wherein the joint loss function comprises a target classification loss, a bounding box regression loss, and a mask segmentation loss.
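Claim 9 specifies the three loss terms but not how they are weighted or computed. A minimal sketch of such a joint loss, assuming cross-entropy for classification, smooth L1 for box regression, and binary cross-entropy for masks (common choices, not confirmed by the application):

```python
import torch
import torch.nn.functional as F


def joint_loss(cls_logits, cls_targets, box_preds, box_targets, mask_logits, mask_targets,
               w_cls=1.0, w_box=1.0, w_mask=1.0):
    """Weighted sum of classification, bounding-box regression and mask segmentation losses.
    The weights are illustrative; the application does not disclose specific values."""
    loss_cls = F.cross_entropy(cls_logits, cls_targets)                        # target classification
    loss_box = F.smooth_l1_loss(box_preds, box_targets)                        # bounding-box regression
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)  # mask segmentation
    return w_cls * loss_cls + w_box * loss_box + w_mask * loss_mask
```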
10. The method of claim 1, wherein the method is applied to the recognition of check images.
11. The method of claim 1, wherein the image to be identified contains personal information of a user, and the method further comprises:
acquiring authorization from the user to collect the personal information of the user; and
acquiring the image to be identified after the authorization from the user to collect the personal information of the user is obtained.
12. An image recognition apparatus, the apparatus comprising:
an acquisition module configured to acquire an image to be identified;
an extraction module configured to perform feature extraction on the image to be identified using a pre-trained image recognition model to obtain a multi-scale feature map, wherein a deformable convolution kernel is provided in a feature extraction network of the image recognition model, the feature map of each scale is subjected to a first feature fusion with a pooled feature map, and the multi-scale feature map is subjected to a plurality of further feature fusions across different scales;
a determination module configured to determine at least one candidate target box within the multi-scale feature map; and
a segmentation module configured to segment the target in the image to be identified according to the candidate target box.
13. An electronic device, comprising:
one or more processors; and
storage means for storing one or more computer programs,
characterized in that the one or more processors execute the one or more computer programs to implement the steps of the method according to any one of claims 1 to 11.
14. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 11.
15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410353620.7A CN118262364A (en) | 2024-03-26 | 2024-03-26 | Image recognition method, apparatus, device, medium, and program product |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410353620.7A CN118262364A (en) | 2024-03-26 | 2024-03-26 | Image recognition method, apparatus, device, medium, and program product |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118262364A (en) | 2024-06-28
Family
ID=91610414
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410353620.7A Pending CN118262364A (en) | 2024-03-26 | 2024-03-26 | Image recognition method, apparatus, device, medium, and program product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118262364A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118628988A (en) * | 2024-08-13 | 2024-09-10 | 广东鑫光智能系统有限公司 | A data management method and system for an intelligent plate processing factory |
CN119181011A (en) * | 2024-11-26 | 2024-12-24 | 深圳江行联加智能科技有限公司 | Image recognition method, device, equipment and storage medium for self-adaptive multi-scale feature fusion |
CN119181011B (en) * | 2024-11-26 | 2025-02-28 | 深圳江行联加智能科技有限公司 | Image recognition method, device, equipment and storage medium using adaptive multi-scale feature fusion |
CN119541839A (en) * | 2025-01-22 | 2025-02-28 | 吉林大学 | Autistic children intervention treatment auxiliary system and method based on intelligent interaction |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112016438B (en) | Method and system for identifying certificate based on graph neural network | |
CN111080628B (en) | Image tampering detection method, apparatus, computer device and storage medium | |
US11144889B2 (en) | Automatic assessment of damage and repair costs in vehicles | |
US11790499B2 (en) | Certificate image extraction method and terminal device | |
CN118262364A (en) | Image recognition method, apparatus, device, medium, and program product | |
CN113159120A (en) | Contraband detection method based on multi-scale cross-image weak supervision learning | |
CN111899270A (en) | Card frame detection method, device and equipment and readable storage medium | |
CN110766007A (en) | Certificate shielding detection method, device and equipment and readable storage medium | |
CN111738055A (en) | Multi-category text detection system and bill form detection method based on the system | |
CN117558011B (en) | Image text tampering detection method based on self-consistency matrix and multi-scale loss | |
CN114444565B (en) | Image tampering detection method, terminal equipment and storage medium | |
CN111626295A (en) | Training method and device for license plate detection model | |
Shit et al. | An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection | |
CN111445058A (en) | Data analysis method, device, equipment and computer readable storage medium | |
CN116453232A (en) | Face living body detection method, training method and device of face living body detection model | |
CN119515956B (en) | Material particle size measurement method, device, equipment and storage medium | |
Rossi et al. | Neural network for denoising and reading degraded license plates | |
CN114743121A (en) | Image processing method, training method and device for image processing model | |
CN118865178B (en) | A flood extraction and location method based on deep learning and spatial information fusion | |
CN114842478A (en) | Recognition method, device, device and storage medium of text area | |
CN114155363A (en) | Converter station vehicle identification method, device, computer equipment and storage medium | |
Zeng et al. | Exposing image splicing with inconsistent sensor noise levels | |
CN118607000A (en) | A method for preventing physical leakage of sensitive data | |
Tahaoglu et al. | Robust copy-move forgery detection technique against image degradation and geometric distortion attacks | |
CN115035533B (en) | Data authentication processing method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||