
CN113887615A - Image processing method, apparatus, device and medium - Google Patents

Image processing method, apparatus, device and medium

Info

Publication number
CN113887615A
Authority
CN
China
Prior art keywords
image
feature
network
detected
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111156550.9A
Other languages
Chinese (zh)
Inventor
谌强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111156550.9A priority Critical patent/CN113887615A/en
Publication of CN113887615A publication Critical patent/CN113887615A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an image processing method, apparatus, device, and medium, relating to the field of artificial intelligence and, in particular, to the technical fields of computer vision and deep learning. The image processing method is implemented as follows: based on an image to be detected, a first image feature is obtained using the self-attention network of an image detection model; based on the image to be detected, a second image feature is obtained using the convolutional network of the image detection model; and based on a fusion feature of the first image feature and the second image feature, a detection result is obtained using the prediction network of the image detection model.

Figure 202111156550

Description

Image processing method, apparatus, device and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to computer vision and deep learning, and more specifically to an image processing method, apparatus, device, and medium.
Background
With the development of computer technology and network technology, deep learning has been widely applied in many fields. Currently, in the image field, a self-attention mechanism is often adopted in place of convolution operations in order to improve accuracy. However, because the self-attention mechanism models global information, it suffers from low processing efficiency and a slow convergence rate.
Disclosure of Invention
Based on this, the present disclosure provides an image processing method, apparatus, device, and medium that improve processing efficiency.
According to an aspect of the present disclosure, there is provided an image processing method including: based on an image to be detected, obtaining a first image feature using a self-attention network of an image detection model; based on the image to be detected, obtaining a second image feature using a convolutional network of the image detection model; and obtaining a detection result using a prediction network of the image detection model based on a fusion feature of the first image feature and the second image feature.
According to another aspect of the present disclosure, there is provided an image processing apparatus including: a first image feature obtaining module, configured to obtain a first image feature using a self-attention network of an image detection model based on an image to be detected; a second image feature obtaining module, configured to obtain a second image feature using a convolutional network of the image detection model based on the image to be detected; and a detection result obtaining module, configured to obtain a detection result using a prediction network of the image detection model based on a fusion feature of the first image feature and the second image feature.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the image processing method provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform an image processing method provided by the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the image processing method provided by the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic view of an application scenario of an image processing method and apparatus according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow diagram of an image processing method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an image processing method according to a first embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an image processing method according to a second embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an image processing method according to a third embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an image processing method according to a fourth embodiment of the present disclosure;
fig. 7 is a block diagram of the structure of an image processing apparatus according to an embodiment of the present disclosure; and
fig. 8 is a block diagram of an electronic device for implementing an image processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides an image processing method comprising a first image feature obtaining stage, a second image feature obtaining stage, and a detection stage. In the first image feature obtaining stage, a first image feature is obtained based on the image to be detected using the self-attention network of an image detection model. In the second image feature obtaining stage, a second image feature is obtained based on the image to be detected using the convolutional network of the image detection model. In the detection stage, a detection result is obtained using the prediction network of the image detection model based on a fusion feature of the first image feature and the second image feature.
An application scenario of the method and apparatus provided by the present disclosure will be described below with reference to fig. 1.
Fig. 1 is a schematic view of an application scenario of an image processing method and apparatus according to an embodiment of the present disclosure.
As shown in fig. 1, the application scenario 100 includes an electronic device 110, and the electronic device 110 may be any electronic device with processing functionality, including but not limited to a smartphone, a tablet, a laptop, a desktop computer, a server, and so on.
The electronic device 110 may, for example, detect the input image 120 to obtain a detection result 130. Detecting the image 120 may include detecting objects in the image 120, and the resulting detection result 130 varies with the prediction task. For example, if the prediction task is image classification, target detection in the image 120, or semantic segmentation, the detection result 130 may be, respectively, the predicted probability of the image 120 for each of a plurality of predetermined classes, the predicted position and probability of a target in the image 120, or the predicted position and class of each entity in the image 120.
According to an embodiment of the present disclosure, as shown in fig. 1, the application scenario 100 may further include a server 140. The electronic device 110 may be communicatively coupled to the server 140 via a network, which may include wireless or wired communication links.
Illustratively, the server 140 may be configured to train an image detection model and send the trained image detection model 150 to the electronic device 110 in response to a model acquisition request sent by the electronic device 110, so that the electronic device 110 can detect images. In an embodiment, the electronic device 110 may also send an image to the server 140 through the network, and the server detects the received image using the trained image detection model.
According to an embodiment of the disclosure, as shown in fig. 1, the application scenario 100 may further include a database 160. The database 160 may maintain a large number of images, and the images may have labels corresponding to the image detection model; for example, a label may indicate the actual class of an image, the actual position of a target object in the image, or the positions and classes of multiple entities in the image. The server 140 may access the database 160 and extract some of the images from it to train the image detection model.
Illustratively, the image detection model may, for example, be constructed based on a Transformer that uses the self-attention mechanism (Self-Attention). By using a Transformer in the image domain, an effect superior to that of a convolutional network can be obtained given a sufficient amount of data. However, a Transformer-based image detection model has a large number of parameters and a large amount of computation, and converges slowly because it models global information.
Illustratively, in an image detection model constructed based on a Transformer, the convergence rate may be increased by Neural Architecture Search (NAS), an improved patch-partitioning scheme, or an improved position-information extraction scheme. The convergence rate can also be improved by using a local self-attention mechanism (Local Self-Attention) instead of a global self-attention mechanism, or by replacing the pure self-attention mechanism with a cascade structure that mixes convolution and the global self-attention mechanism. However, these methods do not make full use of the available information, so the model accuracy is reduced to some extent. Although the cascade structure can improve the convergence speed of the model, the convolution feature and the attention feature cannot be fused efficiently in a cascade structure, so performance still degrades to a certain degree.
Based on this, the present disclosure aims to provide an image processing method capable of improving the accuracy of a model as much as possible while ensuring processing efficiency. Please refer to the following description for a specific image processing method.
It should be noted that the image processing method provided by the present disclosure may be executed by the electronic device 110 or the server 140. Accordingly, the image processing apparatus provided by the present disclosure may be provided in the electronic device 110 or the server 140.
It should be understood that the number and type of electronic devices, servers, and databases in FIG. 1 are merely illustrative. There may be any number and type of terminal devices, servers, and databases, as the implementation requires.
The image processing method provided by the present disclosure will be described in detail by the following fig. 2 to 6 in conjunction with fig. 1.
Fig. 2 is a flow chart schematic diagram of an image processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the image processing method 200 of this embodiment may include operations S210 to S230. The image processing method 200 may be implemented by relying on an image detection model, which may include a self-attention network, a convolutional network, and a prediction network. The self-attention network and the convolutional network are used to extract features of the image, while the prediction network is used to perform an image classification task, an object detection task, a semantic segmentation task, or the like based on the features of the image.
In operation S210, a first image feature is obtained using a self-attention network of an image detection model based on an image to be detected.
According to the embodiment of the present disclosure, the image to be detected may be processed in advance through an embedding network to obtain a vectorized representation of the image to be detected. The vectorized representation is then input into the self-attention network, which outputs the first image feature.
For example, the self-attention network may be the global self-attention network or the local self-attention network described above. For a self-attention network, the input information should include a Query, a Key, and a Value. In this embodiment, the vectorized representation of the image to be detected may be copied into three copies, which are used respectively as the Query (also denoted the Q feature), the Key (also denoted the K feature), and the Value (also denoted the V feature). In scenarios with high requirements on model accuracy, the self-attention network may be a global self-attention network; in scenarios with lower requirements on model accuracy, it may be a local self-attention network.
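For illustration only, a minimal sketch of the scaled dot-product self-attention described here, assuming PyTorch and a (batch, tokens, dim) layout for the vectorized representation (both are assumptions, not part of the original disclosure):

    import torch
    import torch.nn.functional as F

    def self_attention(x: torch.Tensor) -> torch.Tensor:
        # x is the vectorized representation, shape (batch, tokens, dim); it serves as Q, K, and V.
        q, k, v = x, x, x                        # three copies of the vectorized representation
        scores = q @ k.transpose(-2, -1)         # similarity between every pair of positions
        scores = scores / (q.shape[-1] ** 0.5)   # scale for a stable softmax
        weights = F.softmax(scores, dim=-1)      # normalized attention weights
        return weights @ v                       # first image feature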
In operation S220, a second image feature is obtained using a convolution network of an image detection model based on an image to be detected.
According to an embodiment of the present disclosure, the vectorized representation of the image to be detected described above may be taken as an input to a convolutional network. The vectorized representation may be processed via a convolutional network, which may output a second image feature.
For example, a convolutional network may be constructed from a plurality of sequentially connected convolutional layers, each of which may contain a plurality of convolution kernels. The early convolutional layers of the network capture local, detailed information of the image and have a small receptive field, i.e., each pixel of the output feature map only uses a small range of the input image. The receptive fields of subsequent convolutional layers grow layer by layer and are used to capture more complex and abstract information of the image. The second image feature can be obtained by processing with the plurality of convolutional layers in sequence. The number of convolutional layers and the number of convolution kernels in each layer may be set according to actual requirements, which is not limited in this disclosure.
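A hypothetical stack of convolutional layers whose receptive field grows with depth might look like the following; the layer count and channel widths are assumed for illustration:

    import torch.nn as nn

    # Illustrative stacked convolutional feature extractor; depth and widths are assumptions.
    conv_net = nn.Sequential(
        nn.Conv2d(64, 64, kernel_size=3, padding=1),    # early layer: small receptive field, local detail
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 128, kernel_size=3, padding=1),   # receptive field grows layer by layer
        nn.ReLU(inplace=True),
        nn.Conv2d(128, 128, kernel_size=3, padding=1),  # later layer: more abstract information
    )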
In operation S230, a detection result is obtained using a prediction network of an image detection model based on a fusion feature of the first image feature and the second image feature.
According to the embodiment of the disclosure, the first image feature and the second image feature may be concatenated along the channel dimension to obtain the fusion feature. The concat() function may, for example, be employed to implement the fusion of the first image feature and the second image feature.
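Assuming PyTorch tensors in NCHW layout, the channel-dimension fusion could be sketched as:

    import torch

    def fuse_features(first_feat: torch.Tensor, second_feat: torch.Tensor) -> torch.Tensor:
        # Concatenate the two branch features along the channel dimension (dim=1 in NCHW layout).
        return torch.cat([first_feat, second_feat], dim=1)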
According to embodiments of the present disclosure, the prediction network may be determined, for example, according to the prediction task. For example, if the prediction task is image classification, target detection, or semantic segmentation, the detection result may be, respectively, the predicted probability of the image for each of a plurality of predetermined classes, the predicted position and probability of a target in the image, or the predicted position and class of each entity in the image. Similarly, the architecture of the prediction network may vary with the prediction task. For image classification, the prediction network may be a fully connected layer. For target detection and semantic segmentation, the prediction network includes two branches: one is a classifier used to predict the probability of a target in the image, and the other is a normalized fully connected layer used to predict the position of the bounding box of the target in the image.
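The two-branch prediction head below is one plausible realization, sketched under assumed feature dimensions and class counts; it is not the patent's definitive architecture:

    import torch.nn as nn

    class DetectionHead(nn.Module):
        # Hypothetical two-branch prediction network: class probabilities plus bounding-box regression.
        def __init__(self, feat_dim: int = 256, num_classes: int = 80):
            super().__init__()
            self.classifier = nn.Linear(feat_dim, num_classes)  # probability of each target class
            self.box_regressor = nn.Linear(feat_dim, 4)         # normalized bounding-box coordinates

        def forward(self, fused_feature):
            return self.classifier(fused_feature), self.box_regressor(fused_feature).sigmoid()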
It should be noted that, in the method of the embodiment of the present disclosure, operation S210 and operation S220 may be executed in parallel. The first image feature and the second image feature may nevertheless be obtained at different times, given the difference in computation between the self-attention network and the convolutional network. After both the first image feature and the second image feature are obtained, operation S230 is performed.
In the image processing method described above, the self-attention mechanism and the convolutional network process the image to be detected as two parallel branches, the two resulting image features are fused, and the prediction task is then executed, so that the advantages of self-attention models and convolution models in the related art can be combined. Local features and global features can thus be integrated, multi-scale features can be extracted, the available information can be fully used, and the accuracy of the detection result can be improved without increasing the amount of computation.
Fig. 3 is a schematic diagram of an image processing method according to a first embodiment of the present disclosure.
According to embodiments of the present disclosure, a lightweight convolutional network may be selected as the specific architecture of the convolutional network described above. For example, the convolutional network may include a depth-wise convolution network. In this way, the image detection model has stronger expressive power with fewer parameters and less computation. Specifically, compared with the related art in which the model has only a self-attention network, the parallel design of the depth separation convolution network and the self-attention network can further improve the accuracy of the detection result without increasing the amount of computation.
Illustratively, the aforementioned convolutional network may comprise a depth separation convolution sub-network. When the convolutional network is used to obtain the second image feature, the depth separation convolution sub-network may first be used to obtain a local feature of the image to be detected based on the image to be detected, and the second image feature is then determined based on the local feature. Specifically, the vectorized representation obtained by passing the image to be processed through the embedding network is input into the depth separation convolution sub-network, which outputs the local feature. The local feature may then be mapped to a second image feature of the same size as the first image feature.
For example, the depth separation convolution sub-network may use a 3 x 3 architecture. Each convolution kernel in the depth separation convolution sub-network is responsible for one channel, and each channel is convolved by only one convolution kernel, so the number of channels of the local feature output by this sub-network is the same as the number of channels of the vectorized representation input to it.
In an embodiment, the convolutional network may further include a point-wise (point separation) convolution sub-network. After the depth separation convolution sub-network outputs the local feature, the local feature may be input into the point separation convolution sub-network to obtain a channel-fused local feature. Providing the point separation convolution sub-network makes it possible to effectively use the feature information of different channels at the same spatial position, so the accuracy of the obtained second image feature can be improved to some extent. For example, the point separation convolution sub-network may use a 1 x 1 architecture.
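A minimal sketch of the depth separation (3x3 depthwise) and point separation (1x1 pointwise) convolutions just described, assuming PyTorch; the channel count is illustrative:

    import torch.nn as nn

    class DepthwisePointwiseBlock(nn.Module):
        def __init__(self, channels: int = 64):
            super().__init__()
            # groups=channels: each input channel is convolved by exactly one kernel, so the
            # local feature keeps the same number of channels as the input representation.
            self.depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
            # The 1x1 convolution mixes information across channels at the same spatial position.
            self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

        def forward(self, x):
            local = self.depthwise(x)      # local feature
            return self.pointwise(local)   # channel-fused local feature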
Specifically, as shown in fig. 3, in an embodiment 300, the convolutional network includes a depth separation convolution sub-network 321 and a point separation convolution sub-network 322, and a local self-attention network 330 is selected as the self-attention network, so as to reduce the number of parameters and the amount of computation involved in the processing while ensuring processing accuracy.
In the specific flow of the image processing method of this embodiment 300, the image to be processed 301 may first be input into the embedding network 310, and the vectorized representation 302 of the image to be processed is obtained through the embedding network 310. Convolution processing and self-attention processing may then be performed in parallel. In the convolution processing, the vectorized representation 302 is input into the depth separation convolution sub-network 321 and the point separation convolution sub-network 322 to obtain the second image feature. In the self-attention processing, the input features for the self-attention network may be determined first; the input features are then input into the local self-attention network 330, which outputs the first image feature 306 after processing. After the first image feature 306 and the second image feature 307 are obtained, the concat() function may be used to fuse the two image features to obtain a fused feature, which is input into the prediction network 340. The fused feature may be processed by the prediction network 340 to obtain the detection result 308.
The input features of the self-attention network may include a Q feature, a K feature, and a V feature. In this embodiment, the vectorized representation 302 of the image 301 to be processed may be copied in triplicate to serve as the Q feature 303, the K feature 304, and the V feature 305, respectively. Alternatively, the vectorized representation 302 of the image 301 to be processed may be linearly transformed to obtain the Q feature 303, the K feature 304, and the V feature 305. The linear transformations used to obtain the Q feature 303, the K feature 304, and the V feature 305 may have the same or different parameters, which is not limited in this disclosure. For example, the vectorized representation 302 of the image 301 to be processed may be multiplied by the pre-trained weight matrices W_Q, W_K, and W_V, respectively, to obtain the Q feature 303, the K feature 304, and the V feature 305.
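Sketched below, under the assumption that the projections W_Q, W_K, and W_V are realized as learned linear layers in PyTorch:

    import torch.nn as nn

    dim = 256  # embedding dimension (assumed)
    w_q, w_k, w_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def make_qkv(x):
        # x: (batch, tokens, dim) vectorized representation; returns the Q, K, and V features.
        return w_q(x), w_k(x), w_v(x)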
Fig. 4 is a schematic diagram of an image processing method according to a second embodiment of the present disclosure.
According to embodiments of the present disclosure, the local feature output by the depth separation convolution sub-network may be taken into account when determining the first image feature, since local features facilitate self-attention learning. In this way, the accuracy of the obtained first image feature can be improved to some extent.
Illustratively, as shown in fig. 4, in embodiment 400, the convolutional network comprises a depth separation convolution sub-network 421 and a point separation convolution sub-network 422, a local self-attention network 430 is selected as the self-attention network, and the local self-attention network 430 may comprise an attention layer 431 and a fusion layer 432.
When the embodiment 400 executes the image processing method, similarly to the foregoing embodiment, the image to be processed 401 may be processed through the embedding network 410 to obtain the vectorized representation 402 of the image to be processed 401. The vectorized representation 402 is then input into the depth separation convolution sub-network 421 to obtain the local feature. Meanwhile, the Q feature 403, the K feature 404, and the V feature 405 may be derived from the vectorized representation 402. The Q feature 403 and the K feature 404 may then be input into the attention layer 431. The attention layer 431 may dot-multiply the Q feature 403 and the K feature 404 to obtain the similarity relationship between them, and may also normalize the similarity relationship using, for example, a softmax function. Meanwhile, the local feature and the V feature may be fused to obtain a fused V feature. The fused V feature, serving as the Value, and the normalized similarity relationship are then input into the fusion layer 432, and the first image feature 406 is obtained after processing by the fusion layer 432. The local feature is also input into the point separation convolution sub-network 422, which may output the second image feature 407. Finally, the concat() function is used to fuse the first image feature 406 and the second image feature 407 to obtain a fused feature, which is input into the prediction network 440 and processed by it to output the detection result 408.
When fusing the local feature and the V feature, the local feature may be processed by using a 1 × 1 convolution to project the local feature to the dimension of the V feature, so as to obtain the projected local feature. The projected local features are then added to the V-feature using a fusion network 450 as shown in FIG. 4 to obtain a fused V-feature.
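One possible reading of this fusion step, assuming an NCHW local feature and a token-shaped V feature; the flattening and tensor shapes are assumptions:

    import torch.nn as nn

    class VFusion(nn.Module):
        # Project the local feature with a 1x1 convolution to the V-feature dimension, then add.
        def __init__(self, local_channels: int = 64, v_dim: int = 256):
            super().__init__()
            self.project = nn.Conv2d(local_channels, v_dim, kernel_size=1)

        def forward(self, local_feat, v_feat):
            # local_feat: (B, C, H, W) from the depth separation convolution sub-network
            # v_feat:     (B, H*W, v_dim) token-style V feature (assumed layout)
            projected = self.project(local_feat)              # (B, v_dim, H, W)
            projected = projected.flatten(2).transpose(1, 2)  # (B, H*W, v_dim)
            return v_feat + projected                         # fused V feature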
The fusion layer 432 may, for example, multiply the fused V feature and the feature output by the attention layer 431 to obtain the first image feature 406. The multiplication may include, for example, converting the feature output by the attention layer 431 and the fused V feature so that the two have the same size, and then performing a dot-product operation on them.
Fig. 5 is a schematic diagram of an image processing method according to a third embodiment of the present disclosure.
According to the embodiment of the present disclosure, after the first image feature is obtained, it may be further fused with the feature output by the point separation convolution sub-network, and the fused feature may be used as the second image feature. The reason is that the image features obtained from the self-attention network generally cover a larger range than the local features obtained from the convolutional network, so fusing the first image feature with the output of the point separation convolution sub-network helps the convolutional network learn larger-range features. In this way, the expressive power of the obtained second image feature can be improved to some extent, and thus the accuracy of the detection result can be improved.
Illustratively, as shown in FIG. 5, in embodiment 500, the convolutional network includes a depth separation convolution sub-network 521, a point separation convolution sub-network 522, and a fusion sub-network 523.
When the embodiment 500 executes the image processing method, similarly to the embodiment described above, the image to be processed 501 may be processed through the embedding network 510 to obtain the vectorized representation 502 of the image to be processed 501. The vectorized representation 502 is then input into the depth separation convolution sub-network 521 to obtain the local feature, the local feature is input into the point separation convolution sub-network 522, and the feature output by the point separation convolution sub-network 522 is used as the channel-fused local feature. At the same time, the Q feature 503, the K feature 504, and the V feature 505 may be derived from the vectorized representation 502 and input into the local self-attention network 530, which outputs the first image feature 506 after processing. The first image feature 506 may then be fused with the fused local feature to obtain the second image feature 507. Finally, the concat() function is used to fuse the first image feature 506 and the second image feature 507 to obtain a fused feature, which is input into the prediction network 540 and processed by it to output the detection result 508.
When the first image feature 506 is fused with the fused local feature, the first image feature may be mapped, via two sequentially connected 1 x 1 convolutions, to a scale factor of matching dimension. The scale factor and the fused local feature are then input into the fusion sub-network 523 and processed by it to obtain the second image feature 507. The fusion sub-network 523 may, for example, multiply the scale factor and the fused local feature to obtain the second image feature 507. The multiplication may include, for example, converting the scale factor and the fused local feature so that the two have the same size, and then performing a dot-product operation on them.
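A hedged sketch of this scaling step, assuming NCHW tensors and a sigmoid squashing that the disclosure does not specify:

    import torch.nn as nn

    class ScaleFusion(nn.Module):
        # Map the first image feature to a scale factor via two 1x1 convolutions, then
        # multiply it element-wise with the channel-fused local feature.
        def __init__(self, channels: int = 64):
            super().__init__()
            self.to_scale = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=1),
                nn.ReLU(inplace=True),   # intermediate non-linearity (assumed)
                nn.Conv2d(channels, channels, kernel_size=1),
                nn.Sigmoid(),            # squash to a scale factor (assumed)
            )

        def forward(self, first_image_feat, fused_local_feat):
            scale = self.to_scale(first_image_feat)  # (B, C, H, W) scale factor
            return scale * fused_local_feat          # second image feature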
Fig. 6 is a schematic diagram of an image processing method according to a fourth embodiment of the present disclosure.
According to the embodiment of the disclosure, before the vectorized representation of the image to be detected is input into the self-attention network and/or the convolutional network, the vectorized representation may, for example, be linearly transformed so that the transformed features better meet the processing requirements of the self-attention network and/or the convolutional network. This further improves the expressive power of the extracted image features and thereby the detection accuracy.
In an embodiment, when determining the input features from the attention network, first linear transformation may be performed on a feature map of an image to be detected (i.e., a vectorized representation of the image to be detected) to obtain a first linear feature. Input features from the attention network are then determined based on the first linear feature. For example, the input features may include a Q feature, a K feature, and a V feature. For example, the Q-feature, K-feature, and V-feature may be obtained in any of the various manners described above. The first linear transformation process may specifically multiply the feature map of the image to be detected with a pre-trained first matrix, thereby obtaining a first linear feature. It should be noted that the first linear feature is named only to distinguish different features, and does not limit the dimension of the feature.
In an embodiment, when the local features of the image to be detected are obtained by using the depth separation convolution sub-network, the second linear transformation may be performed on the feature map of the image to be detected to obtain the second linear features. The second linear feature is then input into a deep separation convolution sub-network to obtain a local feature. The second linear transformation process may specifically multiply the feature map of the image to be detected with a pre-trained second matrix, thereby obtaining a second linear feature.
It should be noted that the second matrix here and the first matrix above are usually two different matrices, so that the obtained first linear feature and second linear feature respectively meet the requirements of the self-attention network and the convolutional network.
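A minimal sketch of the two branch-specific linear transformations, with the matrix shapes assumed:

    import torch.nn as nn

    dim = 256  # feature-map embedding dimension (assumed)
    first_transform = nn.Linear(dim, dim)   # first matrix, feeds the self-attention branch
    second_transform = nn.Linear(dim, dim)  # second matrix, feeds the convolution branch

    def split_branches(feature_map):
        # feature_map: (batch, tokens, dim); the two transforms use different learned weights.
        return first_transform(feature_map), second_transform(feature_map)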
As shown in fig. 6, in one embodiment 600, the convolutional network may include a depth separation convolution sub-network 621, a point separation convolution sub-network 622, and a fusion sub-network 623. The self-attention network may include an attention layer 631 and a fusion layer 632. After the image to be processed is input into the embedding network to obtain the feature map 601, a first linear transformation and a second linear transformation may be performed on the feature map 601 in parallel, thereby obtaining a first linear feature 602 and a second linear feature 603. The first linear feature 602 is multiplied by the pre-trained weight matrices W_Q, W_K, and W_V, respectively, to obtain the Q feature 604, the K feature 605, and the V feature 606; the Q feature 604 and the K feature 605 are then input into the attention layer 631 to obtain the similarity relationship between them. Meanwhile, the second linear feature 603 is input into the depth separation convolution sub-network 621 to obtain the local feature. The local feature and the V feature 606 are fused using the fusion network 650 to obtain the fused V feature. The fused V feature and the similarity relationship may then be input into the fusion layer 632 and processed by it to obtain the first image feature 607. After the local feature is obtained, it is also input into the point separation convolution sub-network 622 to obtain the fused local feature. The fused local feature and the first image feature 607 are then input into the fusion sub-network 623 and processed by it to obtain the second image feature 608. Finally, the concat() function is used to fuse the first image feature 607 and the second image feature 608 to obtain a fused feature, which is input into the prediction network 640 and processed by it to output the detection result 609.
Based on the image processing method provided by the disclosure, the disclosure also provides an image processing device. The apparatus will be described in detail below with reference to fig. 7.
Fig. 7 is a block diagram of the structure of an image processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the image processing apparatus 700 of this embodiment may include a first image feature obtaining module 710, a second image feature obtaining module 720, and a detection result obtaining module 730.
The first image feature obtaining module 710 is configured to obtain a first image feature based on the image to be detected by using a self-attention network of an image detection model. In an embodiment, the first image feature obtaining module 710 may be configured to perform the operation S210 described above, which is not described herein again.
The second image feature obtaining module 720 is configured to obtain a second image feature by using a convolution network of an image detection model based on the image to be detected. In an embodiment, the second image feature obtaining module 720 may be configured to perform the operation S220 described above, which is not described herein again.
The detection result obtaining module 730 is configured to obtain a detection result by using a prediction network of an image detection model based on a fusion feature of the first image feature and the second image feature. In an embodiment, the detection result obtaining module 730 may be configured to perform the operation S230 described above, which is not described herein again.
According to an embodiment of the present disclosure, the above-described convolution network includes a deep separation convolution sub-network. The second image feature obtaining module 720 may include a local feature obtaining sub-module and an image feature determining sub-module. The local feature obtaining submodule is used for obtaining the local features of the image to be detected by adopting a depth separation convolution sub-network based on the image to be detected. The image feature determination submodule is used for determining a second image feature based on the local feature.
According to an embodiment of the present disclosure, a self-attention network is constructed based on a local self-attention mechanism. The first image feature obtaining module 710 may include an input feature determination sub-module and an image feature obtaining sub-module. The input feature determination submodule is used for determining the input features of the self-attention network based on the image to be detected. The image feature obtaining sub-module is used for inputting the input features into the attention network to obtain first image features.
According to an embodiment of the present disclosure, the input feature determination submodule may include a first feature determination unit and a feature fusion unit. The first feature determination unit is used for determining the Q feature, the K feature, and the V feature for the self-attention network based on the image to be detected. The feature fusion unit is used for fusing the V feature and the local feature to obtain a fused V feature. The input features of the self-attention network include the Q feature, the K feature, and the fused V feature.
According to an embodiment of the present disclosure, the input characteristic determination sub-module may include a first linear transformation unit and a second characteristic determination unit. The first linear transformation unit is used for performing first linear transformation on the feature map of the image to be detected to obtain a first linear feature. The second feature determination unit is configured to determine an input feature from the attention network based on the first linear feature.
According to an embodiment of the present disclosure, the convolutional network further comprises a point separation convolution sub-network. The image feature determination submodule may include a channel fusion unit and a third feature determination unit. The channel fusion unit is used for inputting the local feature into the point separation convolution sub-network to obtain the channel-fused local feature. The third feature determination unit is used for fusing the first image feature and the fused local feature to obtain the second image feature.
According to an embodiment of the present disclosure, the local feature obtaining sub-module may include a second linear transformation unit and a local feature obtaining unit. The second linear transformation unit is used for performing second linear transformation on the feature map of the image to be detected to obtain second linear features. The local feature obtaining unit is used for inputting the second linear feature into the depth separation convolution sub-network to obtain a local feature.
In the technical solution of the present disclosure, the acquisition, collection, storage, use, processing, transmission, provision, disclosure, and other handling of users' personal information all comply with relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 shows a schematic block diagram of an example electronic device 800 that may be used to implement the image processing methods of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. Various programs and data required for the operation of the device 800 can also be stored in the RAM 803. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 executes the respective methods and processes described above, such as the image processing method. For example, in some embodiments, the image processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the image processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server may be a cloud Server, which is also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service extensibility in a traditional physical host and a VPS service ("Virtual Private Server", or simply "VPS"). The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. An image processing method comprising:
based on an image to be detected, obtaining a first image characteristic by adopting a self-attention network of an image detection model;
based on the image to be detected, adopting a convolution network of the image detection model to obtain a second image characteristic; and
obtaining a detection result by adopting a prediction network of the image detection model based on the fusion characteristic of the first image characteristic and the second image characteristic.
2. The method of claim 1, wherein the convolutional network comprises a deep separation convolutional subnetwork; the obtaining of the second image characteristics by adopting the convolution network of the image detection model based on the image to be detected comprises:
based on the image to be detected, adopting the depth separation convolution sub-network to obtain local features of the image to be detected; and
determining the second image feature based on the local feature.
3. The method of claim 2, wherein the self-attention network is constructed based on a local self-attention mechanism; the obtaining of the first image feature by using the self-attention network of the image detection model based on the image to be detected comprises:
determining the input characteristics of the self-attention network based on the image to be detected; and
inputting the input features into the self-attention network to obtain the first image feature.
4. The method of claim 3, wherein the determining input features of the self-attention network based on the image to be detected comprises:
determining Q characteristics, K characteristics and V characteristics aiming at the self-attention network based on the image to be detected; and
fusing the V feature and the local feature to obtain a fused V feature,
wherein the input features include the Q feature, the K feature, and the fused V feature.
5. The method of claim 3, wherein determining input features of the self-attention network based on the image to be detected comprises:
performing first linear transformation on the feature map of the image to be detected to obtain first linear features; and
determining an input feature of the self-attention network based on the first linear feature.
6. The method of claim 2, wherein the convolutional network further comprises a point-separated convolutional subnetwork; the determining, based on the local feature, the second image feature comprises:
inputting the local features into the point separation convolution sub-network to obtain fused local features of channel fusion; and
fusing the first image feature and the fused local feature to obtain the second image feature.
7. The method of claim 2, wherein the obtaining the local features of the image to be detected by using the depth separation convolution sub-network based on the image to be detected comprises:
performing second linear transformation on the feature map of the image to be detected to obtain second linear features; and
inputting the second linear feature into the deep separation convolution sub-network to obtain the local feature.
8. An image processing apparatus comprising:
the first image characteristic obtaining module is used for obtaining first image characteristics by adopting a self-attention network of an image detection model based on an image to be detected;
the second image characteristic obtaining module is used for obtaining second image characteristics by adopting a convolution network of the image detection model based on the image to be detected; and
a detection result obtaining module, used for obtaining a detection result by adopting a prediction network of the image detection model based on the fusion characteristic of the first image characteristic and the second image characteristic.
9. The apparatus of claim 8, wherein the convolutional network comprises a deep separation convolutional subnetwork; the second image feature obtaining module includes:
the local feature obtaining submodule is used for obtaining the local features of the image to be detected by adopting the depth separation convolution sub-network based on the image to be detected; and
an image feature determination sub-module for determining the second image feature based on the local feature.
10. The apparatus of claim 9, wherein the self-attention network is constructed based on a local self-attention mechanism; the first image feature obtaining module includes:
an input feature determination sub-module for determining the input features of the self-attention network based on the image to be detected; and
an image feature obtaining sub-module for inputting the input features into the self-attention network to obtain the first image feature.
11. The apparatus of claim 10, wherein the input feature determination submodule comprises:
a first feature determination unit for determining, based on the image to be detected, a Q feature, a K feature and a V feature for the self-attention network; and
a feature fusion unit for fusing the V feature and the local feature to obtain a fused V feature,
wherein the input features include the Q feature, the K feature, and the fused V feature.
12. The apparatus of claim 10, wherein the input feature determination submodule comprises:
a first linear transformation unit for performing a first linear transformation on the feature map of the image to be detected to obtain a first linear feature; and
a second feature determination unit for determining the input features of the self-attention network based on the first linear feature.
13. The apparatus of claim 9, wherein the convolutional network further comprises a point separation convolution sub-network; and the image feature determination sub-module includes:
a channel fusion unit for inputting the local feature into the point separation convolution sub-network to obtain a fused local feature in which channels are fused; and
a third feature determination unit for fusing the first image feature and the fused local feature to obtain the second image feature.
14. The apparatus of claim 9, wherein the local feature obtaining sub-module comprises:
a second linear transformation unit for performing a second linear transformation on the feature map of the image to be detected to obtain a second linear feature; and
a local feature obtaining unit for inputting the second linear feature into the depth separation convolution sub-network to obtain the local feature.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 7.
CN202111156550.9A 2021-09-29 2021-09-29 Image processing method, apparatus, device and medium Pending CN113887615A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111156550.9A CN113887615A (en) 2021-09-29 2021-09-29 Image processing method, apparatus, device and medium


Publications (1)

Publication Number Publication Date
CN113887615A (en) 2022-01-04

Family

ID=79004521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111156550.9A Pending CN113887615A (en) 2021-09-29 2021-09-29 Image processing method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN113887615A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019020075A1 (en) * 2017-07-28 2019-01-31 北京市商汤科技开发有限公司 Image processing method, device, storage medium, computer program, and electronic device
US20200125925A1 (en) * 2018-10-18 2020-04-23 Deepnorth Inc. Foreground Attentive Feature Learning for Person Re-Identification
WO2020199931A1 (en) * 2019-04-02 2020-10-08 腾讯科技(深圳)有限公司 Face key point detection method and apparatus, and storage medium and electronic device
CN110348320A (en) * 2019-06-18 2019-10-18 武汉大学 A kind of face method for anti-counterfeit based on the fusion of more Damage degrees
CN110956122A (en) * 2019-11-27 2020-04-03 深圳市商汤科技有限公司 Image processing method and device, processor, electronic device and storage medium
CN111815564A (en) * 2020-06-09 2020-10-23 浙江华睿科技有限公司 Method and device for detecting silk ingots and silk ingot sorting system
CN111860398A (en) * 2020-07-28 2020-10-30 河北师范大学 Remote sensing image target detection method, system and terminal device
CN112348036A (en) * 2020-11-26 2021-02-09 北京工业大学 Adaptive Object Detection Method Based on Lightweight Residual Learning and Deconvolution Cascade
CN112396035A (en) * 2020-12-07 2021-02-23 国网电子商务有限公司 Object detection method and device based on attention detection model
CN113011309A (en) * 2021-03-15 2021-06-22 北京百度网讯科技有限公司 Image recognition method, apparatus, device, medium, and program product
CN112967264A (en) * 2021-03-19 2021-06-15 深圳市商汤科技有限公司 Defect detection method and device, electronic equipment and storage medium
CN112699859A (en) * 2021-03-24 2021-04-23 华南理工大学 Target detection method, device, storage medium and terminal
CN113012155A (en) * 2021-05-07 2021-06-22 刘慧烨 Bone segmentation method in hip image, electronic device, and storage medium
CN113240004A (en) * 2021-05-11 2021-08-10 北京达佳互联信息技术有限公司 Video information determination method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GAO Hongxia, GAO Wei: "Facial expression recognition fusing key point attributes and attention representation", Computer Engineering and Applications, vol. 59, no. 3, 28 September 2021 (2021-09-28), pages 118 - 126 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782756A (en) * 2022-06-20 2022-07-22 深圳新视智科技术有限公司 Defect detection method, device and equipment based on feature fusion and storage medium
CN115115836A (en) * 2022-06-29 2022-09-27 抖音视界(北京)有限公司 Image recognition method, image recognition device, storage medium and electronic equipment
CN115115836B (en) * 2022-06-29 2023-06-13 抖音视界有限公司 Image recognition method, device, storage medium and electronic equipment
CN115331048A (en) * 2022-07-29 2022-11-11 北京百度网讯科技有限公司 Image classification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112966522B (en) Image classification method and device, electronic equipment and storage medium
CN113033537B (en) Method, apparatus, device, medium and program product for training a model
CN114612759B (en) Video processing method, video query method, model training method and model training device
CN113222916A (en) Method, apparatus, device and medium for detecting image using target detection model
CN113887615A (en) Image processing method, apparatus, device and medium
CN114677565B (en) Training method and image processing method and device for feature extraction network
CN113902010A (en) Training method of classification model and image classification method, apparatus, equipment and medium
CN114429633B (en) Text recognition method, training method and device of model, electronic equipment and medium
CN110633594A (en) Target detection method and device
CN113627298B (en) Training method of target detection model and method and device for detecting target object
CN113643260A (en) Method, apparatus, apparatus, medium and product for detecting image quality
CN114360027A (en) A training method, device and electronic device for feature extraction network
CN114218931A (en) Information extraction method and device, electronic equipment and readable storage medium
CN113989568A (en) Target detection method, training method, device, electronic device and storage medium
CN114419327B (en) Image detection method and training method and device of image detection model
CN113642654B (en) Image feature fusion method and device, electronic equipment and storage medium
CN113361522B (en) Method and device for determining character sequence and electronic equipment
CN114973333A (en) Human interaction detection method, human interaction detection device, human interaction detection equipment and storage medium
CN113887423A (en) Target detection method, target detection device, electronic equipment and storage medium
CN114707591A (en) Data processing method and training method and device of data processing model
CN114842541A (en) Model training and face recognition method, device, equipment and storage medium
CN114707638A (en) Model training, object recognition method and device, equipment, medium and product
CN113806541A (en) Emotion classification method and emotion classification model training method and device
CN113239899A (en) Method for processing image and generating convolution kernel, road side equipment and cloud control platform
CN114495236B (en) Image segmentation method, apparatus, device, medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20250318