
CN113449586B - Target detection method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN113449586B
CN113449586B
Authority
CN
China
Prior art keywords
target
target detection
image
detected
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110387750.9A
Other languages
Chinese (zh)
Other versions
CN113449586A (en)
Inventor
张少林
宁欣
田伟娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wave Kingdom Co ltd
Beijing Wave Wisdom Security And Safety Technology Co ltd
Original Assignee
Shenzhen Wave Kingdom Co ltd
Beijing Wave Wisdom Security And Safety Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wave Kingdom Co ltd, Beijing Wave Wisdom Security And Safety Technology Co ltd filed Critical Shenzhen Wave Kingdom Co ltd
Priority to CN202110387750.9A
Publication of CN113449586A
Application granted
Publication of CN113449586B
Status: Active


Classifications

    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 Neural network architectures: combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Neural network learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract


The present application relates to a target detection method, device, computer equipment and storage medium. The method comprises: obtaining an image to be detected; inputting the image to be detected into a trained target detection model, wherein the target detection model comprises a preprocessing unit, a feature extraction unit and a prediction unit; extracting a first feature map corresponding to the image to be detected by the preprocessing unit to obtain a first low-dimensional feature map corresponding to the first feature map; performing feature extraction on the first low-dimensional feature map by the feature extraction unit to obtain target capsule information corresponding to the image to be detected; performing target detection on the target capsule information by the prediction unit to obtain a target detection result corresponding to the image to be detected. The use of this method can reduce the computational complexity and memory complexity when performing target detection on smaller target objects or partially occluded target objects.

Description

Target detection method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a target detection method, apparatus, computer device, and storage medium.
Background
Target detection refers to detecting the target objects in an image and predicting the position and class of each target object. As an important branch of computer vision and digital image processing, target detection is widely applied in fields such as robot navigation, intelligent video surveillance, industrial inspection, and aerospace; by using computer vision to reduce the consumption of human resources, it has important practical significance. Target detection is also a fundamental algorithm in the field of identity recognition and plays a vital role in subsequent tasks such as face recognition, gait recognition, crowd counting, and instance segmentation. With the wide application of deep learning, target detection technology has developed rapidly. In the conventional target detection method, target detection is achieved by extracting a feature map corresponding to the target object in an image; for example, target detection is performed by the target detector DETR (Detection Transformer, target detection based on set prediction).
However, in the conventional manner a small target object is detected on a high-resolution feature map, which results in high computational complexity. Therefore, how to reduce the computational complexity of detecting smaller target objects in the target detection process has become a technical problem that needs to be solved.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a target detection method, apparatus, computer device, and storage medium that can reduce the computational complexity of a smaller target object in the target detection process.
A method of target detection, the method comprising:
Acquiring an image to be detected;
Inputting the image to be detected into a trained target detection model, wherein the target detection model comprises a preprocessing unit, a feature extraction unit and a prediction unit;
Extracting a first feature map corresponding to the image to be detected through the preprocessing unit to obtain a first low-dimensional feature map corresponding to the first feature map;
performing feature extraction on the first low-dimensional feature map through the feature extraction unit to obtain target capsule information corresponding to the image to be detected;
and carrying out target detection on the target capsule information through the prediction unit to obtain a target detection result corresponding to the image to be detected.
In one embodiment, extracting, by the preprocessing unit, a first feature map corresponding to the image to be detected includes:
And extracting features of the image to be detected through the convolutional neural network in the preprocessing unit, and determining the feature maps output by the last two convolutional layers of the convolutional neural network as the first feature maps corresponding to the image to be detected.
In one embodiment, the performing the attention-based pooling processing on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map includes:
Performing multi-head attention calculation on the first feature map to obtain a multi-head attention value corresponding to the first feature map;
and normalizing the multi-head attention value to obtain a first low-dimensional feature map corresponding to the first feature map.
In one embodiment, the feature extraction unit includes an encoding unit and a decoding unit, and the feature extracting, by the feature extraction unit, the first low-dimensional feature map includes:
Global feature extraction is carried out on the first low-dimensional feature map through the coding unit, global feature information is obtained, and capsule conversion is carried out on the global feature information, so that initial capsule information is obtained;
Inputting the initial capsule information into the decoding unit, extracting category characteristics of the initial capsule information to obtain category characteristic information, and performing capsule conversion on the category characteristic information to obtain target capsule information.
In one embodiment, the performing, by the prediction unit, the target detection on the target capsule information, and obtaining a target detection result corresponding to the image to be detected includes:
Performing target detection on the target capsule information based on the attention route through the prediction unit to obtain a first detection result;
performing linear transformation on the target capsule information through the prediction unit to obtain a second detection result;
And fusing the first detection result and the second detection result to obtain a target detection result corresponding to the image to be detected.
In one embodiment, before the capturing the image to be detected, the method further includes:
Acquiring a sample image set;
Inputting the sample image set into a target detection model to be trained, extracting a second feature map corresponding to the sample image set through a preprocessing unit in the target detection model to be trained, and performing attention-based pooling processing on the second feature map to obtain a second low-dimensional feature map corresponding to the second feature map;
Performing feature extraction on the second low-dimensional feature map through a feature extraction unit in the target detection model to be trained to obtain target capsule information corresponding to the sample image set;
performing target detection on target capsule information corresponding to the sample image set through a prediction unit in the target detection model to be trained to obtain a target detection result corresponding to the sample image set;
And calculating a loss value of the target detection model to be trained according to a target detection result corresponding to the sample image set, and updating network parameters of the target detection model to be trained according to the loss value until a preset condition is met, so as to obtain the trained target detection model.
In one embodiment, the sample image set is marked with target label information, and the calculating the loss value of the target detection model to be trained according to the target detection result corresponding to the sample image set includes:
performing binary matching on the target detection result corresponding to the sample image set and the target label information to obtain a matching result;
and calculating a loss value of the target detection model to be trained according to the matching result.
In one embodiment, the loss values of the target detection model to be trained include a target position offset loss value, a classification loss value, and a matching loss value.
An object detection apparatus, the apparatus comprising:
the image acquisition module is used for acquiring an image to be detected;
The feature extraction module is used for inputting the image to be detected into a trained target detection model, wherein the target detection model comprises a preprocessing unit, a feature extraction unit and a prediction unit; extracting a first feature map corresponding to the image to be detected through the preprocessing unit to obtain a first low-dimensional feature map corresponding to the first feature map; and performing feature extraction on the first low-dimensional feature map through the feature extraction unit to obtain target capsule information corresponding to the image to be detected;
and the target detection module is used for carrying out target detection on the target capsule information through the prediction unit to obtain a target detection result corresponding to the image to be detected.
A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements the steps of the method embodiments described above when executing the computer program.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the various method embodiments described above.
According to the target detection method, apparatus, computer device, and storage medium, the first feature map corresponding to the image to be detected is extracted by the preprocessing unit of the trained target detection model, and attention-based pooling is performed on the first feature map to obtain the first low-dimensional feature map corresponding to the first feature map. The feature extraction unit then performs feature extraction on the first low-dimensional feature map to obtain the target capsule information corresponding to the image to be detected, and the prediction unit performs target detection on the target capsule information to obtain the target detection result corresponding to the image to be detected. By performing attention-based pooling on the first feature map, irrelevant information in the first feature map can be removed so that only information related to target detection is attended to, which reduces the computational complexity; the dimensionality reduction of the first feature map likewise reduces the memory complexity. When target detection is performed on a smaller target object or a partially occluded target object, both the computational complexity and the memory complexity can therefore be reduced.
Drawings
FIG. 1 is a diagram of an application environment for a target detection method in one embodiment;
FIG. 2 is a flow chart of a method of detecting targets in one embodiment;
FIG. 3 is a flowchart illustrating steps of performing attention-based pooling on a first feature map to obtain a first low-dimensional feature map corresponding to the first feature map in one embodiment;
FIG. 4 is a flowchart illustrating a step of extracting features of the first low-dimensional feature map by the feature extraction unit to obtain target capsule information corresponding to the image to be detected in one embodiment;
FIG. 5 is a flowchart of a step of performing target detection on target capsule information by a prediction unit to obtain a target detection result corresponding to an image to be detected in one embodiment;
FIG. 6 is a block diagram of an object detection device in one embodiment;
Fig. 7 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The target detection method provided by the application can be applied to computer equipment, and the computer equipment can be a terminal or a server. It can be understood that the target detection method provided by the application can be applied to a terminal, a server and a system comprising the terminal and the server, and is realized through interaction of the terminal and the server.
The target detection method provided by the application can reduce the computational complexity of detecting smaller target objects and is suitable for many target detection application scenarios. For example, in a face recognition scenario the method can improve face detection precision and reduce the misjudgment rate; in a vehicle detection scenario it can more accurately identify the vehicles in a monitoring image.
The target detection method provided by the application can be applied to the application environment shown in FIG. 1, wherein the terminal 102 communicates with the server 104 via a network. When target detection is required, the server 104 acquires the image to be detected sent by the terminal 102 and inputs it into a trained target detection model comprising a preprocessing unit, a feature extraction unit, and a prediction unit. The preprocessing unit extracts a first feature map corresponding to the image to be detected and performs attention-based pooling on it to obtain a first low-dimensional feature map corresponding to the first feature map; the feature extraction unit performs feature extraction on the first low-dimensional feature map to obtain target capsule information corresponding to the image to be detected; and the prediction unit performs target detection on the target capsule information to obtain a target detection result corresponding to the image to be detected. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.
In one embodiment, as shown in fig. 2, a target detection method is provided. The method is described by taking its application to a server as an example, and includes the following steps:
step 202, an image to be detected is acquired.
In step 204, the image to be detected is input into a trained target detection model, and the target detection model includes a preprocessing unit, a feature extraction unit and a prediction unit.
The image to be detected refers to an image on which target detection is required. Target detection refers to detecting the target objects in an image and predicting the position and class of each target object; the target object, such as a face, a vehicle, or a building, is determined according to the actual application scenario.
The server can acquire a target detection request sent by the terminal and parse the target detection request to obtain the image to be detected. The image to be detected pre-stored in the terminal can be an image, acquired by an image sensor, that includes the target object; the position, size, and acquisition angle of the target object in the image to be detected can be arbitrary. For example, the target object may occupy only a small portion of the image to be detected; due to the influence of the acquisition angle, the target object may be tilted in the image, or its size ratio may be offset from the real one, for example the two parallel sides of a rectangular target object may have different lengths in the image to be detected while the other two sides may be non-parallel.
In one embodiment, the target object in the image to be detected may have a frame of a regular shape, the target object may have a fixed number of vertices, and the vertices are connected to form the frame of the target object. For example, the frame of the target object may be square, rectangular, etc. with four vertices.
After acquiring the image to be detected, the server invokes a pre-stored trained target detection model, which is obtained by training on a sample image set marked with target class labels. The target detection model comprises a preprocessing unit, a feature extraction unit, and a prediction unit, and can effectively reduce the computational complexity of detecting a smaller target object.
And 206, extracting a first feature map corresponding to the image to be detected through a preprocessing unit, and carrying out attention-based pooling processing on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map.
The first feature map may be used to detect smaller target objects or partially occluded target objects in the image. The first low-dimensional feature map refers to a low-dimensional feature representation corresponding to the first feature map.
The preprocessing unit in the target detection model is used for obtaining the first low-dimensional feature map corresponding to the first feature map. Specifically, the preprocessing unit extracts the first feature map corresponding to the image to be detected and performs attention-based pooling on the first feature map to obtain the corresponding first low-dimensional feature map. The first feature map may include two feature maps of different resolutions, the higher-resolution one being used to detect smaller or partially occluded target objects in the image. The first feature map includes the global image information of the image to be detected.
In order to reduce the computational complexity and memory complexity on the higher-resolution feature map, that is, the complexity of detecting a smaller or partially occluded object during target detection, the server performs attention-based pooling on the first feature map before performing feature extraction on it. Attention-based pooling removes the irrelevant points with smaller response values in the first feature map; in other words, the first feature map is sparsified so that only information related to target detection needs to be extracted, which reduces the amount of computation and thereby the computational complexity and memory complexity.
And step 208, performing feature extraction on the first low-dimensional feature map through a feature extraction unit to obtain target capsule information corresponding to the image to be detected.
The target capsule information refers to capsule representation of the target object in the image to be detected, namely, the characteristic information of the target object is represented by the capsule.
The feature extraction unit in the target detection model is configured to extract target capsule information of a target object in an image to be detected, where the target capsule information may include a plurality of capsule vectors, each capsule vector being configured to represent feature information of a corresponding target object, each capsule vector including a plurality of dimensions, each dimension being configured to represent pose information of one local feature of the corresponding target object. Therefore, the local characteristic information of the target object in the image to be detected can be accurately embodied through the target capsule information, so that the target object in the image to be detected is represented through the local characteristic information.
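To make the capsule representation concrete, the following minimal sketch (in PyTorch) shows how a capsule vector's module length can be read as an existence probability while its components encode pose information. The squash nonlinearity is the one commonly used in the capsule-network literature, not something this patent specifies, so treat it as an illustrative assumption:

    import torch

    def squash(v: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
        # Rescale a capsule vector so its norm lies in (0, 1): the norm can then
        # be read as the existence probability of the feature the capsule
        # represents, while the vector's direction/components encode pose.
        sq_norm = (v ** 2).sum(dim=dim, keepdim=True)
        return (sq_norm / (1.0 + sq_norm)) * v / torch.sqrt(sq_norm + eps)

    # Hypothetical example: 100 candidate objects, each a 16-dimensional capsule.
    capsules = squash(torch.randn(100, 16))
    existence_prob = capsules.norm(dim=-1)   # one value in (0, 1) per capsule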
And step 210, performing target detection on the target capsule information through the prediction unit to obtain a target detection result corresponding to the image to be detected.
After extracting target capsule information corresponding to the image to be detected, the feature extraction unit takes the target capsule information as input of the prediction unit, and performs target detection according to the target capsule information through the prediction unit so as to predict the position and the corresponding type of the target object in the image to be detected. The position of the target object refers to a frame corresponding to the target object.
Based on principles of brain cognition, in the cognitive process the brain first performs saliency attention driven by external stimuli; this requires no active intervention and is a bottom-up screening process, i.e., bottom-up inference should be executed first. The corresponding lower-layer information is then screened according to the upper-layer target information, i.e., top-down information transfer. Therefore, in this embodiment the prediction unit may separately process the target capsule information in a capsule-passing manner and in a fully-connected manner. The capsule-passing manner is bottom-up and uses the local feature information of the target object to predict its position and class; the fully-connected manner is top-down information transfer and uses the whole information of the target object to predict its position and class. By combining the two manners, the local feature information and the whole information of the target object are fully utilized, effectively improving the accuracy of target detection.
The conventional target detector DETR (Detection Transformer, target detection based on set prediction) has the defects of slow convergence and high computational complexity; in particular, when target detection is performed on a smaller target object, detection must be performed on a high-resolution feature map, so the computational complexity and memory complexity are high. In this embodiment, the preprocessing unit of the trained target detection model extracts the first feature map corresponding to the image to be detected and performs attention-based pooling on the first feature map to obtain the corresponding first low-dimensional feature map. The feature extraction unit then performs feature extraction on the first low-dimensional feature map to obtain the target capsule information corresponding to the image to be detected, and the prediction unit performs target detection on the target capsule information to obtain the target detection result corresponding to the image to be detected. By performing attention-based pooling on the first feature map, irrelevant information in the first feature map can be removed and only information related to target detection attended to, which reduces the computational complexity; the dimensionality reduction of the first feature map likewise reduces the memory complexity. When target detection is performed on a smaller target object or a partially occluded target object, both the computational complexity and the memory complexity can thus be reduced.
In one embodiment, extracting the first feature map corresponding to the image to be detected through the preprocessing unit comprises the steps of extracting features of the image to be detected through a convolutional neural network in the preprocessing unit, and determining the feature maps output by the last two convolutional layers of the convolutional neural network as the first feature map corresponding to the image to be detected.
A convolutional neural network (Convolutional Neural Network, CNN for short) may be included in the preprocessing unit of the trained object detection model. The convolutional neural network is used for extracting the first feature maps corresponding to the image to be detected, so that the subsequent feature extraction unit can extract the target capsule information corresponding to the image to be detected. The convolutional neural network may include multiple network layers, such as an input layer, multiple convolutional layers, a pooling layer, a fully-connected layer, and the like. The feature maps output by the last two convolutional layers can be determined to be the first feature maps corresponding to the image to be detected. The resolution of the output feature maps gradually decreases as processing proceeds through the convolutional layers; thus, the resolution of the feature map output by the penultimate convolutional layer is higher than that of the feature map output by the last convolutional layer. For ease of description, the penultimate convolutional layer may be referred to as layer a-1, and its output feature map as F_{a-1}, of size [bs, d_{a-1}, h_{a-1}, w_{a-1}], where bs represents the batch size of F_{a-1} (the number of samples in the batch), d_{a-1} the feature dimension of F_{a-1}, h_{a-1} its height, and w_{a-1} its width. The last convolutional layer is called layer a, and its output feature map F_a, of size [bs, d_a, h_a, w_a], with bs, d_a, h_a, and w_a defined analogously. The sizes of F_{a-1} and F_a satisfy h_{a-1} > h_a and w_{a-1} > w_a.
Further, since the feature extraction unit uses no recurrent or convolutional structure, in order for the feature extraction unit to utilize the sequential information of the image to be detected, it is necessary to introduce information expressing the absolute or relative position of each element in the image to be detected. For example, the first feature map may be position-encoded using a convolutional neural network, and the encoded first feature map then subjected to attention-based pooling. Position encoding means encoding the positions of the elements included in the first feature map.
In this embodiment, the first feature map includes not only the feature map output by the last convolution layer, but also the feature map output by the penultimate convolution layer with higher resolution, so that the target detection on the smaller target object can be realized.
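As a rough sketch of this preprocessing step, the two feature maps can be taken from the last two stages of an off-the-shelf backbone. This is a minimal sketch assuming a torchvision ResNet-50; the patent does not name a specific CNN, so the stage names, input size, and shapes below are illustrative assumptions:

    import torch
    from torchvision.models import resnet50
    from torchvision.models.feature_extraction import create_feature_extractor

    # Treat the outputs of the last two stages as F_{a-1} (higher resolution)
    # and F_a (lower resolution); "layer3"/"layer4" are ResNet-50 stage names.
    backbone = create_feature_extractor(
        resnet50(weights=None),
        return_nodes={"layer3": "f_a_minus_1", "layer4": "f_a"},
    )
    images = torch.randn(2, 3, 512, 512)             # [bs, 3, H, W]
    feats = backbone(images)
    f_hi, f_lo = feats["f_a_minus_1"], feats["f_a"]  # [2,1024,32,32], [2,2048,16,16]
    assert f_hi.shape[2] > f_lo.shape[2] and f_hi.shape[3] > f_lo.shape[3]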
In one embodiment, as shown in fig. 3, the step of performing attention-based pooling processing on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map includes:
Step 302, performing multi-head attention calculation on the first feature map to obtain multi-head attention values corresponding to the first feature map.
And step 304, carrying out normalization processing on the multi-head attention values to obtain a first low-dimensional feature map corresponding to the first feature map.
The preprocessing unit in the trained target detection model may include a convolutional neural network and a pooling unit, where the convolutional neural network is used to extract the first feature maps corresponding to the image to be detected; the first feature maps comprise the feature maps output by the last two convolutional layers and may be denoted as the layer a-1 output feature map F_{a-1} and the layer a output feature map F_a. The pooling unit is used for performing attention-based pooling on the feature map F_{a-1} and the feature map F_a respectively. Attention-based pooling first performs multi-head attention calculation on the feature map and then normalizes the multi-head attention values. The pooling unit may perform attention-based pooling on F_{a-1} first, and then on F_a.
Taking attention-based pooling on the feature map F_{a-1} as an example: specifically, the pooling unit may use a multi-head attention mechanism to perform multi-head attention calculation on F_{a-1}, obtaining the multi-head attention values corresponding to the first feature map, and then normalize the multi-head attention values to obtain the first low-dimensional feature map corresponding to the first feature map. The attention-based pooling process PMA(Z) can be expressed by the following formula:
PMA(Z) = LayerNorm(S + Multihead(S, Z, Z)) (1)
where S represents the first low-dimensional feature map, Z represents the key and value vectors (i.e., the feature map F_{a-1}), Multihead(S, Z, Z) represents the multi-head attention values, LayerNorm represents the normalization process, and 1/√dim is the scale factor used in the attention calculation, dim being the feature dimension of F_{a-1}.
It can be appreciated that attention-based pooling is performed on the feature map F_a in the same manner, obtaining the first low-dimensional feature map corresponding to F_a. This yields the first low-dimensional feature map corresponding to F_{a-1} and the first low-dimensional feature map corresponding to F_a.
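Formula (1) can be read as a pooling-by-multi-head-attention block in which a small set of learned seed vectors S attends over the flattened feature map Z. A minimal PyTorch sketch under that reading follows; the seed count, feature dimension, and head count are illustrative assumptions, and nn.MultiheadAttention applies the 1/√dim scaling internally:

    import torch
    import torch.nn as nn

    class AttentionPooling(nn.Module):
        # PMA(Z) = LayerNorm(S + Multihead(S, Z, Z)): learned seeds S attend
        # over Z, so the output length is num_seeds regardless of resolution.
        def __init__(self, dim: int, num_seeds: int, num_heads: int = 8):
            super().__init__()
            self.seeds = nn.Parameter(torch.randn(num_seeds, dim))
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, z: torch.Tensor) -> torch.Tensor:
            # z: [bs, h*w, dim] -- the flattened, position-encoded feature map
            s = self.seeds.unsqueeze(0).expand(z.size(0), -1, -1)
            attn_out, _ = self.attn(s, z, z)       # Multihead(S, Z, Z)
            return self.norm(s + attn_out)         # LayerNorm(S + ...)

    # Hypothetical usage: pool a 64x64 map of 256-d features down to 100 vectors.
    pma = AttentionPooling(dim=256, num_seeds=100)
    pooled = pma(torch.randn(2, 64 * 64, 256))     # -> [2, 100, 256]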
In this embodiment, by performing multi-head attention calculation and normalization processing on the first feature map, only information related to target detection needs to be extracted, so that the calculation amount is effectively reduced, and the calculation consumption caused by attention calculation in a subsequent feature extraction unit is reduced, thereby effectively reducing the calculation complexity and the memory complexity.
In one embodiment, as shown in fig. 4, the step of extracting features of the first low-dimensional feature map by using a feature extraction unit to obtain the target capsule information corresponding to the image to be detected includes:
Step 402, global feature extraction is performed on the first low-dimensional feature map through the coding unit, global feature information is obtained, and capsule conversion is performed on the global feature information, so that initial capsule information is obtained.
Step 404, inputting the initial capsule information into a decoding unit, extracting category characteristics of the initial capsule information to obtain category characteristic information, and performing capsule conversion on the category characteristic information to obtain target capsule information.
The feature extraction unit may be a Transformer network based on capsule representation. The feature extraction unit includes an encoding unit and a decoding unit. The encoding unit and the decoding unit each include a capsule conversion unit for converting information into capsule form.
The coding unit in the feature extraction unit is used for extracting the global feature information in the first low-dimensional feature map, such as color features, texture features, and shape features; the capsule conversion unit in the coding unit then performs capsule conversion on the global feature information to obtain the initial capsule information. The initial capsule information refers to the capsule representation corresponding to the global feature information. Performing capsule conversion on the global feature information means gathering the feature information of the same target object together to generate a capsule. A capsule is embodied in the form of a capsule vector, so a capsule vector includes a plurality of local features of the target object. The module length of the capsule vector corresponding to each capsule represents the existence probability of each local feature in the target object, and the dimensions of the capsule vector represent the pose information corresponding to each local feature in the target object. The first low-dimensional feature map includes the first low-dimensional feature map corresponding to F_{a-1} and the one corresponding to F_a; therefore, the initial capsule information includes an initial capsule representation corresponding to F_{a-1} and one corresponding to F_a. The initial capsule representation corresponding to F_{a-1} is P_{a-1}, of size [bs, num_{a-1}, d_{a-1}/num_{a-1}, s_{a-1}], where bs represents the batch size (the number of samples in the batch), num_{a-1} the number of capsules, and d_{a-1}/num_{a-1} the dimension of each capsule vector in the initial capsule representation corresponding to F_{a-1}. The initial capsule representation corresponding to F_a is P_a, of size [bs, num_a, d_a/num_a, s_a], defined analogously. By performing capsule conversion on the global feature information, feature information of the same instance type in the global feature information can be grouped into one class, such as the eyes, mouth, and nose of the same target object.
The decoding unit in the feature extraction unit is used for extracting the category feature information in the initial capsule information and the boundary information of the target object, and the capsule conversion unit in the decoding unit performs capsule conversion on the extracted category feature information and boundary information, thereby obtaining the target capsule information. The target capsule information includes the target capsule representation O_{a-1} corresponding to P_{a-1} and the target capsule representation O_a corresponding to P_a. The target capsule representation O_{a-1} can be of size [100, bs, num_{a-1}, d_{a-1}/num_{a-1}], where 100 represents the number of capsules to be detected in the image to be detected; the target capsule representation O_a can be of size [100, bs, num_a, d_a/num_a]. Performing capsule conversion on the extracted category feature information and boundary information clusters the feature information and corresponding boundary information of the same target object together; therefore, the target capsule information includes, for each target object, its local feature information, feature information, and corresponding boundary information.
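A minimal sketch of the capsule conversion step as described above: a d-dimensional feature vector is split into num capsules of d/num dimensions each and squashed. The split-then-squash scheme mirrors the [bs, num, d/num] shapes in the text, but the exact conversion used by the patent is not spelled out, so this is an assumption:

    import torch

    def squash(v, dim=-1, eps=1e-8):
        n2 = (v ** 2).sum(dim=dim, keepdim=True)
        return (n2 / (1.0 + n2)) * v / torch.sqrt(n2 + eps)

    def to_capsules(features: torch.Tensor, num_capsules: int) -> torch.Tensor:
        # Group feature information into capsules: reshape [bs, seq, d] into
        # [bs, seq, num_capsules, d // num_capsules], one capsule vector per
        # group of channels, then squash each capsule vector.
        bs, seq, d = features.shape
        caps = features.view(bs, seq, num_capsules, d // num_capsules)
        return squash(caps)

    # Hypothetical usage: 256-d encoder outputs -> 8 capsules of 32 dims each.
    p = to_capsules(torch.randn(2, 100, 256), num_capsules=8)  # [2, 100, 8, 32]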
In this embodiment, by adding the capsule conversion process in the encoding unit and the decoding unit, different gestures of the target object can be accurately identified, so that the target object can be represented by the local feature information, the local representation of the target object is more accurate, and the accuracy of target detection is improved.
In one embodiment, as shown in fig. 5, the step of obtaining the target detection result corresponding to the image to be detected by performing target detection on the target capsule information through the prediction unit includes:
in step 502, target detection is performed on target capsule information based on the attention route through a prediction unit, and a first detection result is obtained.
And 504, performing linear transformation on the target capsule information through a prediction unit to obtain a second detection result.
And step 506, fusing the first detection result and the second detection result to obtain a target detection result corresponding to the image to be detected.
Since the target capsule information includes the target capsule representation O_{a-1} corresponding to P_{a-1} and the target capsule representation O_a corresponding to P_a, the prediction unit may perform target detection on O_{a-1} and O_a respectively, where target detection refers to identifying the target objects in the image to be detected according to the target capsule information and predicting the position and class of each target object. The target detection process of the prediction unit comprises two detection manners: one realizes target detection from local feature information in a capsule-passing manner, and the other realizes target detection from whole information in a fully-connected manner. The prediction unit can process the target capsule information through both manners at the same time and fuse their results, so that the accuracy of target detection is higher. The first detection result comprises a first detection result corresponding to O_{a-1} and a first detection result corresponding to O_a; the second detection result comprises a second detection result corresponding to O_{a-1} and a second detection result corresponding to O_a.
Taking target detection on the target capsule representation O_{a-1} as an example: when target detection is performed by capsule passing, the prediction unit may perform target detection on O_{a-1} and O_a respectively based on a bottom-up attention routing algorithm. The bottom-up attention routing algorithm is a routing algorithm based on a multi-head attention mechanism; it needs to obtain the probability of assigning a lower-layer capsule to an upper-layer capsule, such as the probability of assigning lower-layer capsules like eyes, nose, and mouth to the face of the target object. Here the lower-layer capsules may be the capsules in O_{a-1}, and the upper-layer capsules may be the capsules in the first detection result after target detection. The algorithm uses the number of capsules corresponding to O_{a-1} as the number of heads of the multi-head attention mechanism and, along the capsule-number dimension of O_{a-1}, uses multi-head attention to calculate the correlation between each affine-transformed upper-layer capsule and the lower-layer capsules, thereby realizing information transfer between capsules and obtaining the first detection result corresponding to O_{a-1}; the calculation formula can be shown as (2). The first detection result may include a plurality of capsules and the class and position of the target object corresponding to each capsule; the number of capsules equals the number of capsules to be detected in the target capsule information. The fully-connected manner is top-down information transfer: when target detection is performed in this manner, the prediction unit applies a linear transformation to O_{a-1} and determines the class and position of the target object from the whole information of the target object, obtaining the second detection result corresponding to O_{a-1}.
Through the above target detection manners, the first and second detection results corresponding to O_{a-1} and O_a can be obtained by prediction. The first detection result is predicted from the local feature information of the target object and the second detection result from its whole information; the prediction unit therefore fuses the first detection result and the second detection result, so that the local feature information and the whole information of the target object are fully utilized, effectively improving the accuracy of the target detection result.
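The two-branch prediction head can be sketched as follows. This is a schematic reading: the attention-routing branch of formula (2) is reduced to a single multi-head self-attention step, and the fusion is a simple average; both simplifications are assumptions, as are the class count and query count:

    import torch
    import torch.nn as nn

    class PredictionHead(nn.Module):
        def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
            super().__init__()
            # Branch 1 (bottom-up): attention routing between capsules.
            self.routing = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.cls_route = nn.Linear(dim, num_classes + 1)  # +1: "no object"
            self.box_route = nn.Linear(dim, 4)                # (cx, cy, w, h)
            # Branch 2 (top-down): plain linear transformation of the capsules.
            self.cls_fc = nn.Linear(dim, num_classes + 1)
            self.box_fc = nn.Linear(dim, 4)

        def forward(self, o: torch.Tensor):
            # o: [bs, num_queries, dim] -- flattened target capsule information
            routed, _ = self.routing(o, o, o)
            cls1, box1 = self.cls_route(routed), self.box_route(routed)  # first result
            cls2, box2 = self.cls_fc(o), self.box_fc(o)                  # second result
            # Fuse the first and second detection results (simple averaging).
            return (cls1 + cls2) / 2, torch.sigmoid((box1 + box2) / 2)

    head = PredictionHead(dim=256, num_classes=80)
    logits, boxes = head(torch.randn(2, 100, 256))  # [2,100,81], [2,100,4]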
In one embodiment, before the image to be detected is acquired, the method further includes: acquiring a sample image set; inputting the sample image set into a target detection model to be trained; extracting a second feature map corresponding to the sample image set through a preprocessing unit in the target detection model to be trained, and performing attention-based pooling on the second feature map to obtain a second low-dimensional feature map corresponding to the second feature map; performing feature extraction on the second low-dimensional feature map through a feature extraction unit in the target detection model to be trained to obtain target capsule information corresponding to the sample image set; performing target detection on the target capsule information corresponding to the sample image set through a prediction unit in the target detection model to be trained to obtain a target detection result corresponding to the sample image set; and calculating a loss value of the target detection model to be trained according to the target detection result corresponding to the sample image set, and updating the network parameters of the target detection model to be trained according to the loss value until a preset condition is met, so as to obtain the trained target detection model.
The sample image set refers to the training samples used to train the target detection model and may include a plurality of sample images, where a plurality refers to two or more. The sample image set may be selected according to the application scenario; for example, in a vehicle detection scenario the sample images in the sample image set may include vehicles, pedestrians, and the like. In one embodiment, the sample image set may be stored in a database or in a terminal, so that when the target detection model is trained, the corresponding sample image set is obtained from the database or the terminal.
The target detection manner of the target detection model is the same in the training process and in the application process. The target detection model to be trained comprises a preprocessing unit, a feature extraction unit, and a prediction unit. The preprocessing unit in the target detection model to be trained is used for extracting the second feature maps corresponding to the sample images in the sample image set, where the second feature maps refer to the feature maps output by the last two convolutional layers of the convolutional neural network in the preprocessing unit when it performs feature extraction on the sample images. The preprocessing unit performs attention-based pooling on the second feature map to obtain the second low-dimensional feature map corresponding to the second feature map, where the second low-dimensional feature map refers to the low-dimensional feature representation corresponding to the second feature map. The second low-dimensional feature map is taken as input of the feature extraction unit for feature extraction, obtaining the target capsule information corresponding to the sample image set; this target capsule information includes the local feature information of the target objects in the sample image set. Target detection is then performed on the target capsule information corresponding to the sample image set through the prediction unit in the target detection model to be trained. Specifically, the prediction unit can separately process the target capsule information in a capsule-passing manner and in a fully-connected manner: capsule passing infers bottom-up and utilizes the local feature information of the target object, while the fully-connected manner realizes top-down information transfer and utilizes the whole information of the target object. By combining the two manners, the local feature information and the whole information of the target object are fully utilized, effectively improving the accuracy of the target detection result.
Further, when the prediction unit in the target detection model to be trained performs target detection in the capsule-passing manner, it can perform target detection on the target capsule information corresponding to the sample image set using the bottom-up attention routing algorithm to obtain the corresponding detection result. Specifically, the sample image set may include the number of categories, and the prediction unit may use the bottom-up attention routing algorithm to expand the target capsule information according to the number of categories, so as to ensure that the dimension of the output capsules corresponds to the number of categories, that is, that the detection result covers target objects of each category. The expansion may add one dimension on top of the original dimensions of the target capsule information and replicate the first four dimensions according to the number of categories. The number of capsules corresponding to the target capsule information is used as the number of heads of the multi-head attention mechanism, and, along the capsule-number dimension of the target capsule information, multi-head attention is used to calculate the correlation between each affine-transformed output capsule and each capsule in the target capsule information, thereby realizing information transfer between capsules and obtaining the corresponding detection result.
After the target detection result is obtained, the loss value of the target detection model to be trained can be calculated according to the target detection result corresponding to the sample image set. The loss value is a parameter for evaluating the prediction effect of the model, and the smaller the network loss value is, the better the prediction effect of the model is. Correspondingly, the loss value of the target detection model to be trained is used for evaluating one parameter of the target detection effect of the target detection network, and the smaller the loss value is, the better the target detection effect is.
And updating the network parameters of the target detection model to be trained according to the loss value and a preset network parameter update manner to obtain an updated target detection model. After each update, it is judged whether the updated target detection model to be trained meets the preset condition. If so, model training is stopped, and the updated target detection model to be trained is taken as the trained target detection model. If the preset condition is not met, the process returns to the step of inputting the sample image set into the target detection model to be trained and training continues until the preset condition is met. The preset network parameter update manner can be any error-correction algorithm such as gradient descent or the back-propagation algorithm, for example the Adam (Adaptive Moment Estimation) algorithm. The preset condition may be that the network loss value reaches a loss threshold, or that the number of iterations reaches an iteration-count threshold, which is not limited herein.
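The update procedure reads as a standard supervised training loop; a compressed sketch follows (the learning rate, epoch count, loss threshold, and the detection_loss helper are placeholders rather than values from the patent):

    import torch

    def train(model, loader, detection_loss, epochs=50, loss_threshold=0.05):
        # Adam is one of the update rules the text names; lr is an assumption.
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)
        for epoch in range(epochs):                  # iteration-count condition
            for images, targets in loader:
                logits, boxes = model(images)
                loss = detection_loss(logits, boxes, targets)
                opt.zero_grad()
                loss.backward()                      # back-propagation
                opt.step()                           # network parameter update
            if loss.item() < loss_threshold:         # loss-threshold condition
                return model
        return model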
In this embodiment, a preprocessing unit of the target detection model to be trained extracts a second feature map corresponding to the sample image set, and performs attention-based pooling processing on the second feature map to obtain a second low-dimensional feature map corresponding to the second feature map. And the second low-dimensional feature map is subjected to feature extraction through a feature extraction unit to obtain target capsule information corresponding to the sample image set, and target detection is performed on the target capsule information through a prediction unit to obtain a target detection result corresponding to the sample image set. By carrying out attention-based pooling processing on the second feature map, effective information can be extracted, meanwhile, the calculation complexity and the memory consumption are reduced, the training time is greatly shortened, and the model convergence speed is accelerated. The trained target detection model can reduce the computational complexity and the memory complexity when detecting the target of a smaller target object or a target object which is partially blocked.
In one embodiment, the sample image set is marked with target label information, and calculating the loss value of the target detection model to be trained according to the target detection result corresponding to the sample image set comprises the steps of performing binary matching on the target detection result corresponding to the sample image set and the target label information to obtain a matching result, and calculating the loss value of the target detection model to be trained according to the matching result.
The sample image set is marked with target label information, and the target label information comprises category labels of target objects in each sample image in the sample image set and frames corresponding to the target objects.
Specifically, the Hungarian algorithm can be adopted to perform binary matching between the target detection result corresponding to the sample image set and the target label information; the Hungarian algorithm guarantees a unique match between the target detection result and the target label information, and combining the prediction unit with the Hungarian algorithm allows a plurality of target objects to be predicted in parallel. The loss value of the target detection model to be trained is then calculated according to the matching result. In one embodiment, the loss values of the target detection model to be trained include a target position offset loss value, a classification loss value, and a matching loss value. The target position offset loss value refers to the loss of position fitting between the frame of the target object in the target detection result and the frame of the corresponding target object in the target label information, and is used to improve the accuracy of target object frame detection; it may be an IoU (intersection over union) loss. The classification loss value is the class loss value, for which the common cross-entropy loss can be adopted, realizing the multi-class classification of the target detection model and directly outputting the class of the target object. The matching loss value is used to realize the unique matching between the frame of the target object in the target detection result and the frame of the corresponding target object in the target label information; it is obtained by measuring the distance between matching results and is used to improve the matching accuracy between the target detection result and the target label information.
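The binary (bipartite) matching step can be sketched with the Hungarian algorithm as implemented in SciPy. The cost terms and their equal weighting are illustrative assumptions; the patent names classification, target-position-offset/IoU, and matching losses without fixing the exact cost:

    import torch
    from scipy.optimize import linear_sum_assignment

    @torch.no_grad()
    def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
        # pred_logits: [Q, C+1], pred_boxes: [Q, 4],
        # gt_labels: LongTensor [G], gt_boxes: [G, 4].
        # Build a [Q, G] cost from class probability and box distance, then
        # solve for the unique one-to-one assignment (Hungarian algorithm).
        prob = pred_logits.softmax(-1)
        cost_cls = -prob[:, gt_labels]                     # [Q, G]
        cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # [Q, G] L1 distance
        cost = cost_cls + cost_box
        rows, cols = linear_sum_assignment(cost.cpu().numpy())
        return rows, cols  # matched query indices, matched ground-truth indices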
It should be understood that, although the steps in the flowcharts of fig. 2 to 5 are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 2-5 may include multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, nor does the order in which the sub-steps or stages are performed necessarily occur sequentially, but may be performed alternately or alternately with at least a portion of the sub-steps or stages of other steps or other steps.
In one embodiment, as shown in FIG. 6, there is provided an object detection apparatus comprising an image acquisition module 602, a preprocessing module 604, a feature extraction module 606, and a target detection module 608, wherein:
the image acquisition module 602 is configured to acquire an image to be detected.
The preprocessing module 604 is configured to input the image to be detected into a trained target detection model, where the target detection model includes a preprocessing unit, a feature extraction unit, and a prediction unit, to extract a first feature map corresponding to the image to be detected through the preprocessing unit, and to perform attention-based pooling on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map.
The feature extraction module 606 is configured to perform feature extraction on the first low-dimensional feature map through the feature extraction unit to obtain target capsule information corresponding to the image to be detected.
The target detection module 608 is configured to perform target detection on the target capsule information through the prediction unit, so as to obtain a target detection result corresponding to the image to be detected.
In one embodiment, the preprocessing module 604 is further configured to perform feature extraction on the image to be detected through a convolutional neural network in the preprocessing unit, and to determine the feature maps output by the last two convolutional layers of the convolutional neural network as the first feature map corresponding to the image to be detected.
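As an illustration of this step, the sketch below uses a standard torchvision ResNet-50 (recent torchvision API) as the convolutional neural network, interpreting the "last two convolutional layers" as the last two residual stages; both choices are assumptions, since the patent does not name a specific backbone.

```python
# Minimal sketch: expose the feature maps of the last two stages of a
# conventional CNN backbone.
import torch
from torchvision.models import resnet50

class LastTwoStages(torch.nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)
        # Everything up to (and including) the second residual stage.
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu,
                                        net.maxpool, net.layer1, net.layer2)
        self.stage3 = net.layer3
        self.stage4 = net.layer4

    def forward(self, x):
        x = self.stem(x)
        c3 = self.stage3(x)     # penultimate-stage feature map
        c4 = self.stage4(c3)    # final-stage feature map
        return c3, c4           # the "last two" feature maps
```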
In one embodiment, the preprocessing module 604 is further configured to perform multi-head attention calculation on the first feature map to obtain a multi-head attention value corresponding to the first feature map, and perform normalization processing on the multi-head attention value to obtain a first low-dimensional feature map corresponding to the first feature map.
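A minimal sketch of this attention-based pooling follows, assuming a fixed set of learned query vectors whose count, head number, and channel width are illustrative: the queries attend over all spatial positions of the first feature map, and the normalized attention output serves as the first low-dimensional feature map.

```python
# Hedged sketch: multi-head attention pooling followed by normalization.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_queries=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Multi-head attention value over all spatial positions.
        pooled, _ = self.attn(q, tokens, tokens)  # (B, num_queries, C)
        # Normalization yields the low-dimensional feature map: far
        # fewer tokens than H*W, hence the reduced complexity.
        return self.norm(pooled)
```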
In one embodiment, the feature extraction unit includes an encoding unit and a decoding unit, and the feature extraction module 606 is further configured to: perform global feature extraction on the first low-dimensional feature map through the encoding unit to obtain global feature information; perform capsule conversion on the global feature information to obtain initial capsule information; input the initial capsule information to the decoding unit; perform category feature extraction on the initial capsule information to obtain category feature information; and perform capsule conversion on the category feature information to obtain the target capsule information.
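The encoding/decoding flow with capsule conversion might look like the following sketch; the transformer-style encoder and decoder, the `squash` non-linearity, and the reshape of feature vectors into capsule groups are assumptions drawn from common capsule-network practice, not the patent's exact architecture.

```python
# Hedged sketch of the encoder/decoder feature extraction with capsules.
import torch
import torch.nn as nn

def squash(v, dim=-1, eps=1e-8):
    # Standard capsule squashing: keeps direction, bounds length in [0, 1).
    n2 = (v * v).sum(dim, keepdim=True)
    return (n2 / (1.0 + n2)) * v / torch.sqrt(n2 + eps)

class CapsuleFeatureExtractor(nn.Module):
    def __init__(self, dim=256, caps_dim=16, num_queries=100):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=3)
        dec = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=3)
        self.object_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.caps_dim = caps_dim

    def to_capsules(self, x):                  # capsule conversion
        b, n, d = x.shape
        return squash(x.view(b, n, d // self.caps_dim, self.caps_dim))

    def forward(self, low_dim_map):            # (B, N, dim) token sequence
        global_feats = self.encoder(low_dim_map)        # global features
        initial_caps = self.to_capsules(global_feats)   # initial capsules
        b = low_dim_map.size(0)
        q = self.object_queries.unsqueeze(0).expand(b, -1, -1)
        class_feats = self.decoder(q, global_feats)     # category features
        return self.to_capsules(class_feats)            # target capsules
```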
In one embodiment, the target detection module 608 is further configured to: perform target detection on the target capsule information based on the attention route through the prediction unit to obtain a first detection result; perform linear transformation on the target capsule information through the prediction unit to obtain a second detection result; and fuse the first detection result and the second detection result to obtain the target detection result corresponding to the image to be detected.
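A sketch of the two prediction branches and their fusion is shown below. The attention route is approximated here by softmax-weighted agreement across capsules, and the fusion is a simple average; both are stand-ins, since the text does not specify the routing details or the fusion operator.

```python
# Hedged sketch of the dual-branch prediction head.
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, caps_dim=16, num_caps=16, num_classes=80):
        super().__init__()
        flat = num_caps * caps_dim
        self.route_logits = nn.Linear(caps_dim, 1)          # routing scores
        self.route_out = nn.Linear(caps_dim, num_classes + 4)
        self.linear_out = nn.Linear(flat, num_classes + 4)  # linear branch

    def forward(self, caps):            # caps: (B, Q, num_caps, caps_dim)
        # Branch 1: weight capsules by attention-routing scores, then
        # predict classes and box parameters -> first detection result.
        w = self.route_logits(caps).softmax(dim=2)   # (B, Q, num_caps, 1)
        routed = (w * caps).sum(dim=2)               # (B, Q, caps_dim)
        first = self.route_out(routed)
        # Branch 2: direct linear transformation of the flattened
        # capsules -> second detection result.
        second = self.linear_out(caps.flatten(2))
        # Fuse the two detection results (average assumed here).
        return 0.5 * (first + second)   # (B, Q, num_classes + 4)
```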
In one embodiment, the apparatus further comprises:
and the sample acquisition module is used for acquiring a sample image set.
The sample preprocessing module is used for inputting the sample image set into the target detection model to be trained, extracting a second feature map corresponding to the sample image set through the preprocessing unit in the target detection model to be trained, and performing attention-based pooling processing on the second feature map to obtain a second low-dimensional feature map corresponding to the second feature map.
And the sample feature extraction module is used for carrying out feature extraction on the second low-dimensional feature map through a feature extraction unit in the target detection model to be trained to obtain target capsule information corresponding to the sample image set.
And the sample target detection module is used for carrying out target detection on the target capsule information corresponding to the sample image set through a prediction unit in the target detection model to be trained, so as to obtain a target detection result corresponding to the sample image set.
And the parameter updating module is used for calculating the loss value of the target detection model to be trained according to the target detection result corresponding to the sample image set, updating the network parameter of the target detection model to be trained according to the loss value until the preset condition is met, and obtaining the trained target detection model.
In one embodiment, the sample image set is marked with target label information, and the parameter updating module is further configured to perform bipartite matching between the target detection result corresponding to the sample image set and the target label information to obtain a matching result, and to calculate the loss value of the target detection model to be trained according to the matching result.
In one embodiment, the parameter updating module is further configured to calculate a target position offset loss value, a classification loss value, and a matching loss value of the target detection model to be trained.
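Tying the training-side modules together, a minimal training-step sketch might look as follows, reusing the `match_and_compute_loss` sketch above; the model interface (batched logits and boxes) and the optimizer choice are assumptions.

```python
# Hedged sketch of one parameter update from the loss value.
import torch

def train_step(model, images, gt_classes, gt_boxes, optimizer):
    # Forward pass: per-image detection results for the sample batch.
    pred_logits, pred_boxes = model(images)   # (B, Q, C), (B, Q, 4)
    # Bipartite matching + loss per image (see match_and_compute_loss).
    loss = sum(match_and_compute_loss(pl, pb, gc, gb)
               for pl, pb, gc, gb in zip(pred_logits, pred_boxes,
                                         gt_classes, gt_boxes))
    # Update the network parameters of the model according to the loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```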
For specific limitations of the object detection apparatus, reference may be made to the above limitations of the object detection method, which are not repeated here. Each module in the above object detection apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or independent of, the processor in the computer device, or may be stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to each of the above modules.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in FIG. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data of the object detection method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements an object detection method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 7 is merely a block diagram of a portion of the structure related to the present application and does not limit the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory storing a computer program and a processor that implements the steps of the method embodiments described above when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program stored on a non-volatile computer-readable storage medium which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the application; their descriptions are specific and detailed, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the application, and these all fall within the protection scope of the application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (11)

1. A method of target detection, the method comprising:
Acquiring an image to be detected;
Inputting the image to be detected into a trained target detection model, wherein the target detection model comprises a preprocessing unit, a feature extraction unit and a prediction unit;
Extracting a first feature map corresponding to the image to be detected through the preprocessing unit, and carrying out attention-based pooling on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map;
Performing feature extraction on the first low-dimensional feature map through the feature extraction unit to obtain target capsule information corresponding to the image to be detected, wherein the target capsule information is the feature information of a target object in the image to be detected, which is represented by a capsule;
and carrying out target detection on the target capsule information through the prediction unit to obtain a target detection result corresponding to the image to be detected.
2. The method according to claim 1, wherein the extracting, by the preprocessing unit, the first feature map corresponding to the image to be detected includes:
extracting features of the image to be detected through the convolutional neural network in the preprocessing unit, and determining the feature maps output by the last two convolutional layers of the convolutional neural network as the first feature map corresponding to the image to be detected.
3. The method of claim 1, wherein performing the attention-based pooling on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map comprises:
Performing multi-head attention calculation on the first feature map to obtain a multi-head attention value corresponding to the first feature map;
and normalizing the multi-head attention value to obtain a first low-dimensional feature map corresponding to the first feature map.
4. The method according to claim 1, wherein the feature extraction unit includes an encoding unit and a decoding unit, and the feature extracting, by the feature extraction unit, the first low-dimensional feature map includes:
Global feature extraction is carried out on the first low-dimensional feature map through the coding unit, global feature information is obtained, and capsule conversion is carried out on the global feature information, so that initial capsule information is obtained;
Inputting the initial capsule information into the decoding unit, extracting category characteristics of the initial capsule information to obtain category characteristic information, and performing capsule conversion on the category characteristic information to obtain target capsule information.
5. The method according to claim 1, wherein the performing, by the prediction unit, the target detection on the target capsule information to obtain a target detection result corresponding to the image to be detected includes:
Performing target detection on the target capsule information based on the attention route through the prediction unit to obtain a first detection result;
performing linear transformation on the target capsule information through the prediction unit to obtain a second detection result;
And fusing the first detection result and the second detection result to obtain a target detection result corresponding to the image to be detected.
6. The method of claim 1, wherein prior to the acquiring the image to be detected, the method further comprises:
Acquiring a sample image set;
Inputting the sample image set into a target detection model to be trained, extracting a second feature map corresponding to the sample image set through a preprocessing unit in the target detection model to be trained, and performing attention-based pooling processing on the second feature map to obtain a second low-dimensional feature map corresponding to the second feature map;
Performing feature extraction on the second low-dimensional feature map through a feature extraction unit in the target detection model to be trained to obtain target capsule information corresponding to the sample image set;
performing target detection on target capsule information corresponding to the sample image set through a prediction unit in the target detection model to be trained to obtain a target detection result corresponding to the sample image set;
And calculating a loss value of the target detection model to be trained according to a target detection result corresponding to the sample image set, and updating network parameters of the target detection model to be trained according to the loss value until a preset condition is met, so as to obtain the trained target detection model.
7. The method of claim 6, wherein the sample image set is labeled with target label information, and wherein calculating the loss value of the target detection model to be trained based on the target detection result corresponding to the sample image set comprises:
performing bipartite matching between the target detection result corresponding to the sample image set and the target label information to obtain a matching result;
and calculating a loss value of the target detection model to be trained according to the matching result.
8. The method of claim 7, wherein the loss values of the object detection model to be trained include an object position offset loss value, a classification loss value, and a matching loss value.
9. An object detection device, the device comprising:
the image acquisition module is used for acquiring an image to be detected;
The preprocessing module is used for inputting the image to be detected into a trained target detection model, wherein the target detection model comprises a preprocessing unit, a feature extraction unit and a prediction unit, extracting a first feature map corresponding to the image to be detected through the preprocessing unit, and performing attention-based pooling processing on the first feature map to obtain a first low-dimensional feature map corresponding to the first feature map;
the feature extraction module is used for carrying out feature extraction on the first low-dimensional feature map through the feature extraction unit to obtain target capsule information corresponding to the image to be detected, wherein the target capsule information is the feature information of a target object in the image to be detected, which is represented by a capsule;
and the target detection module is used for carrying out target detection on the target capsule information through the prediction unit to obtain a target detection result corresponding to the image to be detected.
10. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
CN202110387750.9A 2021-04-12 2021-04-12 Target detection method, device, computer equipment and storage medium Active CN113449586B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110387750.9A CN113449586B (en) 2021-04-12 2021-04-12 Target detection method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110387750.9A CN113449586B (en) 2021-04-12 2021-04-12 Target detection method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113449586A CN113449586A (en) 2021-09-28
CN113449586B true CN113449586B (en) 2025-01-21

Family

ID=77809467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110387750.9A Active CN113449586B (en) 2021-04-12 2021-04-12 Target detection method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113449586B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963148B (en) * 2021-10-29 2023-08-08 北京百度网讯科技有限公司 Object detection method, object detection model training method and device
CN114067125B (en) * 2021-11-16 2024-11-26 杭州欣禾圣世科技有限公司 Target detection method, system and device based on full inference neural network
CN117315651B (en) * 2023-09-13 2024-06-14 深圳市大数据研究院 Multi-category cell detection and classification method and device based on affine consistent Transformer
CN117475241A (en) * 2023-12-27 2024-01-30 山西省水利建筑工程局集团有限公司 Geological mutation detection system and method for tunnel excavation of cantilever type heading machine

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344920A (en) * 2018-12-14 2019-02-15 汇纳科技股份有限公司 Customer attributes prediction technique, storage medium, system and equipment
CN109410917A (en) * 2018-09-26 2019-03-01 河海大学常州校区 Voice data classification method based on modified capsule network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782008B (en) * 2019-10-16 2022-05-13 北京百分点科技集团股份有限公司 Training method, prediction method and device of deep learning model
CN111737401B (en) * 2020-06-22 2023-03-24 北方工业大学 Key phrase prediction method based on Seq2set2Seq framework
CN111897957B (en) * 2020-07-15 2021-03-16 四川大学 Capsule neural network integrating multi-scale feature attention and text classification method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109410917A (en) * 2018-09-26 2019-03-01 河海大学常州校区 Voice data classification method based on modified capsule network
CN109344920A (en) * 2018-12-14 2019-02-15 汇纳科技股份有限公司 Customer attributes prediction technique, storage medium, system and equipment

Also Published As

Publication number Publication date
CN113449586A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
CN113449586B (en) Target detection method, device, computer equipment and storage medium
CN111860670B (en) Domain adaptive model training method, image detection method, device, equipment and medium
CN111738231B (en) Target object detection method and device, computer equipment and storage medium
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
CN109086873B (en) Training method, identification method, device and processing device of recurrent neural network
CN110889325A (en) Multitask facial motion recognition model training and multitask facial motion recognition method
CN109977912B (en) Video human body key point detection method and device, computer equipment and storage medium
CN111950329A (en) Target detection and model training method and device, computer equipment and storage medium
Xu et al. Task-aware meta-learning paradigm for universal structural damage segmentation using limited images
CN110176024B (en) Method, device, equipment and storage medium for detecting target in video
CN112749726B (en) Training method and device for target detection model, computer equipment and storage medium
CN105447532A (en) Identity authentication method and device
EP3965071A2 (en) Method and apparatus for pose identification
JP2024513596A (en) Image processing method and apparatus and computer readable storage medium
CN111191533A (en) Pedestrian re-identification processing method and device, computer equipment and storage medium
CN113706481A (en) Sperm quality detection method, sperm quality detection device, computer equipment and storage medium
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
CN112348116A (en) Target detection method and device using spatial context and computer equipment
CN113537020A (en) Recognition method of complex SAR image target based on improved neural network
Ge et al. Fine-tuning vision foundation model for crack segmentation in civil infrastructures
Al-Jubouri et al. A comparative analysis of automatic deep neural networks for image retrieval
Midwinter et al. Unsupervised defect segmentation with pose priors
CN111353429A (en) Interest degree method and system based on eyeball turning
CN114677611B (en) Data identification method, storage medium and device
CN118298513A (en) Power operation violation detection method and system based on machine vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant