CN114612764B - Method, device and readable medium for object detection based on millimeter wave images

Info

Publication number: CN114612764B
Application number: CN202210142734.8A
Authority: CN
Inventors: 陈敏达; 徐绍凯; 张帅; 王汉超; 贾宝芝; 何一凡
Original assignee: Xiamen Ruiwei Information Technology Co ltd
Current assignee: Xiamen Ruiwei Information Technology Co ltd
Filing date: 2022-02-16
Publication date: 2025-02-25
Anticipated expiration: 2042-02-16

Abstract

The invention discloses an article detection method, device and readable medium based on millimeter wave images. Millimeter wave images of the front and the back of a human body are acquired and input into a trained backbone network to extract front features and back features respectively. The front features and the back features are each expanded by interactive fusion based on an attention mechanism, and the resulting expanded front feature image and/or expanded back feature image is input into a trained multi-task neural network architecture based on non-homologous data, which outputs a full-image article detection result and a human skeleton detection result. The image is divided into a plurality of feature blocks based on the human skeleton detection result, and each feature block is detected by an independent detection head to obtain a secondary detection result, which is then mapped back to its position in the original image to obtain a block article detection result. Finally, non-maximum suppression is applied to the block article detection result and the full-image article detection result to obtain a final article detection result.

Description

Article detection method and device based on millimeter wave image and readable medium

Technical Field

The invention relates to the field of computer vision and deep learning, in particular to an article detection method and device based on millimeter wave images and a readable medium.

Background

Millimeter wave human body imaging equipment is harmless to the human body and has strong penetrating power; its emission power is less than one thousandth of the electromagnetic radiation of a mobile phone. It can accurately identify articles carried on the human body, which improves the objectivity, accuracy and pertinence of inspection, reduces the labor intensity of security inspectors, and improves security inspection efficiency.

Millimeter waves are popular in intelligent security inspection today because they can penetrate clothing without harming the human body. Combining millimeter wave technology with a downstream machine vision algorithm can greatly improve the efficiency and reliability of the security inspection process and reduce the labor it requires. However, existing intelligent security inspection systems generally deliver unsatisfactory detection performance because of problems such as indistinct imaging of contraband, insufficient feature mining and low algorithm performance, so they cannot bear the main responsibility in the security inspection process and can only supplement manual inspection.

The millimeter wave contraband detection in the prior art has the following problems:

1. In millimeter wave scanning at a single angle, the imaging of contraband is often incomplete and the extracted features are often not rich enough, which results in a low detection rate for contraband.

2. Existing inspection schemes typically use one model per visual task. When the application scene requires several tasks to be combined, running inference on several models seriously slows the algorithm, so that in scenes requiring real-time operation it either cannot reach the required speed or places very strict demands on hardware.

3. Owing to the limitations of current millimeter wave imaging technology, some contraband is not imaged clearly enough (for example low-density powder explosives and small lighters). Such contraband is easily confused with certain background features (such as straps, buttons, muscle lines and imaging noise), resulting in false or missed detections.

4. Existing contraband detection methods generally use a convolutional neural network to extract features. This captures the texture information of a target object well but lacks semantic understanding of the target's local context. In contraband detection such local semantic understanding is necessary, because features such as pockets, tape/straps and clothing folds appear around most contraband, and taking them into account during detection can significantly improve accuracy.

Disclosure of Invention

In view of the technical problems mentioned above, an object of an embodiment of the present application is to provide a method, an apparatus and a readable medium for detecting an article based on a millimeter wave image, so as to solve the technical problems mentioned in the background section above.

In a first aspect, an embodiment of the present application provides an article detection method based on millimeter wave images, including the steps of:

S1, acquiring a millimeter wave image of the front side of a human body and a millimeter wave image of the back side of the human body, inputting the millimeter wave image of the front side of the human body and the millimeter wave image of the back side of the human body into a trained backbone network, and respectively extracting front characteristics and back characteristics;

S2, respectively carrying out interactive fusion expansion on the front features and the back features based on an attention mechanism to obtain an expanded front feature image and an expanded back feature image;

S3, inputting the extended front characteristic image and/or the extended back characteristic image into a trained multi-task neural network architecture based on non-homologous data, and outputting a full-image object detection result and a human skeleton detection result;

S4, dividing the extended front characteristic image and/or the extended back characteristic image into a plurality of characteristic blocks based on the human skeleton detection result, respectively carrying out secondary detection on each characteristic block by adopting an independent detection head to obtain a secondary detection result, and mapping the secondary detection result back to its position in the extended front characteristic image and/or the extended back characteristic image to obtain a block article detection result;

S5, performing non-maximum suppression on the block article detection result and the full-image article detection result to obtain a final article detection result.

In a specific embodiment, the backbone network employs a Swin Transformer structure.

In a specific embodiment, step S2 specifically includes:

comparing the query vector generated from the front features for similarity with the key vector generated from the back features, generating a first attention matrix through a softmax operation, sampling the value vector of the back features with the first attention matrix, and flipping the sampled back features and splicing them with the front features to obtain an extended front feature image;

comparing the query vector generated from the back features for similarity with the key vector generated from the front features, generating a second attention matrix through a softmax operation, sampling the value vector of the front features with the second attention matrix, and flipping the sampled front features and splicing them with the back features to obtain an extended back feature image.

In particular embodiments, the non-homologous data based multi-tasking neural network architecture includes task heads corresponding to an item detection task, a human skeleton detection task, and other visual tasks, respectively.

In a specific embodiment, the training process of the non-homologous data-based multi-task neural network architecture in step S3 specifically includes:

sampling the non-homologous data set through a sampler to obtain a task response vector for each sample in the non-homologous data set, wherein the task response vector indicates which of the object detection task, the human skeleton detection task and the other visual tasks are annotated in the data set, and splicing the task response vectors of the samples to obtain a task response matrix;

inputting the extended front characteristic image and the extended back characteristic image corresponding to the non-homologous data set into a multi-task neural network architecture based on non-homologous data and provided with a plurality of task heads, carrying out forward propagation through each of the plurality of task heads to generate a loss value corresponding to each task head, and splicing the loss values corresponding to all the task heads into a loss vector;

and multiplying the task response matrix and the loss vector element-wise to obtain a final loss, and carrying out back propagation until the expected effect is achieved or the training end condition is met.

In a specific embodiment, step S4 specifically includes:

S41, dividing the extended front characteristic image and/or the extended back characteristic image into a plurality of areas according to the human skeleton detection result;

S42, sampling the extended front characteristic image and/or the extended back characteristic image through RoIAlign operation based on the coordinates of each region, so as to obtain a plurality of characteristic blocks, and recording the initial position of each characteristic block in the extended front characteristic image and/or the extended back characteristic image;

S43, performing secondary detection on each characteristic block by adopting an independent detection head to obtain a secondary detection result of each characteristic block;

S44, adding the initial position of each feature block in the extended front feature image and/or the extended back feature image to the detection frame coordinates in the secondary detection result of each feature block to obtain a block article detection result.

In a specific embodiment, step S43 further includes: in response to a first feature block and a second feature block corresponding to symmetrical parts, flipping the second feature block and inputting it to the detection head corresponding to the first feature block for secondary detection.

In particular embodiments, the item comprises contraband.

In a second aspect, an embodiment of the present application provides an article detection apparatus based on a millimeter wave image, including:

the data acquisition module is configured to acquire millimeter wave images of the front side of the human body and millimeter wave images of the back side of the human body, input the millimeter wave images of the front side of the human body and the millimeter wave images of the back side of the human body into a trained backbone network, and extract front characteristics and back characteristics respectively;

The feature expansion module is configured to respectively perform interactive fusion expansion on the front features and the back features based on an attention mechanism to obtain an expanded front feature image and an expanded back feature image;

The comprehensive detection module is configured to input the extended front characteristic image and/or the extended back characteristic image into a trained multi-task neural network architecture based on non-homologous data, and output a full-image object detection result and a human skeleton detection result;

The block detection module is configured to divide the extended front characteristic image and/or the extended back characteristic image into a plurality of characteristic blocks based on the human skeleton detection result, perform secondary detection on each characteristic block by adopting an independent detection head to obtain a secondary detection result, and map the secondary detection result back to its position in the extended front characteristic image and/or the extended back characteristic image to obtain a block article detection result;

And the merging module is configured to carry out non-maximum suppression on the block article detection result and the full-image article detection result to obtain a final article detection result.

In a third aspect, embodiments of the present application provide an electronic device comprising one or more processors, storage means for storing one or more programs, which when executed by the one or more processors, cause the one or more processors to implement a method as described in any of the implementations of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.

Compared with the prior art, the invention has the following beneficial effects:

(1) The invention uses a trained backbone network to extract the front features and the back features respectively, which provides more comprehensive features of the detected object and improves the detection rate of articles.

(2) The invention adopts a multi-task neural network architecture based on non-homologous data in which each task is handled by an independent task head and the input of every task head can reuse the features extracted by the backbone network. This effectively reduces the difficulty of training the model and, without increasing the labeling cost, greatly improves the speed of processing a plurality of tasks in parallel, so that the algorithm meets the real-time requirement.

(3) The invention also performs human skeleton detection on the detected human body so that the feature image can subsequently be divided into blocks according to the human skeleton detection result. The resulting feature blocks are each detected a second time by an independent detection head; because the background features input to each detection head are relatively fixed, false detections are reduced and the detection rate is improved.

(4) The backbone network adopts a Swin Transformer structure, which introduces the focus on local features and the hierarchical feature pyramid that are characteristic of convolutional networks, so that the backbone network combines the advantages of sequence networks and convolutional networks and detection performance is greatly improved.

(5) The invention greatly improves the accuracy and efficiency of the contraband detection task and makes a security inspection process led by an intelligent security inspection system feasible.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is an exemplary device architecture diagram to which an embodiment of the present application may be applied;

Fig. 2 is a flow chart of a millimeter wave image-based article detection method according to an embodiment of the present invention;

Fig. 3 is a schematic overall flow chart of contraband detection in the millimeter wave image-based article detection method according to an embodiment of the present invention;

Fig. 4 is a flowchart of step S2 of the millimeter wave image-based article detection method according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of the training process of the multi-task neural network architecture based on non-homologous data in the millimeter wave image-based article detection method according to an embodiment of the present invention;

Fig. 6 shows a full-image article detection result and a human skeleton detection result of the millimeter wave image-based article detection method according to an embodiment of the present invention;

Fig. 7 shows the result of dividing regions according to the human skeleton detection result in the millimeter wave image-based article detection method according to an embodiment of the present invention;

Fig. 8 is a schematic view of an article detection device based on millimeter wave images according to an embodiment of the present invention;

Fig. 9 is a schematic structural view of a computer device suitable for use in an electronic apparatus for implementing an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Fig. 1 illustrates an exemplary device architecture 100 of a millimeter wave image-based item detection method or millimeter wave image-based item detection device to which embodiments of the present application may be applied.

As shown in fig. 1, the apparatus architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various applications, such as a data processing class application, a file processing class application, and the like, may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smartphones, tablets, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the above-listed electronic devices, and may be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services) or as a single piece of software or software module. The present invention is not particularly limited herein.

The server 105 may be a server providing various services, such as a background data processing server processing files or data uploaded by the terminal devices 101, 102, 103. The background data processing server can process the acquired file or data to generate a processing result.

It should be noted that, the method for detecting an article based on a millimeter wave image according to the embodiment of the present application may be executed by the server 105, or may be executed by the terminal devices 101, 102, 103, and accordingly, the apparatus for detecting an article based on a millimeter wave image may be provided in the server 105, or may be provided in the terminal devices 101, 102, 103.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where the processed data does not need to be acquired from a remote location, the above-described apparatus architecture may not include a network, but only a server or terminal device.

Fig. 2 shows an article detection method based on millimeter wave images, which includes the following steps:

S1, acquiring millimeter wave images of the front side and the back side of a human body, inputting the millimeter wave images of the front side and the back side of the human body into a trained backbone network, and extracting front characteristics and back characteristics respectively.

Specifically, in the embodiment of the application, millimeter wave images are acquired from two angles, the front and the back of the human body. This avoids the problems of single-angle millimeter wave scanning, in which the imaging of an article is incomplete, the extracted features are not rich enough, and the detection rate is therefore low. Referring to fig. 3, the embodiment of the application takes contraband detection in a security inspection scenario as an example of article detection; other embodiments are equally applicable to detecting other articles. In a specific embodiment, the backbone network employs a Swin Transformer structure. The Swin Transformer structure retains the Transformer's inherent ability to capture global dependencies and its strong semantic understanding, and, through operations such as patch partitioning, windowing and shifted windows, also introduces the focus on local features and the hierarchical feature pyramid that are characteristic of convolutional networks. The backbone network thus combines the advantages of sequence networks and convolutional networks, which greatly improves detection performance. Any convolutional neural network with multi-level downsampling can also be selected as the backbone network. The Swin Transformer structure is preferred mainly because it gave the best observed results, and because its ability to capture local dependencies addresses, both in theory and in practice, the local semantic understanding that contraband detection requires: features such as pockets, adhesive tape/straps and clothing folds appear around most contraband, and taking them into account during detection can significantly improve accuracy.

S2, respectively carrying out interactive fusion expansion on the front features and the back features based on an attention mechanism to obtain an expanded front feature image and an expanded back feature image.

In a specific embodiment, step S2 specifically includes:

Specifically, referring to fig. 4 and taking front detection as an example, when the front is detected the back features are also sampled through an attention mechanism. The sampling process is as follows: a query vector (query) generated from the front features is compared for similarity with a key vector (key) generated from the back features, a first attention matrix is generated through a softmax operation, and the first attention matrix is used to sample the value vector (value) of the back features. The sampled back features are then flipped horizontally and spliced with the front features to obtain an extended front feature image, and downstream task detection is finally performed on the extended front feature image. Back detection follows the same principle to obtain an extended back feature image.

Thus, the method first uses a pre-trained backbone network to extract the front features and the back features respectively; when the front image is detected, the back features are used to expand the front features, and vice versa. This provides more comprehensive features of the detected object. In particular, it improves the detection rate of contraband carried at the side of the human body, of which only half is imaged in each of the front and back views.
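The front-detection sampling described above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the projection matrices `w_q`, `w_k` and `w_v`, the scaled dot product used as the similarity comparison, the choice of the width axis for the horizontal flip, and channel-wise concatenation as the splicing operation are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def expand_features(primary, other, w_q, w_k, w_v):
    """Expand `primary` (H, W, C) with features sampled from `other` (H, W, C).

    Hypothetical helper: queries come from the view being detected,
    keys and values from the opposite view.
    """
    h, w, c = primary.shape
    q = primary.reshape(h * w, c) @ w_q          # queries from the primary view
    k = other.reshape(h * w, c) @ w_k            # keys from the opposite view
    v = other.reshape(h * w, c) @ w_v            # values from the opposite view
    # Assumed similarity measure: scaled dot product followed by softmax.
    attention = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    sampled = (attention @ v).reshape(h, w, -1)  # opposite-view features sampled by attention
    flipped = sampled[:, ::-1, :]                # horizontal flip along the width axis
    # Assumed splicing: concatenate along the channel axis.
    return np.concatenate([primary, flipped], axis=-1)

rng = np.random.default_rng(0)
front = rng.normal(size=(4, 3, 8))
back = rng.normal(size=(4, 3, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
extended_front = expand_features(front, back, w_q, w_k, w_v)
extended_back = expand_features(back, front, w_q, w_k, w_v)
```

Calling the same function with the arguments swapped yields the extended back feature image, mirroring the statement that back detection follows the same principle.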

S3, inputting the extended front characteristic image and/or the extended back characteristic image into a trained multi-task neural network architecture based on non-homologous data, and outputting a full-image object detection result and a human skeleton detection result.

Specifically, a conventional neural network training method requires every input training sample to carry labels for all tasks. In the security inspection scene, taking contraband detection as an example, all data would have to be annotated for three kinds of task: contraband detection, human skeleton detection and other visual tasks. Human skeleton detection is clearly far less difficult than contraband detection and needs far less training data, so fully annotating all training data is very poor value for money. The method therefore adopts a multi-task neural network architecture in which each task has an independent task head, but the input of every task head reuses the features extracted by the backbone network, that is, the backbone network is shared.

In a specific embodiment, the method adopts a multi-task neural network architecture for article detection and human skeleton detection together with a mixed training method on non-homologous data sets, wherein the multi-task neural network architecture based on non-homologous data comprises task heads corresponding respectively to an article detection task, a human skeleton detection task and other visual tasks. Compared with the traditional approach in which one model processes one task, the method uses one model to process a plurality of tasks, and each task reuses the features extracted by the backbone network, which greatly reduces the parameter count and inference time of the neural network.

To avoid redundant annotation, the method uses non-homologous data sets during training, that is, data with different types of annotation can be mixed and sampled as training input. The method thus provides a way to train a multi-task neural network architecture on a mixture of non-homologous data sets. The architecture is trained without increasing the labeling cost, while the speed of processing a plurality of tasks in parallel is greatly improved and the real-time requirement is met. When sampling, the method additionally records a task response vector for each sample in the non-homologous data set, and uses it to ensure that, when the loss is calculated, each sample contributes only the loss values of the tasks it responds to. In a specific embodiment, three visual tasks are assumed: contraband detection, human skeleton detection, and other visual tasks. A task response vector is set for each data set; its length is the number of tasks, e.g. 3, and its values are 0 or 1. If the data set has labels for a task, the corresponding position of the task response vector is set to 1; otherwise it is set to 0. For example, data set A, in which only contraband information is annotated, has the task response vector [1 0 0], and data set B, in which human skeleton information and other visual task information are annotated, has the task response vector [0 1 1].
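As a minimal sketch of how such task response vectors might be derived, the following Python fragment builds one vector per data set from the set of tasks it annotates. The task names, the task order and the `DATASETS` dictionary are hypothetical and chosen only to mirror data sets A and B of the example.

```python
# Assumed task order: contraband detection, human skeleton detection, other visual tasks.
TASKS = ("contraband", "skeleton", "other")

def task_response_vector(annotated_tasks):
    """Return a 0/1 list with a 1 at each position whose task is annotated."""
    return [1 if task in annotated_tasks else 0 for task in TASKS]

# Hypothetical data sets: A annotates only contraband, B annotates skeleton and other tasks.
DATASETS = {
    "A": {"contraband"},
    "B": {"skeleton", "other"},
}

response_a = task_response_vector(DATASETS["A"])
response_b = task_response_vector(DATASETS["B"])
```

Every sample drawn from a data set would inherit that data set's vector, so the sampler only needs to remember which data set each sample came from.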

Specifically, referring to fig. 5, a sampler samples randomly from each data set in the non-homologous data set, records the task response vector of the data set to which each sample belongs, and collects a certain number of samples (the batch size). The samples are input into the backbone network for feature extraction, and the extracted front features or back features, after the feature expansion of step S2, are input into the multi-task neural network architecture based on non-homologous data. Forward propagation is carried out through the 3 task heads of the architecture to generate loss values, which form a loss vector L; L is a two-dimensional array of size batch size × 3. The task response vectors of the samples are spliced to obtain a task response matrix M. The loss vector L and the task response matrix M are multiplied element-wise to obtain the final loss, which is then back-propagated. Because the task response vector is adopted, the input of every task head can reuse the features extracted by the same backbone network, which effectively reduces the scale of the whole model and the difficulty of training it. The multi-task neural network architecture based on non-homologous data refers to a neural network architecture formed by the plurality of task heads obtained by this training method rather than to a specific neural network model; each task head can use a suitable existing detection neural network model according to actual detection requirements.
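The masking of the loss by the task response matrix can be sketched in NumPy as follows. The text does not state how the masked entries are reduced to a single scalar, so averaging over the responding entries is an assumption, as are the function name and the example loss values.

```python
import numpy as np

def masked_multitask_loss(loss, response):
    """Element-wise product of per-sample, per-task losses with the task response matrix.

    loss:     (batch_size, num_tasks) loss values, one column per task head.
    response: (batch_size, num_tasks) 0/1 task response matrix.
    """
    loss = np.asarray(loss, dtype=float)
    response = np.asarray(response, dtype=float)
    if loss.shape != response.shape:
        raise ValueError("loss and response must have the same shape")
    masked = loss * response  # losses of unannotated tasks are zeroed out
    # Assumed reduction: mean over the entries that actually respond.
    return masked.sum() / max(response.sum(), 1.0)

# One sample from a contraband-only data set and one from a skeleton/other data set.
L = [[0.5, 2.0, 3.0],
     [4.0, 0.2, 0.4]]
M = [[1, 0, 0],
     [0, 1, 1]]
final_loss = masked_multitask_loss(L, M)
```

In a deep learning framework the same product would be taken on tensors so that gradients flow only through the task heads for which a sample carries annotations.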

The extended front characteristic image and/or the extended back characteristic image extracted from the millimeter wave image of the front surface of the human body and the millimeter wave image of the back surface of the human body acquired in real time are input into a trained multi-task neural network architecture based on non-homologous data, and a full-image object detection result and a human skeleton detection result are output, as shown in fig. 6.

S4, dividing the extended front characteristic image and/or the extended back characteristic image into a plurality of characteristic blocks based on the human skeleton detection result, respectively carrying out secondary detection on each characteristic block by adopting an independent detection head to obtain a secondary detection result, and carrying out position mapping on the extended front characteristic image and/or the extended back characteristic image by the secondary detection result to obtain a segmented article detection result.

In a specific embodiment, step S4 specifically includes:

Specifically, the method additionally performs human skeleton detection on the detected human body in order to distinguish its limbs and trunk, and then divides the extended front feature image and/or the extended back feature image into blocks according to the coordinates of each body part: head and neck, trunk, crotch/buttocks, upper arms, forearms, thighs and lower legs. Each feature block is then detected by its own dedicated detection head. For each detection head, the input background features are relatively fixed, so that false detection is reduced and the detection rate is improved.

Taking front detection as an example, the front is divided into 11 human body regions according to the human skeleton detection result. Referring to fig. 7, there are 8 limb regions, 2-4, 4-6, 3-5, 5-7, 8-10, 10-12, 9-11 and 11-13, and 3 trunk regions, namely 1-2-3 expanded upward (head), 2-3-8-9 (torso) and 8-9 (crotch/hip), the numbers being key point indices. Similarly, after the key point coordinates are flipped horizontally, the back can also be divided into 11 regions. A RoIAlign operation samples the features of the extended front feature image and/or the extended back feature image according to the coordinates of the divided regions, yielding 11 feature blocks of different sizes, and the initial position (dx_i, dy_i) of each feature block in the extended front feature image and/or the extended back feature image, namely the coordinate of its upper left corner, is recorded. Each feature block then undergoes secondary detection by an independent detection head. Symmetrical parts share the same detection head, but the input features of one of them must be flipped horizontally; the shared pairs are (2-4, 3-5), (4-6, 5-7), (8-10, 9-11) and (10-12, 11-13). Each of the remaining regions has its own head, so there are 7 different detection heads in total. An independent secondary detection result is thereby obtained for each feature block. Adding the initial position of the corresponding feature block to the detection frame coordinates in the secondary detection result maps a secondary detection result [x_min, y_min, x_max, y_max] to the position [x_min + dx_i, y_min + dy_i, x_max + dx_i, y_max + dy_i] in the extended front feature image and/or the extended back feature image.
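The head sharing and the position mapping can be sketched in plain Python as follows. The region names reuse the key point labels of fig. 7; the helper names are hypothetical, and undoing the horizontal flip of a shared-head block before adding the offset is an assumption that the text does not spell out.

```python
# The 11 regions of one view, named by the key points that bound them.
REGIONS = ["1-2-3", "2-3-8-9", "8-9",
           "2-4", "4-6", "3-5", "5-7",
           "8-10", "10-12", "9-11", "11-13"]

# Symmetrical pairs: the second region of each pair reuses the head of the first.
SHARED_HEAD = {"3-5": "2-4", "5-7": "4-6", "9-11": "8-10", "11-13": "10-12"}

def head_for(region):
    """Return (head_name, flip): which head detects the block and whether to flip it first."""
    if region in SHARED_HEAD:
        return SHARED_HEAD[region], True
    return region, False

def map_to_full(box, origin, flipped=False, block_width=None):
    """Map a block-local box [x_min, y_min, x_max, y_max] into the full feature image."""
    x_min, y_min, x_max, y_max = box
    if flipped:
        # Assumption (not stated in the text): undo the horizontal flip first.
        x_min, x_max = block_width - x_max, block_width - x_min
    dx, dy = origin  # upper left corner (dx_i, dy_i) of the feature block
    return (x_min + dx, y_min + dy, x_max + dx, y_max + dy)

distinct_heads = {head_for(region)[0] for region in REGIONS}
mapped = map_to_full((1, 2, 3, 4), origin=(10, 20))
mapped_flipped = map_to_full((1, 2, 3, 4), origin=(10, 20), flipped=True, block_width=10)
```

Counting the distinct head names reproduces the total of 7 detection heads: four shared by the symmetrical limb pairs and one each for the head, torso and crotch/hip regions.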

Specifically, the block article detection result is combined with the full-image article detection result, and non-maximum suppression (NMS) is applied to the combined set to obtain the final article detection result.
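The merging step can be sketched as a plain greedy NMS over the union of the two result sets. This is an illustrative sketch only: the `Detection` tuple layout, the function names, the class-agnostic suppression and the IoU threshold of 0.5 are assumptions, since the text does not specify them.

```python
from typing import List, Tuple

# Hypothetical layout: (x_min, y_min, x_max, y_max, score)
Detection = Tuple[float, float, float, float, float]


def iou(a: Detection, b: Detection) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_detections(full_image: List[Detection], block: List[Detection],
                     iou_threshold: float = 0.5) -> List[Detection]:
    """Pool both result sets and keep, greedily by score, the boxes that do not
    overlap an already kept box by more than iou_threshold."""
    kept: List[Detection] = []
    for det in sorted(full_image + block, key=lambda d: d[4], reverse=True):
        if all(iou(det, k) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


full_image_result = [(0.0, 0.0, 10.0, 10.0, 0.9)]
block_result = [(1.0, 1.0, 11.0, 11.0, 0.8), (50.0, 50.0, 60.0, 60.0, 0.7)]
# The second box overlaps the first with IoU 81/119 and is suppressed.
print(merge_detections(full_image_result, block_result))
```

Both result sets must already be expressed in the same coordinate system, which is why the position mapping is applied to the secondary detection results before this step.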

With further reference to fig. 8, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of an article detection apparatus based on millimeter wave images, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.

The embodiment of the application provides an article detection device based on millimeter wave images, which comprises:

a data acquisition module 1, configured to acquire millimeter wave images of the front side and the back side of the human body, input the millimeter wave images of the front side and the back side of the human body into a trained backbone network, and extract front features and back features respectively;

a feature expansion module 2, configured to respectively perform interactive fusion expansion on the front features and the back features based on an attention mechanism to obtain an expanded front feature image and an expanded back feature image;

a comprehensive detection module 3, configured to input the expanded front feature image and/or the expanded back feature image into a trained multi-task neural network architecture based on non-homologous data, and output a full-image article detection result and a human skeleton detection result;

a block detection module 4, configured to divide the expanded front feature image and/or the expanded back feature image into a plurality of feature blocks based on the human skeleton detection result, perform secondary detection on each feature block with an independent detection head to obtain a secondary detection result, and perform position mapping on the expanded front feature image and/or the expanded back feature image to obtain a block article detection result; and

a merging module 5, configured to perform non-maximum suppression on the block article detection result and the full-image article detection result to obtain a final article detection result.

Referring now to fig. 9, there is illustrated a schematic diagram of a computer apparatus 900 suitable for use in an electronic device (e.g., a server or terminal device as illustrated in fig. 1) for implementing an embodiment of the present application. The electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.

As shown in fig. 9, the computer apparatus 900 includes a Central Processing Unit (CPU) 901 and a Graphics Processor (GPU) 902, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 903 or a program loaded from a storage section 909 into a Random Access Memory (RAM) 904. The RAM 904 also stores various programs and data required for the operation of the apparatus 900. The CPU 901, GPU 902, ROM 903, and RAM 904 are connected to each other by a bus 905. An input/output (I/O) interface 906 is also connected to the bus 905.

Connected to the I/O interface 906 are an input section 907 including a keyboard, a mouse, and the like, an output section 908 including a speaker, a Liquid Crystal Display (LCD), and the like, a storage section 909 including a hard disk, and the like, and a communication section 910 including a network interface card, such as a LAN card, a modem, and the like. The communication section 910 performs communication processing via a network such as the internet. The drive 911 may also be connected to the I/O interface 906 as needed. A removable medium 912 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 911 so that a computer program read out therefrom is installed into the storage section 909 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 910, and/or installed from the removable medium 912. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 901 and a Graphics Processor (GPU) 902.

It should be noted that the computer readable medium according to the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules involved in the embodiments of the present application may be implemented in software or in hardware. The described modules may also be provided in a processor.

As another aspect, the present application also provides a computer-readable medium that may be included in the electronic device described in the above embodiment, or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a millimeter wave image of the front side of a human body and a millimeter wave image of the back side of the human body, and input them into a trained backbone network to extract front features and back features respectively; perform interactive fusion expansion on the front features and the back features respectively based on an attention mechanism to obtain an expanded front feature image and an expanded back feature image; input the expanded front feature image and/or the expanded back feature image into a trained multi-task neural network architecture based on non-homologous data to output a full-image object detection result and a human skeleton detection result; divide the expanded front feature image and/or the expanded back feature image into a plurality of feature blocks based on the human skeleton detection result, perform secondary detection on each feature block with an independent detection head to obtain a secondary detection result, and perform position mapping on the expanded front feature image and/or the expanded back feature image to obtain a block object detection result; and perform non-maximum suppression on the block object detection result and the full-image object detection result to obtain a final object detection result.

The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the application referred to in the present application is not limited to technical solutions formed by the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the inventive concept described above, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims

1. A method for detecting objects based on millimeter wave images, comprising the following steps:

S1, obtaining a millimeter wave image of the front side of a human body and a millimeter wave image of the back side of a human body, and inputting the millimeter wave image of the front side of a human body and the millimeter wave image of the back side of a human body into a trained backbone network to respectively extract front features and back features;

S2, interactively fusing and expanding the front features and the back features based on the attention mechanism to obtain an expanded front feature image and an expanded back feature image, specifically including:

A query vector generated by the front feature is compared with a key vector generated by the back feature for similarity, and a first attention matrix is generated by a softmax operation, and the value vector of the back feature is sampled by the first attention matrix, and the sampled back feature is flipped and concatenated with the front feature to obtain an expanded front feature image;

A query vector generated by the back feature is compared with a key vector generated by the front feature for similarity, and a second attention matrix is generated by a softmax operation, and the second attention matrix is used to sample the value vector of the front feature, and the sampled front feature is flipped and concatenated with the back feature to obtain an expanded back feature image;

S3, inputting the expanded front feature image and/or the expanded back feature image into a trained multi-task neural network architecture based on non-homologous data, and outputting a full-image object detection result and a human skeleton detection result;

S4, dividing the expanded front feature image and/or the expanded back feature image into a plurality of feature blocks based on the human skeleton detection result, and performing secondary detection on each feature block using an independent detection head to obtain a secondary detection result, and performing position mapping on the expanded front feature image and/or the expanded back feature image to obtain a block object detection result;

S5, performing non-maximum suppression on the block object detection results and the full-image object detection results to obtain a final object detection result.

2. The object detection method based on millimeter wave images according to claim 1, characterized in that the backbone network adopts a swin-transformer structure.

3. The object detection method based on millimeter wave images according to claim 1, characterized in that the multi-task neural network architecture based on non-homologous data includes task heads corresponding to object detection tasks, human skeleton detection tasks and other visual tasks respectively.

4. The object detection method based on millimeter wave images according to claim 3, characterized in that the training process of the multi-task neural network architecture based on non-homologous data in step S3 specifically includes:

Sampling a non-homologous data set through a sampler, obtaining a task response vector of each sample in the non-homologous data set, wherein the task response vector corresponds to the annotation of the object detection task, the human skeleton detection task and other visual tasks in the data set, and concatenating the task response vector of each sample to obtain a task response matrix;

Inputting the expanded front feature image and the expanded back feature image corresponding to the non-homologous data set into a multi-task neural network architecture based on non-homologous data with multiple task heads, performing forward propagation through the multiple task heads respectively, generating a loss value corresponding to each task head, and concatenating the loss values corresponding to all task heads into a loss vector;

The task response matrix and the loss vector are dot-multiplied to obtain the final loss, and back-propagation is performed until the expected effect is achieved or the training end condition is met.

5. The object detection method based on millimeter wave images according to claim 1, characterized in that the step S4 specifically comprises:

S41, dividing the expanded front feature image and/or the expanded back feature image into a plurality of regions according to the human skeleton detection result;

S42, sampling the expanded front feature image and/or the expanded back feature image through a RoIAlign operation based on the coordinates of each region, thereby obtaining a plurality of feature blocks, and recording a starting position of each feature block in the expanded front feature image and/or the expanded back feature image;

S43, performing secondary detection on each feature block using an independent detection head to obtain a secondary detection result of each feature block;

S44, adding the detection frame coordinates in the secondary detection result of each feature block to the starting position of each feature block in the expanded front feature image and/or the expanded back feature image to obtain the block object detection result.

6. The object detection method based on millimeter wave images according to claim 5, characterized in that the step S43 further includes: in response to a first feature block and a second feature block corresponding to symmetrical parts, flipping the second feature block and inputting it into the detection head corresponding to the first feature block for secondary detection.

7. The object detection method based on millimeter wave images according to any one of claims 1 to 6, characterized in that the objects include contraband.

8. An object detection device based on millimeter wave images, comprising:

A data acquisition module is configured to acquire a millimeter wave image of the front side of a human body and a millimeter wave image of the back side of a human body, and input the millimeter wave image of the front side of a human body and the millimeter wave image of the back side of a human body into a trained backbone network to extract front features and back features respectively;

The feature expansion module is configured to interactively fuse and expand the front features and the back features based on the attention mechanism to obtain an expanded front feature image and an expanded back feature image, specifically including:

A comprehensive detection module is configured to input the expanded front feature image and/or the expanded back feature image into a trained multi-task neural network architecture based on non-homologous data, and output a full-image object detection result and a human skeleton detection result;

A block detection module is configured to divide the expanded front feature image and/or the expanded back feature image into a plurality of feature blocks based on the human skeleton detection result, and perform secondary detection on each feature block using an independent detection head to obtain a secondary detection result, and perform position mapping on the expanded front feature image and/or the expanded back feature image to obtain a block object detection result;

The merging module is configured to perform non-maximum suppression on the block object detection results and the full-image object detection results to obtain a final object detection result.

9. An electronic device comprising:

one or more processors;

a storage device for storing one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, wherein when the program is executed by a processor, the method according to any one of claims 1 to 7 is implemented.
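The loss combination recited in claim 4 can be read as masking each task loss by whether a sample carries an annotation for that task. The following Python sketch illustrates one such reading and is not the claimed implementation: the task names, the 0/1 encoding of the task response vector, the function names and the averaging over samples are all assumptions.

```python
from typing import List, Set

# Hypothetical task order, shared by the response vectors and the loss vector.
TASKS = ["object_detection", "skeleton_detection", "other_visual_task"]


def task_response_vector(annotated_tasks: Set[str]) -> List[float]:
    """1.0 where the sample is annotated for the task, 0.0 elsewhere (assumed encoding)."""
    return [1.0 if task in annotated_tasks else 0.0 for task in TASKS]


def final_loss(response_matrix: List[List[float]], loss_vector: List[float]) -> float:
    """Dot each sample's response vector with the per-head loss vector,
    then average over the samples (the averaging is an assumption)."""
    per_sample = [sum(r * l for r, l in zip(row, loss_vector)) for row in response_matrix]
    return sum(per_sample) / len(per_sample)


# Three samples from non-homologous data sets with different annotations.
response_matrix = [
    task_response_vector({"object_detection"}),
    task_response_vector({"skeleton_detection"}),
    task_response_vector({"object_detection", "skeleton_detection"}),
]
loss_vector = [2.0, 4.0, 8.0]  # one illustrative loss value per task head
print(final_loss(response_matrix, loss_vector))  # per-sample sums 2, 4, 6 -> 4.0
```

Under this reading, a task head contributes nothing to the loss for samples whose source data set has no annotation for that task, which is what allows data sets with different annotation types to be trained together.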