CN115496993B - Target detection method, device, equipment and storage medium based on frequency domain fusion - Google Patents
Target detection method, device, equipment and storage medium based on frequency domain fusion
- Publication number
- CN115496993B (application no. CN202211103064.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- frequency
- frequency domain
- feature vector
- attention
- Prior art date
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The present invention relates to the field of artificial intelligence, and in particular, to a method, apparatus, device, and storage medium for detecting a target based on frequency domain fusion. The method randomly masks part of the data in an image to be detected and retains Y groups of data; performs an s-level slow-decay high-low frequency transformation on each channel of each group of the Y groups of data; and inputs the resulting s1 high-frequency data feature vectors and s2 low-frequency data feature vectors into a trained frequency domain self-attention neural network, which outputs the region category information and region size information of the target detection region in the image to be detected.
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, apparatus, device, and storage medium for detecting a target based on frequency domain fusion.
Background
Currently, in deep-learning-based visual detection, an image in the spatial domain is typically fed into a deep learning network as input data, and the network learns the information contained in the image. Beyond determining the type of a target, the most important task is to accurately determine its position and size; current detection algorithms generally express the position and size information of a target in a picture as a rectangular bounding box. When the image contains a large amount of redundant information, feeding it into a deep learning network increases the number of parameters required in the network structure; more parameters increase the amount of computation during target detection and reduce detection efficiency. How to improve target detection efficiency in the target detection process is therefore a problem to be solved.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a method, an apparatus, a computer device and a storage medium for detecting a target based on frequency domain fusion, so as to solve the problem of low detection efficiency of the target area.
A first aspect of an embodiment of the present application provides a method for detecting a target based on frequency domain fusion, where the method includes:
according to a preset mask rule, randomly masking part of data from the image to be detected and then reserving Y groups of data; the Y-group data are Y-column data in the image to be detected or Y-row data in the image to be detected, wherein the image to be detected comprises X channel data, and X and Y are integers larger than 1;
performing an s-level slow-decay high-low frequency transformation on each channel data of each group of data in the Y groups of data to obtain s layers of frequency domain data feature vectors for each channel data of each group of data; wherein the frequency domain data feature vectors include s1 high-frequency data feature vectors and s2 low-frequency data feature vectors; wherein s, s1, s2 are integers greater than 1, s1 is less than s, and s2 is less than s;
and inputting the s1 high-frequency data characteristic vectors and the s2 low-frequency data characteristic vectors into a trained frequency domain self-attention neural network, and outputting the region category information and the region size information of the target detection region in the image to be detected.
A second aspect of embodiments of the present application provides a target detection apparatus based on frequency domain fusion, the apparatus including:
The masking sample-reserving module is used for randomly masking part of the data from the image to be detected according to a preset mask rule and then retaining Y groups of data; the Y groups of data are Y columns of data in the image to be detected or Y rows of data in the image to be detected, the image to be detected comprises X channels of data, and X and Y are integers greater than 1;
the slow-decay high-low frequency transform module is used for performing an s-level slow-decay high-low frequency transformation on each channel data of each group of data in the Y groups of data to obtain s layers of frequency domain data feature vectors for each channel data of each group of data; wherein the frequency domain data feature vectors include s1 high-frequency data feature vectors and s2 low-frequency data feature vectors; wherein s, s1, s2 are integers greater than 1, s1 is less than s, and s2 is less than s;
the detection module is used for inputting the s1 high-frequency data characteristic vectors and the s2 low-frequency data characteristic vectors into the trained frequency domain self-attention neural network and outputting the region type information and the region size information of the target detection region in the image to be detected.
In a third aspect, an embodiment of the present invention provides a computer device, where the computer device includes a processor, a memory, and a computer program stored in the memory and executable on the processor, where the processor implements the method for detecting an object based on frequency domain fusion according to the first aspect when the computer program is executed.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium storing a computer program, which when executed by a processor implements the frequency domain fusion-based object detection method according to the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
according to a preset mask rule, part of the data is randomly masked from the image to be detected and Y groups of data are retained, the Y groups being Y columns of data or Y rows of data in the image, where the image to be detected comprises X channels of data and X and Y are integers greater than 1. An s-level slow-decay high-low frequency transformation is then performed on each channel of each group of the Y groups of data to obtain s layers of frequency domain data feature vectors for each channel of each group; the frequency domain data feature vectors comprise s1 high-frequency and s2 low-frequency data feature vectors, where s, s1 and s2 are integers greater than 1, s1 is less than s, and s2 is less than s. The s1 high-frequency and s2 low-frequency data feature vectors are input into a trained frequency domain self-attention neural network, which outputs the region category information and region size information of the target detection region in the image to be detected. Because the masking discards redundant spatial data and the frequency domain features are compact, fewer network parameters and less computation are required, improving target detection efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application environment of a target detection method based on frequency domain fusion according to an embodiment of the present invention;
fig. 2 is a flow chart of a method for detecting targets based on frequency domain fusion according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a frequency domain self-attention neural network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a fusion module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a class and size prediction module according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a frequency domain self-attention neural network according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a first frequency-domain self-attention feature vector extraction network according to an embodiment of the present invention.
FIG. 8 is a schematic diagram of a second frequency-domain self-attention feature vector extraction network according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an encoder according to an embodiment of the present invention;
fig. 10 is a flowchart of a method for detecting a target based on frequency domain fusion according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a target detection device based on frequency domain fusion according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the invention. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The embodiments of the invention may acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
The basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
It should be understood that the sequence numbers of the steps in the following embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not be construed as limiting the implementation process of the embodiments of the present invention.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
The object detection method based on frequency domain fusion provided by the embodiment of the invention can be applied to an application environment as shown in fig. 1, wherein a client communicates with a server. The clients include, but are not limited to, palm top computers, desktop computers, notebook computers, ultra-mobile personal computer (UMPC), netbooks, personal digital assistants (personal digital assistant, PDA), and the like. The server may be implemented by a stand-alone server or a server cluster formed by a plurality of servers.
Referring to fig. 2, a flow chart of a target detection method based on frequency domain fusion according to an embodiment of the present invention is shown, where the target detection method may be applied to a server in fig. 1, and the server is connected to a corresponding client to provide model training service for the client. As shown in fig. 2, the object detection method may include the following steps.
S201: and randomly masking part of data from the image to be detected according to a preset mask rule, and then retaining Y groups of data.
In step S201, according to a preset mask rule, part of data is randomly masked from an image to be detected, and then Y group data is reserved, wherein the Y group data is Y column data in the image to be detected or Y row data in the image to be detected, the image to be detected includes X channel data, and Y and X are integers greater than 1.
In this embodiment, when sampling the Y columns of data in the image to be detected, data are collected along the width direction of the image. If the width of the image to be detected is W, the image is divided into Y regions along the width direction, each containing W/Y columns, and 1 column of data is sampled in each region to obtain Y groups of data. Each group contains 3 channels of data, giving 3Y data signals in total; in this embodiment, 3Y is approximately W/4.
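For illustration, a minimal NumPy sketch of this column-sampling step is given below; the function name, shapes and the choice of Y are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def sample_columns(image: np.ndarray, Y: int, rng=None) -> np.ndarray:
    """Split the width into Y regions and randomly keep one column per region.

    image: (H, W, C) array. Returns (C * Y, H): Y sampled columns, each
    contributing C one-dimensional channel signals of length H.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W, C = image.shape
    cols_per_region = W // Y
    kept = []
    for r in range(Y):
        start = r * cols_per_region
        col = start + rng.integers(cols_per_region)  # random column inside region r
        kept.append(image[:, col, :])                # shape (H, C)
    sampled = np.stack(kept)                         # (Y, H, C)
    return sampled.transpose(0, 2, 1).reshape(Y * C, H)

# With W = 240 and Y = 20, 3Y = 60 = W/4 channel signals of length H are kept.
signals = sample_columns(np.zeros((240, 240, 3), dtype=np.float32), Y=20)
print(signals.shape)  # (60, 240)
```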
S202: and performing s-level slow-fading high-low frequency transformation on each channel data of each group of data in the Y groups of data to obtain s-layer frequency domain data feature vectors in each channel data of each group of data.
In step S202, the collected Y groups of data are processed by an s-level slow-decay high-low frequency transformation to obtain s layers of frequency domain data feature vectors for each channel data of each group of data; wherein the frequency domain data feature vectors include s1 high-frequency data feature vectors and s2 low-frequency data feature vectors, and s, s1 and s2 are integers greater than 1 with s1 less than s and s2 less than s.
In this embodiment, the slow-decay high-low frequency transform processes each channel data in each group of data to obtain a corresponding frequency domain data feature vector. Performing the transform over s levels yields s layers of frequency domain data feature vectors, where each layer contains the feature vector of each channel of the Y groups of data. Each level of the transform produces a corresponding high-frequency data feature vector and low-frequency data feature vector, so s levels yield at most s high-frequency data feature vectors and s low-frequency data feature vectors. The high-frequency and low-frequency data feature vectors are computed as follows:
The s-level slow-decay high-low frequency transformation is defined iteratively, where s is the number of transform levels, s ∈ N. The initial stage is set to the k-th term of the input discrete signal, where k ∈ Z+. Applying the iterative formula successively yields the v high-frequency coefficients and the v low-frequency coefficients of level 1, and so on up through the v high-frequency coefficients and the v low-frequency coefficients of level s. Both the high-frequency attenuation coefficient and the low-frequency slow-decay coefficient, with ω the frequency coefficient, are constructed from the quasi-function of the slow-decay high-low frequency transformation; the quasi-function is a function taking n as its independent variable.
The high-frequency coefficients and low-frequency coefficients obtained by the slow-decay high-low frequency transform then form the high-frequency data feature vector and the low-frequency data feature vector, respectively.
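The patent's coefficient formulas are given as figures and are not reproduced above. The sketch below therefore substitutes a Haar-style averaging/differencing pair as a stand-in for the slow-decay high- and low-frequency coefficients; only the recursive level structure follows the description.

```python
import numpy as np

def multilevel_transform(signal: np.ndarray, s: int):
    """s-level decomposition: each level splits the current low-frequency band
    into a high-frequency band and a half-length low-frequency band.

    Stand-in filters: Haar averaging/differencing, NOT the patent's
    slow-decay coefficients (those formulas are not reproduced here).
    """
    highs, low = [], signal.astype(np.float64)
    for _ in range(s):
        if len(low) % 2:                             # pad to even length
            low = np.append(low, low[-1])
        even, odd = low[0::2], low[1::2]
        highs.append((even - odd) / np.sqrt(2.0))    # high-frequency coefficients
        low = (even + odd) / np.sqrt(2.0)            # low-frequency coefficients
    return highs, low

# A length-H signal yields high bands of length H/2, H/4, H/8 and a final
# low band of length H/8 for s = 3, matching the sizes quoted below.
highs, low = multilevel_transform(np.arange(64, dtype=np.float32), s=3)
print([h.shape for h in highs], low.shape)  # [(32,), (16,), (8,)] (8,)
```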
Optionally, performing the s-level slow-decay high-low frequency transformation on each channel data in each set of data to obtain the s layers of frequency domain data feature vectors in each channel data in each set of data includes:
acquiring each channel data in each group of data;
and respectively inputting each channel data in each group of data into an s-level slow-decay high-low frequency transformation model, and outputting s1 high-frequency data feature vectors and s2 low-frequency data feature vectors corresponding to each channel data in each group of data.
In this embodiment, Y columns of data are sampled and each column contains 3 channels of data, so the acquired data is 3 times the sampled column data. A 3-level slow-decay high-low frequency transformation is performed on the image to be detected, yielding a first-layer high-frequency data feature vector, a second-layer high-frequency data feature vector and a third-layer low-frequency data feature vector. If the height of the image to be detected is H, the three frequency domain data feature vectors have sizes 3Y*H/2, 3Y*H/4 and 3Y*H/8 respectively, where 3Y is the number of channels of the Y columns of data.
When the 3-level slow-decay high-low frequency transformation is performed on the image to be detected, the first level yields a high-frequency data feature vector and a low-frequency data feature vector of size 3Y*H/2 each; for each channel of each column, these are half the size of the original sampled signal. The second level processes the low-frequency data feature vector obtained at the first layer, producing second-layer high-frequency and low-frequency data feature vectors that are half the size of the first-layer low-frequency vector, i.e. 3Y*H/4. The third level likewise processes the second-layer low-frequency data feature vector, producing third-layer high-frequency and low-frequency data feature vectors of size 3Y*H/8. The low-frequency data feature vector produced at each level is thus the basis of the next level of decomposition; in use, the high-frequency data feature vectors of the first few layers and the low-frequency data feature vector of the last layer are generally used.
S203: and inputting the s1 high-frequency data characteristic vectors and the s2 low-frequency data characteristic vectors into the trained frequency domain self-attention neural network, and outputting the region category information and the region size information of the target detection region in the image to be detected.
In step S203, each of the s1 high-frequency data feature vectors has a size of (number of channels of the Y groups of data) * (length of that layer's high-frequency feature vector), and likewise for the s2 low-frequency data feature vectors. The trained frequency domain self-attention neural network is a neural network trained on frequency domain features.
In this embodiment, the trained frequency domain self-attention neural network has s1+s2 input terminals, each of which receives one high-frequency data feature vector or one low-frequency data feature vector. Through feature extraction in the trained network, the region category information and region size information of the target detection region of the image to be detected are output. For the first-layer high-frequency, second-layer high-frequency and third-layer low-frequency data feature vectors obtained in this embodiment, the trained network has 3 input terminals, into which the three vectors are input in turn.
Optionally, the frequency domain self-attention neural network includes:
a first frequency domain self-attention feature vector extraction network, a second frequency domain self-attention feature vector extraction network, a fusion module, and a category and size prediction module (a minimal structural sketch follows the component list below). Referring to fig. 3, a schematic structure of a frequency domain self-attention neural network according to an embodiment of the present invention is shown.
The first frequency domain self-attention feature vector extraction network is used for inputting the high-frequency data feature vector and outputting a first target frequency domain self-attention feature vector;
a second frequency domain self-attention feature vector extraction network for inputting the low frequency data feature vector and outputting a second target frequency domain self-attention feature vector;
the fusion module is used for fusing the first target frequency domain self-attention feature vector and the second target frequency domain self-attention feature vector to obtain a fused frequency domain feature vector;
and the category and size prediction module is used for predicting the fused frequency domain feature vector to obtain the region category information and the region size information of the target detection region of the image to be detected.
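As a structural illustration only, the four components listed above can be wired together as in the following PyTorch skeleton; all class and parameter names here are hypothetical, not from the patent.

```python
import torch
import torch.nn as nn

class FrequencyDomainDetector(nn.Module):
    """Skeleton: two extraction branches, a fusion step, and twin prediction heads."""
    def __init__(self, high_branch, low_branch, fusion, cls_head, size_head):
        super().__init__()
        self.high_branch = high_branch  # first frequency domain self-attention extractor
        self.low_branch = low_branch    # second frequency domain self-attention extractor
        self.fusion = fusion            # fuses the two target feature vectors
        self.cls_head = cls_head        # region category prediction
        self.size_head = size_head      # region size prediction

    def forward(self, high_vec: torch.Tensor, low_vec: torch.Tensor):
        f_high = self.high_branch(high_vec)
        f_low = self.low_branch(low_vec)
        fused = self.fusion(f_high, f_low)
        return self.cls_head(fused), self.size_head(fused)
```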
In this embodiment, the first frequency domain self-attention feature vector extraction network is used for inputting the high frequency data feature vector, inputting the high frequency data feature vector into the first frequency domain self-attention feature vector extraction network, and outputting the first target frequency domain self-attention feature vector.
The second frequency domain self-attention feature vector extraction network is used for inputting the low-frequency data feature vector: the low-frequency data feature vector is input into the second frequency domain self-attention feature vector extraction network, which outputs the second target frequency domain self-attention feature vector.
And the fusion module is used for fusing the first target frequency domain self-attention feature vector and the second target frequency domain self-attention feature vector to obtain a fused frequency domain feature vector.
Referring to fig. 4, a schematic structural diagram of a fusion module according to an embodiment of the invention is shown.
The fusion module comprises a connecting layer, a first convolution layer, a second convolution layer, a first regular normalization layer, a first activation layer, a second activation layer, a residual layer and a summation layer.
Before the first target frequency domain self-attention feature vector and the second target frequency domain self-attention feature vector are connected, each is dimension-expanded into a three-dimensional vector. The connecting layer then connects the two expanded vectors to obtain a connection vector, in which the first dimension represents the number of connected vectors and the second and third dimensions represent the size of the original vectors. In this embodiment, the connection vector after the connecting layer is [2, H/4, W/4], where the 2 in the first dimension is the number of connected vectors.
The connection vector passes through the first convolution layer and the first activation layer to obtain a first fused feature vector of preset depth; the first fused feature vector then passes through the second convolution layer, the regular normalization layer and the second activation layer to obtain a second fused feature vector of preset depth. The residual layer connects the first and second fused feature vectors in a residual manner, and the summation layer yields the target fused feature vector.
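A hedged PyTorch sketch of this fusion dataflow follows. The depth, kernel sizes and the use of BatchNorm for the "regular normalization layer" are assumptions; only the connect → conv/act → conv/norm/act → residual-sum ordering comes from the description.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Sketch: concat -> conv + act -> conv + norm + act -> residual sum."""
    def __init__(self, depth: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(2, depth, kernel_size=3, padding=1)
        self.act1 = nn.ReLU()
        self.conv2 = nn.Conv2d(depth, depth, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(depth)   # assumed form of the normalization layer
        self.act2 = nn.ReLU()

    def forward(self, f_high: torch.Tensor, f_low: torch.Tensor) -> torch.Tensor:
        # expand each [H/4, W/4] vector to 3-D and connect: [2, H/4, W/4]
        x = torch.stack([f_high, f_low], dim=0).unsqueeze(0)  # [1, 2, H/4, W/4]
        first = self.act1(self.conv1(x))                      # first fused vector
        second = self.act2(self.norm(self.conv2(first)))      # second fused vector
        return first + second                                 # residual + summation

fused = FusionModule()(torch.randn(64, 64), torch.randn(64, 64))
print(fused.shape)  # torch.Size([1, 64, 64, 64])
```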
The category and size prediction module is used for predicting the input fusion feature vector and outputting two prediction results, namely a region category prediction result and a region size information result of the target detection region.
Referring to fig. 5, a schematic structure diagram of a class and size prediction module according to an embodiment of the invention is shown.
Optionally, the category and size prediction module includes: the first prediction sub-module and the second prediction sub-module;
the first prediction submodule is used for inputting the fusion frequency domain feature vector, predicting the region type information of the target detection region of the image to be detected through the fusion frequency domain feature vector, and outputting a region type information prediction result of the target detection region of the image to be detected;
The second prediction submodule is used for inputting the fusion frequency domain feature vector, predicting the region size information of the target detection region of the image to be detected through the fusion frequency domain feature vector, and outputting a region size information prediction result of the target detection region of the image to be detected.
The category and size prediction module is used for predicting the input fusion feature vector, wherein the category and size prediction module comprises a first prediction sub-module and a second prediction sub-module, the structural composition of the first prediction sub-module and the structural composition of the second prediction sub-module are the same, and the first prediction sub-module and the second prediction sub-module comprise a third convolution layer, a fourth convolution layer and a second regular normalization layer. And the fusion feature vector sequentially passes through a third convolution layer, a second regular normalization layer and a fourth convolution layer to carry out convolution normalization processing, and a predicted value corresponding to the first prediction sub-module and a predicted value corresponding to the second prediction sub-module are output.
In this embodiment, the first prediction submodule predicts the region category information of the target detection region, and the second prediction submodule predicts its region size information. For example, suppose there are 10 target categories in total, a target object has category value 1 with category index 0, and the region center point of one target detection region is [100, 50]; then the output of the first submodule, indexed as [category index, y, x], is [0, 50, 100] = 1. If the region size of the target detection region is width 50 and height 25, the outputs of the second submodule are [0, 50, 100] = 50 and [1, 50, 100] = 25.
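A sketch of one such head (third convolution → second normalization → fourth convolution, per the description) is shown below, with the output indexed [channel, y, x] to match the worked example; the channel counts and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Third conv -> second normalization layer -> fourth conv."""
    def __init__(self, in_ch: int, out_ch: int, mid_ch: int = 64):
        super().__init__()
        self.conv3 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        self.norm2 = nn.BatchNorm2d(mid_ch)  # assumed normalization form
        self.conv4 = nn.Conv2d(mid_ch, out_ch, kernel_size=1)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.conv4(self.norm2(self.conv3(fused)))

cls_head = PredictionHead(in_ch=64, out_ch=10)  # 10 region categories
size_head = PredictionHead(in_ch=64, out_ch=2)  # channel 0: width, channel 1: height
fused = torch.randn(1, 64, 128, 128)
cls_map, size_map = cls_head(fused), size_head(fused)
# cls_map[0, 0, 50, 100] is the score of category 0 at centre (x=100, y=50);
# size_map[0, 0, 50, 100] and size_map[0, 1, 50, 100] are width and height there.
print(cls_map.shape, size_map.shape)
```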
In another embodiment, the frequency-domain self-attention neural network includes a first frequency-domain self-attention feature vector extraction network, a second frequency-domain self-attention feature vector extraction network, a third frequency-domain self-attention feature vector extraction network, a fusion module, and a class and size prediction module.
Referring to fig. 6, a schematic structure of a frequency domain self-attention neural network according to an embodiment of the present invention is shown.
In this embodiment, the first frequency domain self-attention feature vector extraction network is configured to input a first high frequency data feature vector and output a first target frequency domain self-attention feature vector;
a second frequency domain self-attention feature vector extraction network for inputting the low frequency data feature vector and outputting a second target frequency domain self-attention feature vector;
a third frequency domain self-attention feature vector extraction network for inputting a second high frequency data feature vector and outputting a third target frequency domain self-attention feature vector;
the fusion module is used for fusing the first target frequency domain self-attention feature vector, the second target frequency domain self-attention feature vector and the third target frequency domain self-attention feature vector to obtain a fused frequency domain feature vector;
And the category and size prediction module is used for predicting the fused frequency domain feature vector to obtain the region category information and the region size information of the target detection region of the image to be detected.
Wherein the first high frequency data feature vector and the second high frequency data feature vector are any two high frequency data feature vectors which are different from each other among the s1 high frequency data feature vectors.
Optionally, the first frequency domain self-attention feature vector extraction network includes: a first high-frequency signal encoder, a first size-reorganizing module and a first transpose module, wherein,
the input end of the first high-frequency signal encoder is used for inputting a first high-frequency data characteristic vector, and the output end of the first high-frequency signal encoder is used for outputting an encoded first frequency domain self-attention characteristic vector;
the first size reorganizing module is used for reorganizing the first frequency domain self-attention feature vector to obtain a first frequency domain self-attention feature vector with a preset size;
the first transpose module is used for transposing the first frequency domain self-attention feature vector of the preset size to obtain the first target frequency domain self-attention feature vector.
Referring to fig. 7, a schematic diagram of a first frequency-domain self-attention feature vector extraction network according to an embodiment of the present invention is shown.
In this embodiment, the first high-frequency signal encoder is configured to input a high-frequency data feature vector, input the high-frequency data feature vector into the first high-frequency signal encoder, and output a first frequency-domain self-attention feature vector. Wherein the high frequency data feature vector may be any one of s1 high frequency data feature vectors.
The first high-frequency signal encoder includes at least one encoder.
The first size-reorganizing module reorganizes the data output by the first high-frequency signal encoder. Because the frequency domain data feature vectors fed to the different input terminals of the network come from different layers, their sizes differ; to facilitate subsequent processing, the first size-reorganizing module reshapes the encoder output into a feature vector of a preset size. In this embodiment the preset size is one quarter of the width and height of the image to be detected, i.e. [W/4, H/4].
The first transpose module transposes the first frequency domain self-attention feature vector of the preset size: the vector output by the first size-reorganizing module has shape [width, height], and after transposition its shape is [height, width]. In this embodiment the transposed vector of preset size is [H/4, W/4], which serves as the first target frequency domain self-attention feature vector.
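A minimal sketch of this reorganize-then-transpose step, assuming a simple row-major reshape (the patent fixes only the input and output shapes):

```python
import torch

def reorganize_and_transpose(encoded: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Recombine an encoder output to the preset size [W/4, H/4], then transpose.

    h and w stand for H/4 and W/4; the row-major flattening order is an
    assumption, since only the shapes are specified.
    """
    preset = encoded.reshape(w, h)  # size-reorganizing module: [width, height]
    return preset.transpose(0, 1)   # transpose module: [height, width] = [H/4, W/4]

target = reorganize_and_transpose(torch.randn(64 * 64), h=64, w=64)
print(target.shape)  # torch.Size([64, 64])
```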
Optionally, the second frequency domain self-attention feature vector extraction network includes: a second low frequency signal encoder, a second size reorganizing module and a second transpose module, wherein,
the input end of the second low-frequency signal encoder is used for inputting the low-frequency data characteristic vector, and the output end of the second low-frequency signal encoder is used for outputting the encoded second frequency domain self-attention characteristic vector;
the second size reorganizing module is used for reorganizing the second frequency domain self-attention feature vector to obtain a second frequency domain self-attention feature vector with a preset size;
the second transpose module is configured to transpose the second frequency-domain self-attention feature vector with the preset size to obtain a second target frequency-domain self-attention feature vector.
Referring to fig. 8, a schematic diagram of a second frequency-domain self-attention feature vector extraction network according to an embodiment of the present invention is shown.
In this embodiment, the second low frequency signal encoder is configured to input a low frequency data feature vector, input the low frequency data feature vector into the second low frequency signal encoder, and output the encoded second frequency domain self-attention feature vector.
The second size-reorganizing module reorganizes the data output by the second low-frequency signal encoder, whose size equals that of the low-frequency data feature vector at the encoder's input. The reorganized feature vector has the preset size, equal to the size used in the first size-reorganizing module.
The second transpose module transposes the second frequency domain self-attention feature vector of the preset size: the vector output by the second size-reorganizing module has shape [width, height], and the transposed vector, of shape [height, width], serves as the second target frequency domain self-attention feature vector.
The first high-frequency signal encoder and the second low-frequency signal encoder have the same structure, each comprising a multi-head self-attention module, a standardized fusion module, a full-connection module and a standardized derivation module.
Referring to fig. 9, a schematic diagram of an encoder according to an embodiment of the present invention is provided.
In this embodiment, the first high-frequency signal encoder and the second low-frequency signal encoder have the same structure, each consisting of 3 encoders, and each encoder comprises a multi-head self-attention module, a standardized fusion module, a full-connection module and a standardized derivation module. The multi-head self-attention module extracts self-attention features from the input high-frequency or low-frequency data features and outputs a frequency domain self-attention feature vector. The standardized fusion module adds the input and the output of the multi-head self-attention module and normalizes the sum; the full-connection module applies a non-linear transformation to the normalized frequency domain self-attention feature vector; and the standardized derivation module adds the input and output of the full-connection module and normalizes the sum, outputting a frequency domain self-attention feature vector of the same size as the high-frequency or low-frequency data feature vector at the encoder's input.
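This is the standard transformer-encoder layout, which the following PyTorch sketch reproduces; the embedding dimension, head count and feed-forward width are assumptions not given in the patent.

```python
import torch
import torch.nn as nn

class FrequencyEncoder(nn.Module):
    """One encoder block: multi-head self-attention, add & normalize,
    full-connection (non-linear) transform, add & normalize."""
    def __init__(self, dim: int = 256, heads: int = 8, ff_dim: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)  # standardized fusion (add & norm)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(),
                                nn.Linear(ff_dim, dim))
        self.norm2 = nn.LayerNorm(dim)  # standardized derivation (add & norm)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)   # frequency domain self-attention
        x = self.norm1(x + attn_out)       # add input and output, then normalize
        return self.norm2(x + self.ff(x))  # same for the full-connection module

# three stacked encoders, as in the described high/low-frequency signal encoders
encoder = nn.Sequential(*[FrequencyEncoder() for _ in range(3)])
out = encoder(torch.randn(1, 60, 256))     # output size equals input size
```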
According to a preset mask rule, part of the data is randomly masked from the image to be detected and Y groups of data are retained; the Y groups of data are Y columns of data or Y rows of data in the image to be detected, the image to be detected comprises X channels of data, and X and Y are integers greater than 1. An s-level slow-decay high-low frequency transformation is performed on each channel of each group of the Y groups of data to obtain s layers of frequency domain data feature vectors for each channel of each group; the frequency domain data feature vectors comprise s1 high-frequency and s2 low-frequency data feature vectors, where s, s1 and s2 are integers greater than 1, s1 is less than s, and s2 is less than s. The s1 high-frequency and s2 low-frequency data feature vectors are input into the trained frequency domain self-attention neural network, which outputs the region category information and region size information of the target detection region in the image to be detected.
Referring to fig. 10, a flow chart of a method for detecting a target based on frequency domain fusion according to an embodiment of the present invention, as shown in fig. 10, the method for detecting a target based on frequency domain fusion may include the following steps:
s1001: according to a preset mask rule, randomly masking part of data from the image to be detected and then reserving Y groups of data;
s1002: performing an s-level slow-decay high-low frequency transformation on each channel data of each group of data in the Y groups of data to obtain s layers of frequency domain data feature vectors for each channel data in each group of data;
the contents of the steps S1001 to S1002 are the same as those of the steps S201 to S202, and reference may be made to the descriptions of the steps S201 to S202, which are not repeated herein.
S1003: acquiring region type tag data corresponding to a target detection region of a sample image and first channel tag data and second channel tag data in a region size channel;
s1004: and training the pre-constructed frequency domain self-attention neural network by taking the region type label data, the first channel label data and the second channel label data as priori data to obtain a trained frequency domain self-attention neural network.
In this embodiment, when training the pre-constructed frequency domain self-attention neural network, labeled sample data corresponding to the network outputs are first acquired. For example, the network's output includes two prediction submodules: the first prediction submodule predicts the region category of the target detection region, for which region category label data are acquired; the second prediction submodule predicts the region size information of the target detection region, and since the size information consists of two parts, two channels of label data need to be acquired separately.
The region category label data serve as the first label data, and the first and second channel label data serve as the second label data; the first and second label data are used as the first and second prior data for training the pre-constructed frequency domain self-attention neural network. A loss function is constructed as the sum of the difference between the output of the first prediction submodule and the first prior data and the difference between the output of the second prediction submodule and the second prior data. The loss value is computed for each training iteration, and the network is updated by back-propagating gradients of the loss, yielding the trained frequency domain self-attention neural network.
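A hedged sketch of one training step under this scheme follows. The concrete loss terms (cross-entropy for the category head, L1 for the size head) are assumptions, as the patent only specifies that the loss is the sum of each head's difference from its prior data.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, inputs, cls_labels, size_labels):
    """One update: sum the two heads' differences from the prior (label) data."""
    cls_pred, size_pred = model(*inputs)
    loss = (F.cross_entropy(cls_pred, cls_labels)  # first head vs. first prior data
            + F.l1_loss(size_pred, size_labels))   # second head vs. second prior data
    optimizer.zero_grad()
    loss.backward()                                # gradient back-propagation
    optimizer.step()                               # network update
    return loss.item()
```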
S1005: and inputting the s1 high-frequency data characteristic vectors and the s2 low-frequency data characteristic vectors into the trained frequency domain self-attention neural network, and outputting the region category information and the region size information of the target detection region in the image to be detected.
The content of the step S1005 is the same as that of the step S203, and reference is made to the description of the step S203, which is not repeated here.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a target detection device based on frequency domain fusion according to an embodiment of the present invention. The terminal in this embodiment includes units for performing the steps in the embodiments corresponding to fig. 2 to fig. 10; please refer to those figures and the related descriptions in the corresponding embodiments. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 11, the detection device 110 includes: a masking sample-reserving module 111, a slow-decay high-low frequency transform module 112, and a detection module 113.
The masking sample-reserving module 111 is configured to randomly mask part of the data from the image to be detected according to a preset masking rule, and then reserve Y sets of data; the Y-group data are Y-column data in an image to be detected or Y-row data in the image to be detected, the image to be detected comprises X channel data, and Y and X are integers larger than 1;
the slow-decay high-low frequency transform module 112 is configured to perform an s-level slow-decay high-low frequency transformation on each channel data of each group of data in the Y groups of data to obtain s layers of frequency domain data feature vectors for each channel data in each group of data; wherein the frequency domain data feature vectors include s1 high-frequency data feature vectors and s2 low-frequency data feature vectors; wherein s, s1, s2 are integers greater than 1, s1 is less than s, and s2 is less than s;
the detection module 113 is configured to input s1 high-frequency data feature vectors and s2 low-frequency data feature vectors into the trained frequency domain self-attention neural network, and output region type information and region size information of a target detection region in the image to be detected.
Optionally, the slow-decay high-low frequency transform module 112 includes:
a channel data acquisition unit configured to acquire each channel data in each set of data;
a signal acquisition unit, configured to input each channel data in each group of data into the s-level slow-decay high-low frequency transformation model and output the s1 high-frequency data feature vectors and s2 low-frequency data feature vectors corresponding to each channel data in each group of data.
Optionally, the frequency domain self-attention neural network includes:
the device comprises a first high-frequency signal encoder, a first size reorganization module, a first transfer module, a fusion module, a type and size prediction module, wherein the input end of the first high-frequency signal encoder is used for inputting a first high-frequency data characteristic vector, and the output end of the first high-frequency signal encoder is used for outputting an encoded first frequency domain self-attention characteristic vector;
the first size reorganizing module is used for reorganizing the first frequency domain self-attention feature vector to obtain a first frequency domain self-attention feature vector with a preset size;
the first transpose module is used for transposing the first frequency domain self-attention feature vector of the preset size to obtain the transposed first frequency domain self-attention feature vector of the preset size.
Optionally, the frequency domain self-attention neural network further includes:
a second high frequency signal encoder, a second size reorganizing module and a second transpose module, a low frequency signal encoder, a third size reorganizing module and a third transpose module;
The input end of the second high-frequency signal encoder is used for inputting a second high-frequency data characteristic vector, and the output end of the second high-frequency signal encoder is used for outputting an encoded second frequency domain self-attention characteristic vector;
the second size reorganizing module is used for reorganizing the second frequency domain self-attention feature vector to obtain a preset size second frequency domain self-attention feature vector;
the second transposition module is used for transposing the preset-size second frequency domain self-attention characteristic vector to obtain a transposed preset-size second frequency domain self-attention characteristic vector;
the input end of the low-frequency signal encoder is used for inputting the low-frequency data characteristic vector, and the output end of the low-frequency signal encoder is used for outputting the encoded third frequency domain self-attention characteristic vector;
the third-size recombination module is used for recombining the third frequency-domain self-attention feature vector to obtain a third frequency-domain self-attention feature vector with a preset size;
the third transposition module is used for transposing the third frequency domain self-attention characteristic vector with the preset size to obtain a transposed third frequency domain self-attention characteristic vector with the preset size;
the fusion module is used for fusing the first frequency domain self-attention feature vector with the preset size, the second frequency domain self-attention feature vector with the preset size and the third frequency domain self-attention feature vector with the preset size to obtain a fused frequency domain feature vector;
The category and size prediction module is used for predicting the fused frequency domain feature vector to obtain the region category information and the region size information of the target detection region of the image to be detected.
Optionally, the first high-frequency signal encoder, the second high-frequency signal encoder and the low-frequency signal encoder have the same structure and each comprise a multi-head self-attention module, a standardized fusion module, a full-connection module and a standardized derivation module;
the multi-head self-attention module is used for inputting frequency domain data feature vectors;
the standardized fusion module is used for adding the input in the multi-head self-attention module and the output in the multi-head self-attention module and performing standardized processing;
the full-connection module is used for carrying out nonlinear transformation on the standardized processing result;
and the standardized derivation module is used for adding the input of the full-connection module and the output of the full-connection module and performing standardized processing.
Optionally, the detecting device 110 further includes:
the initial image to be detected acquisition module is used for acquiring the initial image to be detected.
Optionally, the detecting device 110 further includes:
the label data acquisition module is used for acquiring the area category label data corresponding to the sample image target detection area and the first channel label data and the second channel label data in the area size channel;
The training module is used for training the pre-constructed frequency domain self-attention neural network by taking the region type label data, the first channel label data and the second channel label data as prior data to obtain a trained frequency domain self-attention neural network.
It should be noted that, because the content of information interaction and execution process between the above units is based on the same concept as the method embodiment of the present invention, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein.
Fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 12, the computer device of this embodiment includes: at least one processor (only one shown in fig. 12), a memory, and a computer program stored in the memory and executable on the at least one processor, the processor executing the computer program to perform the steps of any of the various frequency domain fusion-based object detection method embodiments described above.
The computer device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that fig. 12 is merely an example of a computer device and is not intended to be limiting, and that a computer device may include more or fewer components than shown, or may combine certain components, or different components, such as may also include a network interface, a display screen, an input device, and the like.
The processor may be a CPU, but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory includes a readable storage medium, an internal memory, and the like, where the internal memory provides an environment for running the operating system and the computer-readable instructions stored in the readable storage medium. The readable storage medium may be a hard disk of the computer device; in other embodiments, it may be an external storage device of the computer device, for example, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) provided on the computer device. Further, the memory may include both an internal storage unit and an external storage device of the computer device. The memory is used to store the operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated. In practical applications, the above functions may be assigned to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit; the integrated units may be implemented in the form of hardware or in the form of software functional units. In addition, the specific names of the functional units and modules are only for distinguishing them from each other and do not limit the protection scope of the present invention. For the specific working process of the units and modules in the above apparatus, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying computer program code, a recording medium, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunications signals.
The present invention may also be implemented as a computer program product which, when run on a computer device, causes the computer device to execute all or part of the steps of the method embodiments described above.
Each of the foregoing embodiments is described with its own emphasis; for parts that are not detailed or illustrated in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus/computer device and method may be implemented in other manners. For example, the apparatus/computer device embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention and are intended to be included in the protection scope of the present invention.
Claims (9)
1. A target detection method based on frequency domain fusion, characterized by comprising the following steps:
according to a preset mask rule, evenly dividing an image to be detected into Y regions in the width direction, and sampling 1 column of data in each region to obtain Y groups of data, as illustrated in the sketch after this claim; the image to be detected comprises X channel data, wherein X and Y are integers larger than 1;
performing s-level slow-decay high-low frequency transformation on each channel data of each group of data in the Y groups of data to obtain s layers of frequency domain data feature vectors in each channel data of each group of data; wherein the frequency domain data feature vectors include s1 high-frequency data feature vectors and s2 low-frequency data feature vectors, s1 is less than or equal to s, and s2 is less than or equal to 1; the calculation formula of the s-level slow-decay high-low frequency transformation is as follows:
wherein s is the number of levels of the transformation; x_k denotes the k-th term of the input discrete signal; the level-1 high-frequency coefficients H^(1) and low-frequency coefficients L^(1) through the level-s high-frequency coefficients H^(s) and low-frequency coefficients L^(s) are obtained sequentially by the iterative formula; α is the high-frequency slow-decay coefficient, and β is the low-frequency slow-decay coefficient;
Inputting the s1 high-frequency data feature vectors and the s2 low-frequency data feature vectors into a trained frequency domain self-attention neural network, and outputting region category information and region size information of a target detection region in the image to be detected, wherein the frequency domain self-attention neural network comprises:
the first frequency domain self-attention feature vector extraction network is used for inputting the high-frequency data feature vector and outputting a first target frequency domain self-attention feature vector;
A second frequency domain self-attention feature vector extraction network for inputting the low frequency data feature vector and outputting a second target frequency domain self-attention feature vector;
the fusion module is used for fusing the first target frequency domain self-attention feature vector and the second target frequency domain self-attention feature vector to obtain a fused frequency domain feature vector;
and the category and size prediction module is used for performing prediction on the fused frequency domain feature vector to obtain the region category information and the region size information of the target detection region of the image to be detected.
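The mask-rule sampling referenced in the first step of claim 1 can be sketched as follows in NumPy; sampling the first column of each region is an illustrative assumption, since the patent leaves the exact column to the preset mask rule:

```python
import numpy as np

def mask_sample_columns(image, Y):
    """Evenly divide the image into Y regions along the width and sample
    one column of data from each region, giving Y groups of data.
    image: array of shape (X, H, W) with X channel data; assumes W >= Y."""
    X, H, W = image.shape
    region_width = W // Y
    groups = []
    for i in range(Y):
        col = i * region_width           # illustrative choice: first column of region i
        groups.append(image[:, :, col])  # one (X, H) group per region
    return np.stack(groups)              # Y groups, shape (Y, X, H)
```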
2. The target detection method based on frequency domain fusion as claimed in claim 1, wherein the first frequency domain self-attention feature vector extraction network comprises: a first high-frequency signal encoder, a first size reorganizing module and a first transpose module, wherein,
the input end of the first high-frequency signal encoder is used for inputting the high-frequency data feature vector, and the output end of the first high-frequency signal encoder is used for outputting the encoded first frequency domain self-attention feature vector;
the first size reorganizing module is used for reorganizing the first frequency domain self-attention feature vector to obtain a first frequency domain self-attention feature vector with a preset size;
the first transpose module is used for transposing the first frequency domain self-attention feature vector of the preset size to obtain the first target frequency domain self-attention feature vector.
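The reorganize-then-transpose stage of this claim amounts to a reshape followed by a transpose; a minimal sketch under assumed shapes (the preset size must be compatible with the encoder output, which the patent does not further specify):

```python
import torch

def reorganize_and_transpose(encoded, preset_rows, preset_cols):
    """Sketch of the first size reorganizing and first transpose modules.
    encoded: (batch, seq_len, dim) encoder output; requires
    seq_len * dim == preset_rows * preset_cols."""
    batch = encoded.shape[0]
    feat = encoded.reshape(batch, preset_rows, preset_cols)  # reorganize to the preset size
    return feat.transpose(1, 2)                              # transpose to the target layout
```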
3. The target detection method based on frequency domain fusion as claimed in claim 1, wherein the second frequency domain self-attention feature vector extraction network comprises: a second low-frequency signal encoder, a second size reorganizing module and a second transpose module, wherein,
the input end of the second low-frequency signal encoder is used for inputting the low-frequency data feature vector, and the output end of the second low-frequency signal encoder is used for outputting the encoded second frequency domain self-attention feature vector;
the second size reorganizing module is used for reorganizing the second frequency domain self-attention feature vector to obtain a second frequency domain self-attention feature vector with a preset size;
the second transpose module is configured to transpose the second frequency-domain self-attention feature vector with the preset size to obtain a second target frequency-domain self-attention feature vector.
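Claims 2 and 3 produce the first and second target frequency domain self-attention feature vectors, which the fusion module of claim 1 then combines. A minimal sketch assuming concatenation followed by a linear projection; the patent does not fix the fusion operator:

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Sketch of the fusion module: combines the high- and low-frequency
    target feature vectors into a fused frequency domain feature vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)  # project the concatenation back to dim

    def forward(self, high_feat, low_feat):  # each: (batch, seq_len, dim)
        fused = torch.cat([high_feat, low_feat], dim=-1)
        return self.proj(fused)              # fused frequency domain feature vector
```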
4. The target detection method based on frequency domain fusion as claimed in claim 1, wherein the category and size prediction module comprises: a first prediction submodule and a second prediction submodule;
the first prediction submodule is used for receiving the fused frequency domain feature vector, predicting the region category information of the target detection region of the image to be detected from the fused frequency domain feature vector, and outputting a region category information prediction result of the target detection region of the image to be detected;
the second prediction submodule is used for receiving the fused frequency domain feature vector, predicting the region size information of the target detection region of the image to be detected from the fused frequency domain feature vector, and outputting a region size information prediction result of the target detection region of the image to be detected.
5. The target detection method based on frequency domain fusion as claimed in claim 1, wherein said performing s-level slow-decay high-low frequency transformation on each channel data of each group of data to obtain s layers of frequency domain data feature vectors in each channel data of each group of data comprises:
acquiring each channel data in each group of data;
and respectively inputting each channel data in each group of data into an s-level slow-decay high-low frequency transformation model, and outputting s1 high-frequency data feature vectors and s2 low-frequency data feature vectors corresponding to each channel data in each group of data.
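The body of the iterative formula did not survive extraction, so the sketch below only assumes the structure named in claim 1: each level splits its input into high- and low-frequency coefficients damped by the slow-decay coefficients α and β, with the low-frequency output feeding the next level, as in a Haar-like pyramid. It illustrates that structure and is not the patent's exact formula:

```python
import numpy as np

def slow_decay_transform(x, s, alpha=0.9, beta=0.9):
    """Hedged sketch of an s-level slow-decay high-low frequency transform:
    a Haar-like split per level, damped by alpha (high) and beta (low).
    Yields s high-frequency vectors and one final low-frequency vector."""
    high_bands, low = [], np.asarray(x, dtype=float)
    for _ in range(s):                       # s must suit the signal length
        even, odd = low[0::2], low[1::2]
        n = min(len(even), len(odd))
        high_bands.append(alpha * (even[:n] - odd[:n]))  # this level's high-frequency coefficients
        low = beta * (even[:n] + odd[:n]) / 2            # low-frequency input to the next level
    return high_bands, low
```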
6. The target detection method based on frequency domain fusion as claimed in claim 1, wherein before the inputting of the s1 high-frequency data feature vectors and the s2 low-frequency data feature vectors into the trained frequency domain self-attention neural network and the outputting of the region category information and the region size information of the target detection region in the image to be detected, the method further comprises:
acquiring region category label data corresponding to the target detection region of a sample image, and first channel label data representing the width and second channel label data representing the height in a region size channel;
and training the pre-constructed frequency domain self-attention neural network by taking the region category label data, the first channel label data and the second channel label data as prior data, to obtain the trained frequency domain self-attention neural network.
7. A target detection apparatus based on frequency domain fusion, the apparatus comprising:
the shielding sample reserving module is used for evenly dividing the image to be detected into Y regions in the width direction according to a preset mask rule, and sampling 1 column of data in each region to obtain Y groups of data; the image to be detected comprises X channel data, and X and Y are integers larger than 1;
the slow-decay high-low frequency transformation module is used for performing s-level slow-decay high-low frequency transformation on each channel data of each group of data in the Y groups of data to obtain s layers of frequency domain data feature vectors in each channel data of each group of data; wherein the frequency domain data feature vectors include s1 high-frequency data feature vectors and s2 low-frequency data feature vectors, s1 is less than or equal to s, and s2 is less than or equal to 1; the calculation formula of the s-level slow-decay high-low frequency transformation is as follows:
wherein s is the number of levels of the transformation; x_k denotes the k-th term of the input discrete signal; the level-1 high-frequency coefficients H^(1) and low-frequency coefficients L^(1) through the level-s high-frequency coefficients H^(s) and low-frequency coefficients L^(s) are obtained sequentially by the iterative formula; α is the high-frequency slow-decay coefficient, and β is the low-frequency slow-decay coefficient;
The detection module is configured to input the s1 high-frequency data feature vectors and the s2 low-frequency data feature vectors into a trained frequency domain self-attention neural network, and output region type information and region size information of a target detection region in the image to be detected, where the frequency domain self-attention neural network includes:
the first frequency domain self-attention feature vector extraction network is used for inputting the high-frequency data feature vector and outputting a first target frequency domain self-attention feature vector;
A second frequency domain self-attention feature vector extraction network for inputting the low frequency data feature vector and outputting a second target frequency domain self-attention feature vector;
the fusion module is used for fusing the first target frequency domain self-attention feature vector and the second target frequency domain self-attention feature vector to obtain a fused frequency domain feature vector;
and the category and size prediction module is used for performing prediction on the fused frequency domain feature vector to obtain the region category information and the region size information of the target detection region of the image to be detected.
8. A computer device comprising a processor, a memory and a computer program stored in the memory and executable on the processor, wherein the processor implements the target detection method based on frequency domain fusion according to any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the target detection method based on frequency domain fusion according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211103064.5A CN115496993B (en) | 2022-09-09 | 2022-09-09 | Target detection method, device, equipment and storage medium based on frequency domain fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115496993A CN115496993A (en) | 2022-12-20 |
CN115496993B true CN115496993B (en) | 2023-07-14 |
Family
ID=84469413
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211103064.5A | Target detection method, device, equipment and storage medium based on frequency domain fusion (CN115496993B, active) | 2022-09-09 | 2022-09-09
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115496993B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114972976B (en) | 2022-07-29 | 2022-12-20 | Zhejiang Lab | Night target detection, training method and device based on frequency domain self-attention mechanism
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111126459A (en) * | 2019-12-06 | 2020-05-08 | Shenzhen Jiuling Software Technology Co., Ltd. | Method and device for identifying fine granularity of vehicle |
CN113936256A (en) * | 2021-10-15 | 2022-01-14 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Image target detection method, device, equipment and storage medium |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101901493B (en) * | 2010-06-21 | 2011-11-30 | Tsinghua University | Method and system for multi-view image combined reconstruction based on compression sampling |
EP2682941A1 (en) * | 2012-07-02 | 2014-01-08 | Technische Universität Ilmenau | Device, method and computer program for freely selectable frequency shifts in the sub-band domain |
ES2549652B1 (en) * | 2014-04-24 | 2016-10-07 | Alstom Transporte, S.A. | Method and system to automatically detect faults in a rotary axis |
US10523856B2 (en) * | 2016-09-08 | 2019-12-31 | Samsung Electronics Co., Ltd. | Method and electronic device for producing composite image |
CN108665964B (en) * | 2018-05-14 | 2022-01-25 | Applied Science College of Jiangxi University of Science and Technology | Medical image wavelet domain real-time encryption and decryption algorithm based on multiple chaotic systems |
CN111199237A (en) * | 2020-01-12 | 2020-05-26 | Hunan University | Attention-based convolutional neural network frequency division feature extraction method |
CN111882002B (en) * | 2020-08-06 | 2022-05-24 | Guilin University of Electronic Technology | A low-light target detection method based on MSF-AM |
CN113536905B (en) * | 2021-06-03 | 2023-08-25 | Dalian Minzu University | Time-frequency domain combined panoramic segmentation convolutional neural network and application thereof |
CN113450828B (en) * | 2021-06-25 | 2024-07-09 | Ping An Technology (Shenzhen) Co., Ltd. | Music genre identification method, device, equipment and storage medium |
CN114118227B (en) * | 2021-11-03 | 2025-03-14 | Tsinghua University | Video Editing Detection Method Based on Frequency Domain Aware Spatiotemporal Self-Attention Transformer Network |
CN114972976B (en) * | 2022-07-29 | 2022-12-20 | 之江实验室 | Night target detection, training method and device based on frequency domain self-attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN115496993A (en) | 2022-12-20 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |