CN109657729B - Image feature fusion, feature map processing and gesture recognition method, device and system
- Publication number
- CN109657729B (application CN201811608178.9A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- feature
- sampling
- layer
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
The invention provides an image feature fusion method, a feature map processing method, a gesture recognition method, and corresponding devices and systems, which relate to the technical field of image processing. The image feature fusion method comprises the following steps: acquiring a feature map to be fused; performing a pixel shift operation on the feature map to be fused to obtain a plurality of offset feature maps; and performing a feature fusion operation on the plurality of offset feature maps and the feature map to be fused to obtain a fused feature map. The invention can effectively improve the comprehensiveness of the feature information covered by the fused feature map.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to a method, a device and a system for image feature fusion, feature map processing and gesture recognition.
Background
Feature maps are used in many tasks in the field of image recognition. However, conventional feature map generation methods often fail to present the feature information in a target image clearly and comprehensively, so the results of tasks executed based on such feature maps are inaccurate. Taking human body posture recognition as an example, a feature map of human body joint points can be generated from a target image, and posture recognition is then carried out based on that feature map. When human joints are occluded in the target image, when the motion amplitude of the human body is too large, or when the posture is strange and rare, the feature information of a feature map obtained in the conventional way can hardly reflect the joint point positions accurately and reliably, so the subsequent posture recognition result is inaccurate.
Disclosure of Invention
In view of the above, the present invention provides an image feature fusion method, a feature map processing method, a gesture recognition method, and corresponding devices and systems, which can improve the comprehensiveness of the feature information covered by a feature map.
In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:
in a first aspect, an embodiment of the present invention provides an image feature fusion method, including: acquiring a feature map to be fused; executing pixel shift operation on the feature graph to be fused to obtain a plurality of shift feature graphs; and performing feature fusion operation on the plurality of offset feature maps and the feature map to be fused to obtain a fusion feature map.
Further, the step of performing a pixel shift operation on the feature map to be fused to obtain a plurality of offset feature maps includes: shifting each pixel of the feature map to be fused in a preset shift direction by a preset pixel shift amount, while keeping the boundary pixel values of the feature map to be fused at their original values, to obtain an offset feature map corresponding to each preset shift direction.
Further, the preset shift directions include a plurality of the following directions: up, down, left, right, upper-left, lower-left, upper-right, and lower-right; the preset pixel shift amount is at least one pixel.
Further, the step of performing a feature fusion operation on the plurality of offset feature maps and the feature map to be fused to obtain a fused feature map includes: performing the feature fusion operation on the plurality of offset feature maps and the feature map to be fused by using a weighted fusion algorithm to obtain the fused feature map; wherein the weights corresponding to different offset feature maps are the same or different, and the weight of the feature map to be fused is greater than the weight of each offset feature map.
In a second aspect, an embodiment of the present invention provides a feature map processing method, including: acquiring a target feature map to be processed; inputting the target feature map into a down-sampling network, performing a down-sampling operation based on the target feature map through the down-sampling network, and performing the image feature fusion method according to any one of the first aspect to obtain a thumbnail feature map; inputting the thumbnail feature map into an up-sampling network, performing the image feature fusion method according to any one of the first aspect based on the thumbnail feature map through the up-sampling network, and performing an up-sampling operation to obtain a first enlarged feature map; and determining the first enlarged feature map as the processed feature map of the target feature map.
Further, the down-sampling network comprises at least one down-sampling sub-network connected in sequence, and each down-sampling sub-network comprises an intermediate down-sampling layer, a first offset layer and a first fusion layer connected in sequence; the output end of the last down-sampling sub-network is further connected with an end down-sampling layer. The input of the intermediate down-sampling layer is a first original feature map, and its output is an intermediate thumbnail feature map; the input of the first offset layer is the intermediate thumbnail feature map, and its output is a plurality of first offset feature maps; the input of the first fusion layer is the intermediate thumbnail feature map and the plurality of first offset feature maps, and its output is a first fused feature map; the input of the end down-sampling layer is the first fused feature map output by the last down-sampling sub-network, and its output is an end thumbnail feature map. When the down-sampling sub-network is the head-end down-sampling sub-network, the first original feature map is the target feature map; when the down-sampling sub-network is not the head-end down-sampling sub-network, the first original feature map is the first fused feature map output by the preceding down-sampling sub-network.
Further, the step of performing a down-sampling operation based on the target feature map through the down-sampling network and performing the image feature fusion method according to any one of the first aspect to obtain a thumbnail feature map includes: for each down-sampling sub-network, performing a down-sampling operation on the first original feature map input to its intermediate down-sampling layer to obtain an intermediate thumbnail feature map; performing a pixel shift operation on the intermediate thumbnail feature map through the first offset layer to obtain a plurality of first offset feature maps; performing a feature fusion operation on the intermediate thumbnail feature map and the plurality of first offset feature maps through the first fusion layer to obtain a first fused feature map; and performing a down-sampling operation on the first fused feature map obtained by the last down-sampling sub-network through the end down-sampling layer to obtain the end thumbnail feature map.
Further, the up-sampling network comprises at least one up-sampling sub-network connected in sequence, and each up-sampling sub-network comprises a second offset layer, a second fusion layer, an up-sampling layer and a superposition layer connected in sequence; each superposition layer is correspondingly connected with an intermediate down-sampling layer in a down-sampling sub-network, and the second offset layer of the head-end up-sampling sub-network is connected to the end down-sampling layer. The input of the second offset layer is a second original feature map, and its output is a plurality of second offset feature maps; the input of the second fusion layer is the second original feature map and the plurality of second offset feature maps, and its output is a second fused feature map; the input of the up-sampling layer is the second fused feature map, and its output is a second enlarged feature map; the input of the superposition layer is the second enlarged feature map and the intermediate thumbnail feature map, and its output is the first enlarged feature map. When the up-sampling sub-network is the head-end up-sampling sub-network, the second original feature map is the end thumbnail feature map output by the end down-sampling layer; when the up-sampling sub-network is not the head-end up-sampling sub-network, the second original feature map is the first enlarged feature map output by the preceding up-sampling sub-network.
Further, the number of up-sampling sub-networks in the up-sampling network is equal to the number of down-sampling sub-networks in the down-sampling network. The step of performing the image feature fusion method according to any one of the first aspect based on the thumbnail feature map through the up-sampling network and performing an up-sampling operation to obtain a first enlarged feature map includes: for each up-sampling sub-network, performing a pixel shift operation on the second original feature map input to the up-sampling sub-network through the second offset layer to obtain a plurality of second offset feature maps; performing a feature fusion operation on the second original feature map and the plurality of second offset feature maps through the second fusion layer to obtain a second fused feature map; performing an up-sampling operation on the second fused feature map by using a bilinear interpolation algorithm through the up-sampling layer to obtain a second enlarged feature map; and superposing the second enlarged feature map and the intermediate thumbnail feature map through the superposition layer to obtain the first enlarged feature map.
In a third aspect, an embodiment of the present invention provides a gesture recognition method, including: acquiring a whole-body image of a target object; generating a target feature map based on the whole-body image; generating a first enlarged feature map of the target feature map by using the feature map processing method according to any one of the second aspects; recognizing the posture of the target object based on the first enlarged feature map.
Further, the step of acquiring a whole-body image of the target object includes: acquiring an image to be identified; detecting a target object contained in the image to be identified, and cropping out a whole-body image of the target object.
Further, the step of generating a target feature map based on the whole-body image includes: scaling the whole-body image to a specified size; and performing feature extraction on the scaled whole-body image to obtain a target feature map.
Further, the step of recognizing the pose of the target object based on the first enlarged feature map comprises: performing a convolution operation on the first enlarged feature map to generate heat maps of the target feature map; and identifying a pose of the target object based on the heat maps.
Further, the step of performing a convolution operation on the first enlarged feature map to generate heat maps of the target feature map includes: acquiring the regions of a plurality of key points to be detected in the target feature map, where the key points to be detected comprise joint points of the target object and/or preset points on a designated part of the target object; and performing a convolution operation on the first enlarged feature map based on the regions of the key points to be detected to obtain a plurality of heat maps, where each key point to be detected corresponds to one heat map.
Further, the step of identifying the pose of the target object based on the heat maps comprises: for each heat map, calculating the response values of the pixels contained in the heat map by using a Gaussian blur method, and taking the pixel with the maximum response value as the key point to be detected corresponding to that heat map; and identifying the pose of the target object according to the key points to be detected corresponding to the heat maps.
Further, the step of recognizing the pose of the target object according to the key points to be detected corresponding to the heat maps includes: acquiring the heat-map coordinates of the key point to be detected corresponding to each heat map; determining the original coordinates of each key point to be detected in the image to be identified based on a preset mapping relation and the heat-map coordinates of each key point to be detected; and recognizing the pose of the target object according to the original coordinates of the key points to be detected in the image to be identified.
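As a rough illustration of this post-processing chain, the following sketch blurs each heat map, takes the maximum-response pixel as the key point, and maps it back to original-image coordinates. It is only a sketch: the use of scipy's gaussian_filter for the Gaussian blur and the stride value standing in for the preset mapping relation are assumptions made here for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def keypoints_from_heatmaps(heatmaps, stride=4.0, sigma=1.0):
    """For each heat map: Gaussian-blur it, take the pixel with the maximum
    response as the key point, and map its heat-map coordinates back to the
    original image via an assumed stride (the preset mapping relation)."""
    keypoints = []
    for hm in heatmaps:                      # one 2-D heat map per key point
        blurred = gaussian_filter(hm, sigma=sigma)
        y, x = np.unravel_index(np.argmax(blurred), blurred.shape)
        keypoints.append((x * stride, y * stride))   # original-image coordinates
    return keypoints
```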
In a fourth aspect, an embodiment of the present invention provides an image feature fusion apparatus, including: a to-be-fused feature map acquisition module, used for acquiring a feature map to be fused; a pixel shift module, used for performing a pixel shift operation on the feature map to be fused to obtain a plurality of offset feature maps; and a feature fusion module, used for performing a feature fusion operation on the plurality of offset feature maps and the feature map to be fused to obtain a fused feature map.
In a fifth aspect, an embodiment of the present invention provides a feature map processing apparatus, including: a target feature map acquisition module, used for acquiring a target feature map; a first execution module, configured to input the target feature map into a down-sampling network, perform a down-sampling operation on the target feature map through the down-sampling network, and perform the image feature fusion method according to any one of the first aspect to obtain a thumbnail feature map; a second execution module, configured to input the thumbnail feature map into an up-sampling network, perform the image feature fusion method according to any one of the first aspect based on the thumbnail feature map through the up-sampling network, and perform an up-sampling operation to obtain a first enlarged feature map; and a feature map determining module, used for determining the first enlarged feature map as the processed feature map of the target feature map.
In a sixth aspect, an embodiment of the present invention provides a gesture recognition apparatus, including: an image acquisition module, used for acquiring a whole-body image of the target object; a feature map generation module, used for generating a target feature map based on the whole-body image; a first enlarged feature map generation module, configured to generate a first enlarged feature map of the target feature map by using the feature map processing method according to any one of the second aspect; and a gesture recognition module, used for recognizing the gesture of the target object based on the first enlarged feature map.
In a seventh aspect, an embodiment of the present invention provides a gesture recognition system, where the system includes: the device comprises an image acquisition device, a processor and a storage device; the image acquisition device is used for acquiring a whole body image of the target object; the storage device has stored thereon a computer program which, when executed by the processor, performs the gesture recognition method of any of the preceding third aspects.
In an eighth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the image feature fusion method according to any one of the foregoing first aspects, or to perform the feature map processing method according to any one of the foregoing second aspects, or to perform the steps of the gesture recognition method according to any one of the foregoing third aspects.
The embodiment of the invention provides an image feature fusion method and apparatus, which can perform a pixel shift operation on a feature map to be fused to obtain a plurality of offset feature maps, and then perform a feature fusion operation on the plurality of offset feature maps and the feature map to be fused to obtain a fused feature map. With the image feature fusion method provided by this embodiment, performing the pixel shift operation and the feature fusion operation on a feature map can effectively improve the comprehensiveness of the feature information covered by the feature map.
The embodiment of the invention provides a feature map processing method and a device, which can firstly input a target feature map into a down-sampling network, and execute down-sampling operation and the image feature fusion method based on the target feature map through the down-sampling network to obtain a thumbnail feature map; and then inputting the thumbnail feature map into an up-sampling network, and executing the image feature fusion method and the up-sampling operation based on the thumbnail feature map through the up-sampling network to obtain a feature map of the processed target feature map. In the feature map processing method provided in this embodiment, the image feature fusion method is performed in both the down-sampling and up-sampling processes of the target feature map, so that the processed feature map can include more comprehensive feature information.
The embodiment of the invention provides a posture recognition method and a posture recognition device, which can firstly acquire a whole-body image of a target object, generate a target feature map based on the whole-body image, then generate a first amplified feature map of the target feature map by adopting a feature map processing method, and further recognize the posture of the target object based on the first amplified feature map. The method provided by the embodiment mainly performs gesture recognition based on the feature map covering more comprehensive feature information, and can effectively improve the accuracy of human gesture recognition.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosed technology.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 2 is a flow chart of an image feature fusion method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a pixel shift operation provided by an embodiment of the invention;
FIG. 4 is a schematic diagram illustrating another pixel shift operation provided by an embodiment of the present invention;
FIG. 5 is a flow chart of a feature map processing method provided by an embodiment of the invention;
FIG. 6 is a schematic structural diagram of a first feature map processing network according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a second feature map processing network according to an embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating a U-shaped network structure provided by an embodiment of the present invention;
FIG. 9 is a flow chart of a gesture recognition method provided by an embodiment of the invention;
FIG. 10 is a block diagram illustrating an image feature fusion apparatus according to an embodiment of the present invention;
FIG. 11 is a block diagram illustrating a feature map processing apparatus according to an embodiment of the present invention;
FIG. 12 is a block diagram illustrating the structure of a gesture recognition apparatus according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Feature maps are used in many tasks in the field of image recognition. However, due to the poor generation manner of the existing feature map, it is difficult for the feature map to fully present the feature information in the target image, so that the result of the task (such as the result of human body posture recognition) executed based on the feature map is not accurate. Based on this, embodiments of the present invention provide an image feature fusion method, a feature map processing method, a gesture recognition method, and a gesture recognition system, and the embodiments of the present invention are described in detail below.
Example one:
First, an example electronic device 100 for implementing the image feature fusion method, feature map processing method, and gesture recognition method, apparatus, and system according to embodiments of the present invention will be described with reference to fig. 1.
As shown in fig. 1, an electronic device 100 includes one or more processors 102, one or more memory devices 104, an input device 106, an output device 108, and an image capture device 110, which are interconnected via a bus system 112 and/or other type of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device may have other components and structures as desired.
The processor 102 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 100 to perform desired functions.
The storage 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the client-side functionality (implemented by the processor) and/or other desired functionality in the embodiments of the invention described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
The image capture device 110 may take images (e.g., photographs, videos, etc.) desired by the user and store the taken images in the storage device 104 for use by other components.
Exemplary electronic devices for implementing an image feature fusion method, a feature map processing method, and a gesture recognition method, apparatus, and system according to embodiments of the present invention may be implemented on a smart terminal such as a smart phone, a tablet computer, a computer, or the like.
Example two:
referring to a flowchart of an image feature fusion method shown in fig. 2, the method specifically includes the following steps:
step S202, acquiring a feature map to be fused. The feature map to be fused mentioned in this embodiment may be a feature map with a resolution size of W × H obtained by performing feature extraction on an original image.
Step S204, executing pixel shift operation on the feature graph to be fused to obtain a plurality of shift feature graphs.
It can be understood that the feature map to be fused with a size of W × H includes W × H pixels, and the value of each pixel is obtained by a convolution operation over the pixels in a neighborhood around the corresponding position; that is, each pixel value represents the feature information of a surrounding local region. Based on this, a pixel shift operation can be performed on the feature map to be fused to obtain a plurality of offset feature maps in preset shift directions. Because each pixel of an offset feature map is obtained by shifting the corresponding pixel in the feature map to be fused, feature information of the nearby surrounding area can be gathered for the pixel at each specific position in the feature map to be fused, so the comprehensiveness of the feature information covered by the feature map to be fused can be effectively improved.
And step S206, performing feature fusion operation on the plurality of offset feature maps and the feature map to be fused to obtain a fusion feature map.
In an actual application scenario, in order to extract the useful information in each feature map to the maximum extent and finally synthesize a feature map containing rich feature information, thereby improving the accuracy and reliability of the feature information covered by the feature map, this embodiment performs a feature fusion operation on the plurality of offset feature maps and the feature map to be fused to obtain a fused feature map. In a specific implementation, during the feature fusion operation, weights may be set for the feature map to be fused and for the offset feature maps corresponding to the respective shift directions, so as to realize a weighted fusion of the maps. The weights may be determined according to actual requirements; for example, if the feature map to be fused should occupy a greater proportion of the fusion result, its weight may be made greater than the weight of each offset feature map.
According to the image feature fusion method provided by the embodiment of the invention, a pixel shift operation can first be performed on the feature map to be fused to obtain a plurality of offset feature maps; a feature fusion operation is then performed on the plurality of offset feature maps and the feature map to be fused to obtain a fused feature map. With the image feature fusion method provided by this embodiment, performing the pixel shift operation and the feature fusion operation on a feature map can effectively improve the comprehensiveness of the feature information covered by the feature map.
Further, in this embodiment, a specific implementation of obtaining the plurality of offset feature maps may be performed with reference to the following step: shifting each pixel of the feature map to be fused in a preset shift direction by a preset pixel shift amount, while keeping the boundary pixel values of the feature map to be fused at their original values, to obtain an offset feature map corresponding to each preset shift direction.
In some embodiments, the preset offset direction includes a plurality of directions of up, down, left, right, up-left, down-left, up-right, and down-right, and the preset pixel offset amount is at least one pixel.
It can be understood that different offset feature maps are obtained for different preset shift directions. The generation of the offset feature maps is explained as follows. For convenience of illustration, fig. 3 in this embodiment takes a 3 × 3 feature map to be fused as the example of the pixel shift operation. First, each pixel in the 3 × 3 feature map to be fused is shifted upwards by a preset pixel shift amount; the preset pixel shift amount may be one or more pixels, and shifting by one pixel is taken as the example here. The lower boundary pixel values of the 3 × 3 feature map to be fused are kept at their original values, yielding an up-offset feature map corresponding to the up direction. The up-offset feature map can represent richer feature information in the up direction of the feature map to be fused. Through the same pixel shift operation, a down-offset feature map corresponding to the down direction, a left-offset feature map corresponding to the left direction, and a right-offset feature map corresponding to the right direction are obtained respectively. At this point there are five feature maps with a resolution of 3 × 3 in total: the feature map to be fused, the up-offset feature map, the down-offset feature map, the left-offset feature map and the right-offset feature map. The four offset feature maps can represent richer feature information of the feature map to be fused in the four directions of up, down, left and right.
On the basis of this generation of offset feature maps, an upper-left offset feature map, a lower-left offset feature map, an upper-right offset feature map and a lower-right offset feature map can further be derived. For ease of understanding, the generation of the upper-left offset feature map is taken as an example; referring to fig. 4, each pixel in the 3 × 3 feature map to be fused may first be shifted upwards by the preset pixel shift amount, keeping the lower boundary pixel values of the 3 × 3 feature map to be fused at their original values, to obtain the up-offset feature map corresponding to the up direction; each pixel in the up-offset feature map is then shifted to the left by the preset pixel shift amount, keeping the right boundary pixel values of the up-offset feature map at their original values, to obtain the upper-left offset feature map. Of course, the feature map to be fused may also be shifted to the left first to obtain the left-offset feature map, and the left-offset feature map then shifted upwards to obtain the upper-left offset feature map; the order of the two shift directions does not affect the final result. Similarly, a lower-left offset feature map, an upper-right offset feature map and a lower-right offset feature map can be obtained, which are not described again here. At this point there are nine feature maps with a resolution of 3 × 3, namely: the feature map to be fused, and the up-offset, down-offset, left-offset, right-offset, upper-left offset, lower-left offset, upper-right offset and lower-right offset feature maps. The eight offset feature maps can represent richer feature information of the feature map to be fused in the eight directions of up, down, left, right, upper-left, lower-left, upper-right and lower-right. It can be seen that eight offset feature maps enrich the feature information representing the feature map more comprehensively than four offset feature maps do.
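The shift-with-boundary-kept behaviour described above can be sketched as follows; this is a minimal NumPy illustration only, where the function name and the 3 × 3 array mirror the example of fig. 3 and fig. 4 and are not part of the original text:

```python
import numpy as np

def shift_with_border_kept(fmap, dy, dx):
    """Shift a 2-D feature map by (dy, dx); the boundary rows/columns that the
    shift would leave empty keep their original values."""
    h, w = fmap.shape
    shifted = fmap.copy()                       # boundary pixels stay at original values
    src_y = slice(max(0, -dy), h - max(0, dy))
    dst_y = slice(max(0, dy), h - max(0, -dy))
    src_x = slice(max(0, -dx), w - max(0, dx))
    dst_x = slice(max(0, dx), w - max(0, -dx))
    shifted[dst_y, dst_x] = fmap[src_y, src_x]
    return shifted

fmap  = np.arange(9, dtype=float).reshape(3, 3)   # the 3 x 3 example above
up    = shift_with_border_kept(fmap, -1, 0)       # shift up, bottom row kept
down  = shift_with_border_kept(fmap,  1, 0)
left  = shift_with_border_kept(fmap,  0, -1)
right = shift_with_border_kept(fmap,  0, 1)
upper_left = shift_with_border_kept(up, 0, -1)    # shift up, then left (order is irrelevant)
```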
Further, a specific implementation of obtaining the fused feature map in this embodiment may be performed with reference to the following step: performing a feature fusion operation on the plurality of offset feature maps and the feature map to be fused by using a weighted fusion algorithm to obtain the fused feature map, where the weights corresponding to different offset feature maps are the same or different. Because the feature information represented by the feature map to be fused is the original, real information, in order to ensure that the fused feature map stays as close as possible to the original feature map, the weight of the feature map to be fused can be made greater than the weight of each offset feature map.
Based on the weights of the feature map to be fused and of each offset feature map, the weighted fusion algorithm shown in the following formula (1) can be used to perform the feature fusion operation on the feature map to be fused and the offset feature maps with a resolution of 3 × 3, to obtain a fused feature map with a resolution of 3 × 3:
A* = α·A + Σᵢ (βᵢ·A_offset,i)    (1)
where A* is the fused feature map, A is the feature map to be fused, α is the weight of the feature map to be fused, A_offset,i is the offset feature map in shift direction i, βᵢ is the weight of that offset feature map, i ranges over the preset shift directions, and α > βᵢ.
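Continuing the NumPy sketch above, formula (1) can be written directly as a weighted sum; the concrete weight values (α = 0.5, βᵢ = 0.125) are illustrative assumptions only and are not specified in the original text:

```python
def weighted_fuse(to_fuse, offset_maps, alpha, betas):
    """Formula (1): A* = alpha * A + sum_i(beta_i * A_offset_i)."""
    fused = alpha * to_fuse
    for beta_i, offset_map in zip(betas, offset_maps):
        fused = fused + beta_i * offset_map
    return fused

# fuse the map with its four axis-aligned offset maps; alpha > every beta_i
fused = weighted_fuse(fmap, [up, down, left, right], alpha=0.5, betas=[0.125] * 4)
```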
In summary, the image feature fusion method provided by the embodiment of the present invention can effectively improve the comprehensiveness of the feature information covered by the feature map by performing the pixel shift operation and the feature fusion operation on the feature map.
Example three:
based on the image feature fusion method provided in the previous embodiment, the embodiment of the present invention further provides a feature map processing method flowchart as shown in fig. 5, and the method specifically includes the following steps:
step S502, a target characteristic diagram to be processed is obtained. The target feature map mentioned in this embodiment may be generated by the neural network based on the target image including the target object, for example, the target image including the target object is input into a feature extraction network such as a convolutional neural network Resnet, and the feature extraction network performs feature extraction on the target image to obtain the target feature map of the target image. In practical applications, the target object may be a human being, but may also be other objects such as an animal. In some embodiments, the target image may contain only one target object, and in other embodiments, the target image may contain multiple target objects simultaneously. If the target image only contains one target object and the image collected by the camera contains a plurality of target objects, the collected image can be cut to obtain the target image corresponding to each target object.
Step S504, inputting the target feature map into the down-sampling network, performing down-sampling operation based on the target feature map through the down-sampling network, and performing the image feature fusion method as provided in the second embodiment to obtain the thumbnail feature map.
Specifically, the acquired target feature map may be input to a down-sampling network, and the target feature map may be first down-sampled by the down-sampling network to an intermediate thumbnail feature map of a specified resolution. For example, for a target feature map with a resolution size of W × H, s-fold down-sampling is performed to obtain an intermediate thumbnail feature map with a resolution size of (W/s) × (H/s), where s is a common divisor of W and H.
The image feature fusion method provided in the second embodiment is performed on the intermediate thumbnail feature graph, that is, the intermediate thumbnail feature graph is used as a feature graph to be fused, a pixel shift operation is performed on the intermediate thumbnail feature graph to obtain a plurality of first shift feature graphs, and a feature fusion operation is performed on the plurality of first shift feature graphs and the intermediate thumbnail feature graph to obtain a first fusion feature graph. For details of the process of performing the pixel shift operation and the feature fusion operation, please refer to embodiment two, which is not repeated herein.
Step S506, inputting the thumbnail feature map into an upsampling network, executing the image feature fusion method provided in the second embodiment based on the thumbnail feature map through the upsampling network, and executing an upsampling operation to obtain a first enlarged feature map.
Specifically, the image feature fusion method may be first performed on the thumbnail feature maps input thereto through the upsampling network, and then the upsampling operation may be performed. The image feature fusion method performed by the up-sampling network is the same as the image feature fusion method performed by the down-sampling network (i.e., the image feature fusion method in the second embodiment), and is not described herein again.
The up-sampling operation referred to in this embodiment may be understood as an operation of enlarging the thumbnail feature map, that is, new pixel values are interpolated between the existing pixels of the thumbnail feature map using an interpolation algorithm such as bilinear interpolation. Through the up-sampling operation, the thumbnail feature map can be restored to a first enlarged feature map of higher resolution.
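As a small sketch of this enlargement step (assuming PyTorch tensors in (batch, channel, height, width) layout; the channel count of 256 and the 2× scale factor are assumptions matching the resolutions discussed in the following embodiment):

```python
import torch
import torch.nn.functional as F

thumb = torch.randn(1, 256, 8, 6)   # an 8 x 6 thumbnail feature map with 256 channels
enlarged = F.interpolate(thumb, scale_factor=2, mode="bilinear", align_corners=False)
print(enlarged.shape)               # torch.Size([1, 256, 16, 12])
```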
Step S508, determining the first enlarged feature map as the feature map after the target feature map is processed.
According to the feature map processing method provided by the embodiment of the invention, the target feature map can be input into a down-sampling network, and a down-sampling operation and image feature fusion method is executed based on the target feature map through the down-sampling network to obtain a thumbnail feature map; and then inputting the thumbnail feature map into an up-sampling network, and executing an image feature fusion method and an up-sampling operation based on the thumbnail feature map through the up-sampling network to obtain a feature map of the processed target feature map. In the feature map processing method provided in this embodiment, the image feature fusion method provided in the foregoing embodiment is performed in both the processes of down-sampling and up-sampling the target feature map, so that the processed feature map can include more comprehensive feature information.
For ease of understanding, this embodiment first gives the basic structure of a feature map processing network; referring to the schematic structural diagram of the first feature map processing network shown in fig. 6, the basic steps of processing a feature map are described below with reference to fig. 6. The feature map processing network mainly comprises a down-sampling network and an up-sampling network; further, as shown in fig. 6, a gesture recognition network is connected after the up-sampling network. The gesture recognition network is an example of a network that uses the first enlarged feature map output by the up-sampling network; the down-sampling network and the up-sampling network are connected in a U shape. The down-sampling network comprises at least one down-sampling sub-network connected in sequence, each sub-network comprising only one down-sampling layer for performing a down-sampling operation. The up-sampling network comprises at least one up-sampling sub-network connected in sequence, each up-sampling sub-network comprising only one up-sampling layer for performing an up-sampling operation.
The down-sampling sub-networks can be realized by a convolutional neural network, and a down-sampling layer can be realized by multiple convolutional layers. The down-sampling network passes the target feature map input to it through the successive down-sampling sub-networks, thereby generating intermediate thumbnail feature maps of different resolutions. The down-sampling network shown in fig. 6 comprises four down-sampling sub-networks; it can be understood that the down-sampling network comprises four down-sampling layers. The input target feature map passes through the four down-sampling layers, generating four intermediate thumbnail feature maps of different resolutions. For example, if the resolution of the original feature map input to the head-end down-sampling layer is 256 × 192, the down-sampling operation performed by the head-end down-sampling layer generates a feature map with a resolution of 64 × 48; the down-sampling operation performed by the second down-sampling layer on the 64 × 48 feature map generates a feature map with a resolution of 32 × 24; the third down-sampling layer generates a feature map with a resolution of 16 × 12; and the fourth down-sampling layer generates a feature map with a resolution of 8 × 6.
The down-sampling network inputs the finally output feature map with a resolution of 8 × 6 into the up-sampling network, and the up-sampling network performs up-sampling operations on the 8 × 6 feature map through several up-sampling sub-networks using a bilinear interpolation algorithm, thereby generating enlarged feature maps of different resolutions. The up-sampling network shown in fig. 6 comprises three up-sampling sub-networks; it can be understood that the up-sampling network comprises three up-sampling layers. The input 8 × 6 feature map passes through the three up-sampling layers, which produce three enlarged feature maps of different resolutions. For example, the up-sampling operation performed on the 8 × 6 feature map by the head-end up-sampling layer generates a feature map with a resolution of 16 × 12, the up-sampling operation performed on the 16 × 12 feature map by the second up-sampling layer generates a feature map with a resolution of 32 × 24, and the up-sampling operation performed on the 32 × 24 feature map by the third up-sampling layer generates a feature map with a resolution of 64 × 48.
In addition, during the up-sampling process of the up-sampling network, in order to reduce the error between the feature maps generated during down-sampling and the feature maps generated during up-sampling, the feature maps of different resolutions generated by the down-sampling network also need to be superposed with the feature maps of matching resolutions generated by the up-sampling network. A specific implementation is as follows: the up-sampling network up-samples the 8 × 6 feature map output by the down-sampling network and outputs a 16 × 12 feature map, which is superposed with the 16 × 12 feature map output by the third down-sampling layer of the down-sampling network; the superposed 16 × 12 feature map enters the second up-sampling layer of the up-sampling network, which up-samples it and outputs a 32 × 24 feature map; this 32 × 24 feature map is superposed with the 32 × 24 feature map output by the second down-sampling layer, and the superposed 32 × 24 feature map enters the third up-sampling layer; and so on, until the up-sampling network finally outputs a feature map with a resolution of 64 × 48, which is input to the gesture recognition network.
The 64 × 48 feature map input to the gesture recognition network passes through one convolutional layer, after which a gesture recognition result can be generated.
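The data flow of fig. 6 can be summarized by the following sketch; it is only an illustration under assumptions (stride-2/stride-4 convolutions for the down-sampling layers, a fixed channel count, and bilinear interpolation for the up-sampling layers), since fig. 6 only fixes the resolutions and the superposition pattern:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlainUNetwork(nn.Module):
    """Fig. 6 style: four down-sampling layers, three up-sampling layers, and
    superposition (addition) of feature maps of matching resolution."""
    def __init__(self, channels=256):
        super().__init__()
        strides = [4, 2, 2, 2]          # 256x192 -> 64x48 -> 32x24 -> 16x12 -> 8x6
        self.down_layers = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=s, padding=1) for s in strides)

    def forward(self, x):                # x: the 256 x 192 target feature map
        skips = []
        for layer in self.down_layers:
            x = layer(x)
            skips.append(x)
        skips.pop()                      # the 8 x 6 map is the up-sampling input, not a skip
        for skip in reversed(skips):     # 16x12, then 32x24, then 64x48
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = x + skip                 # superposition with the matching down-sampled map
        return x                         # the 64 x 48 map fed to the recognition network
```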
On the basis of fig. 6, this embodiment provides a schematic structural diagram of a second characteristic diagram processing network as shown in fig. 7. The difference with fig. 6 is that the down-sampling network in fig. 7 comprises at least one down-sampling sub-network connected in sequence; fig. 7 shows only one downsampling subnetwork in a dashed frame, and fig. 7 specifically includes 3 downsampling subnetworks; in practical application, the number of the down-sampling sub-networks can be flexibly set according to requirements. Each down-sampling sub-network comprises a down-sampling layer and a Merge layer which are connected in sequence, and it can also be understood that a Merge layer is additionally arranged between the down-sampling layers, and the Merge layer is used for executing the image feature fusion method provided by the second embodiment. Similarly, the upsampling network includes at least one upsampling sub-network connected in sequence, only one upsampling sub-network is outlined by a dashed line in fig. 7 as an illustration, and fig. 7 specifically includes 3 upsampling sub-networks; in practical application, the number of the up-sampling sub-networks can be flexibly set according to requirements. Each up-sampling sub-network comprises a Merge layer, an up-sampling layer and an overlapping layer which are sequentially connected. Generally, the number of upsampling sub-networks is the same as the number of downsampling sub-networks.
Compared with fig. 6, fig. 7 can clearly illustrate that a Merge layer for performing the image feature fusion method is added.
In a specific embodiment, referring to a U-type network structure diagram shown in fig. 8, a down-sampling network includes at least one down-sampling sub-network connected in sequence, each down-sampling sub-network includes an intermediate down-sampling layer, a first offset layer and a first fusion layer connected in sequence; wherein the output of the last down-sampling sub-network is further connected to an end down-sampling layer. The first bias layer and the first fusion layer are combined to be equivalent to the Merge layer. The illustration in fig. 8, showing only three down-sampling sub-networks and an end down-sampling layer connected to the output of the third down-sampling sub-network, is merely symbolic, and should not be seen as limiting.
For ease of understanding, the various network layers that comprise the downsampling subnetwork are first explained as follows: the input of the middle down-sampling layer is a first original feature map, and the output of the middle down-sampling layer is a middle thumbnail feature map; the input of the first offset layer is a middle thumbnail feature map, and the output of the first offset layer is a plurality of first offset feature maps; the input of the first fusion layer is a middle thumbnail feature map and a plurality of first offset feature maps, and the output of the first fusion layer is a first fusion feature map; the input of the terminal down-sampling layer is the first fused feature map output by the last down-sampling sub-network, and the output of the terminal down-sampling layer is the terminal abbreviated feature map. When the down-sampling sub-network is the down-sampling sub-network at the head end, the first original feature map is a target feature map; when the down-sampling sub-network is a non-head end down-sampling sub-network, the first original feature map is a first fused feature map output by a previous level down-sampling sub-network.
In a specific embodiment, referring to fig. 8, the upsampling network includes at least one upsampling sub-network connected in sequence, and each upsampling sub-network includes a second offset layer, a second fusion layer, an upsampling layer and an overlay layer connected in sequence; the second bias layer and the second fusion layer are combined to be equivalent to the Merge layer. Each superposition layer is correspondingly connected with a middle down-sampling layer in a down-sampling sub-network; the second offset layer of the head-end upsampling subnetwork is connected to the tail-end downsampling layer. Three upsampling sub-networks are shown symbolically in fig. 8 only and should not be considered as limiting.
For ease of understanding, the various network layers that comprise the upsampling sub-network are first explained as follows:
the input of the second offset layer is a second original characteristic diagram, and the output of the second offset layer is a plurality of second offset characteristic diagrams; the input of the second fusion layer is a second original feature map and a plurality of second offset feature maps, and the output of the second fusion layer is a second fusion feature map; the input of the upper sampling layer is a second fusion characteristic diagram, and the output of the upper sampling layer is a second amplification characteristic diagram; the input of the superposition layer is a second amplified characteristic diagram and a middle thumbnail characteristic diagram, and the output of the superposition layer is a first amplified characteristic diagram; when the up-sampling sub-network is the up-sampling sub-network at the head end, the second original feature map is an end thumbnail feature map output by an end down-sampling layer; when the up-sampling sub-network is a non-head end up-sampling sub-network, the second original feature map is a first amplified feature map output by a previous level up-sampling sub-network.
Based on the down-sampling network provided above, a specific implementation of performing the down-sampling operation based on the target feature map through the down-sampling network and performing the image feature fusion method of the second embodiment to obtain the thumbnail feature map can be carried out with reference to the following steps; it should be understood that these steps are performed for each down-sampling sub-network:
step 1, a down-sampling operation is carried out on a first original feature map input by a down-sampling layer to obtain an intermediate thumbnail feature map;
step 2, pixel offset operation is carried out on the intermediate thumbnail characteristic graphs through the first offset layer, and a plurality of first offset characteristic graphs are obtained;
step 3, performing feature fusion operation on the intermediate thumbnail feature map and the plurality of first offset feature maps by adopting a weighted fusion algorithm through the first fusion layer to obtain a first fusion feature map;
and 4, performing down-sampling operation on the first fusion feature map obtained by the last down-sampling sub-network through the terminal down-sampling layer to obtain a terminal thumbnail feature map.
The following describes an exemplary implementation of the above steps 1 to 4 with reference to fig. 8: the first raw feature map input to the head-end down-sampling subnetwork is the target feature map, assuming the target feature map has a resolution of 256 × 192. The intermediate down-sampling layer of the head-end down-sampling sub-network performs down-sampling on the 256 × 192 target feature map, outputs an intermediate thumbnail feature map with a resolution of 64 × 48, and inputs the 64 × 48 intermediate thumbnail feature map to the first shifting layer and the up-sampling network connected thereto, respectively.
In step 2, the first shifting layer performs a pixel shifting operation on the intermediate thumbnail feature maps of 64 × 48 according to a preset shifting rule to obtain a plurality of first shifting feature maps. As can be seen from the description of the pixel shift operation in the second embodiment, the pixel shift operation can be understood as follows: firstly, shifting each pixel of a 64-by-48 intermediate thumbnail feature map to a preset shifting direction by a preset pixel shifting amount, and keeping a boundary pixel value of the intermediate thumbnail feature map at an original value to obtain a first shifting feature map corresponding to each preset shifting direction; the preset offset direction comprises multiple types of directions, namely an upper direction, a lower direction, a left direction, a right direction, an upper left direction, a lower left direction, an upper right direction and a lower right direction, and the preset pixel offset is at least one pixel.
In step 3, the weighted fusion algorithm of the second embodiment is used to perform a feature fusion operation on the 64 × 48 intermediate thumbnail feature map and the plurality of first offset feature maps to obtain a first fused feature map with a resolution of 64 × 48; the specific process is not repeated here.
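As a rough sketch of the weighted fusion, the numeric weights below are illustrative placeholders only; the document only requires that the feature map to be fused carries a larger weight than each offset feature map.

```python
def weighted_fusion(base, shifted_maps, base_weight=0.5, shift_weight=0.0625):
    """Weighted fusion of a feature map with its offset copies.

    The base map is given a larger weight than each offset map; with eight
    offset maps, 0.5 + 8 * 0.0625 sums to 1, but the exact values are
    assumptions for illustration.
    """
    fused = base_weight * base
    for m in shifted_maps:
        fused = fused + shift_weight * m
    return fused
```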
The 64 × 48 first fused feature map output by the first fusion layer of the head-end down-sampling sub-network enters the second down-sampling sub-network. The second down-sampling sub-network performs a down-sampling operation, through its intermediate down-sampling layer, on the 64 × 48 first fused feature map input to it to obtain a 32 × 24 intermediate thumbnail feature map, and inputs the 32 × 24 intermediate thumbnail feature map to the first offset layer and to the up-sampling sub-network connected thereto, respectively; the first offset layer performs a pixel shift operation on the 32 × 24 intermediate thumbnail feature map to obtain a plurality of 32 × 24 first offset feature maps; the first fusion layer then performs a feature fusion operation on the 32 × 24 intermediate thumbnail feature map and the plurality of first offset feature maps by using the weighted fusion algorithm to obtain a 32 × 24 first fused feature map. This continues until step 4, in which the terminal down-sampling layer outputs a terminal thumbnail feature map with a resolution of 8 × 6 and inputs it to the up-sampling network; in a specific implementation, the terminal down-sampling layer inputs the 8 × 6 terminal thumbnail feature map to the second offset layer of the head-end up-sampling sub-network.
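For reference, one down-sampling sub-network as described above might be sketched as the following module, reusing the `pixel_shift`/`make_shifted_maps` and `weighted_fusion` helpers sketched earlier. The stride-2 convolution standing in for the intermediate down-sampling layer is an assumption (the example above reduces 256 × 192 to 64 × 48 in its first stage, i.e. by a factor of 4, which this generic sketch does not reproduce).

```python
import torch.nn as nn

class DownSampleSubNetwork(nn.Module):
    """Intermediate down-sampling layer -> first offset layer -> first
    fusion layer, as a hedged sketch of one down-sampling sub-network."""

    def __init__(self, channels):
        super().__init__()
        # Assumption: a stride-2 convolution plays the down-sampling layer.
        self.down = nn.Conv2d(channels, channels, kernel_size=3,
                              stride=2, padding=1)

    def forward(self, x):
        thumb = self.down(x)                     # intermediate thumbnail feature map
        shifted = make_shifted_maps(thumb)       # first offset feature maps
        fused = weighted_fusion(thumb, shifted)  # first fused feature map
        # The thumbnail map is also handed to the matching superposition
        # layer of the up-sampling network, so return both.
        return fused, thumb
```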
Further, in this embodiment, a specific implementation of performing the image feature fusion method of the second embodiment on the thumbnail feature map through the up-sampling network, and performing the up-sampling operation, to obtain the first enlarged feature map may refer to the following steps A to D, which are performed for each up-sampling sub-network:
Step A, performing a pixel shift operation, through the second offset layer, on the second original feature map input to the up-sampling sub-network to obtain a plurality of second offset feature maps.
Step B, performing a feature fusion operation on the second original feature map and the plurality of second offset feature maps through the second fusion layer to obtain a second fused feature map.
Step C, performing an up-sampling operation on the second fused feature map through the up-sampling layer by using a bilinear interpolation algorithm to obtain a second enlarged feature map.
Step D, superposing the second enlarged feature map and the intermediate thumbnail feature map through the superposition layer to obtain a first enlarged feature map.
For ease of understanding, a specific implementation of the above steps A to D is described below with reference to fig. 8. As shown in fig. 8, the second original feature map input to the second offset layer of the head-end up-sampling sub-network is the 8 × 6 terminal thumbnail feature map output by the terminal down-sampling layer, and the second offset layer performs a pixel shift operation on the 8 × 6 terminal thumbnail feature map to obtain a plurality of 8 × 6 second offset feature maps; the second fusion layer of the head-end up-sampling sub-network performs a feature fusion operation on the 8 × 6 terminal thumbnail feature map and the second offset feature maps by using the weighted fusion algorithm to obtain an 8 × 6 second fused feature map; the up-sampling layer performs an up-sampling operation on the 8 × 6 second fused feature map by using a bilinear interpolation algorithm to obtain a second enlarged feature map with a resolution of 16 × 12; the superposition layer superposes the 16 × 12 intermediate thumbnail feature map output by the intermediate down-sampling layer of the third down-sampling sub-network and the 16 × 12 second enlarged feature map output by the up-sampling layer to obtain a 16 × 12 first enlarged feature map. The second offset layer of the second up-sampling sub-network then performs a pixel shift operation on the 16 × 12 first enlarged feature map to obtain a plurality of 16 × 12 second offset feature maps, and so on, until the third up-sampling sub-network reaches step D, in which its superposition layer superposes the 64 × 48 second enlarged feature map with the 64 × 48 intermediate thumbnail feature map to obtain a 64 × 48 first enlarged feature map; this 64 × 48 first enlarged feature map is input to the gesture recognition network.
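A matching sketch of one up-sampling sub-network (second offset layer, second fusion layer, up-sampling layer and superposition layer) might look as follows; treating the superposition layer as an element-wise addition of the two maps is an assumption, as is the fixed ×2 bilinear scale factor.

```python
import torch.nn as nn
import torch.nn.functional as F

class UpSampleSubNetwork(nn.Module):
    """Hedged sketch of one up-sampling sub-network."""

    def forward(self, x, skip_thumb):
        shifted = make_shifted_maps(x)            # second offset feature maps
        fused = weighted_fusion(x, shifted)       # second fused feature map
        # Up-sampling layer: bilinear interpolation (e.g. 8x6 -> 16x12).
        enlarged = F.interpolate(fused, scale_factor=2, mode='bilinear',
                                 align_corners=False)
        # Superposition layer: combine with the intermediate thumbnail map
        # from the matching down-sampling sub-network (assumed element-wise add).
        return enlarged + skip_thumb
```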
In summary, in the feature map processing method provided in this embodiment, the image feature fusion method is performed in both the down-sampling and up-sampling processes of the target feature map, so that the processed feature map can include more comprehensive feature information.
Embodiment four:
Based on the first enlarged feature map obtained by the feature map processing method provided in the third embodiment, an embodiment of the present invention further provides a gesture recognition method, a flowchart of which is shown in fig. 9; the method specifically includes the following steps:
in step S902, a whole-body image of the target object is acquired.
The whole-body image can be directly obtained (such as a whole-body image directly uploaded by a user), or an image to be identified containing a target object can be obtained through an image acquisition device such as a camera; then, a target object included in the image to be recognized is detected, and a whole-body image of the target object is intercepted.
In a specific implementation, the image to be recognized can be processed with a top-down detection method, for example through a Megdet network, to detect the target objects contained in the image to be recognized; the image to be recognized is then cropped around each target object into a plurality of whole-body images. For example, if there are three pedestrians in the image to be recognized, three small images each containing the whole body of one pedestrian are cropped from it. The top-down approach can be understood as decomposing a complex large problem (namely the image to be recognized) into relatively simple small problems (namely the small images), finding the crux of each small problem, and then describing each problem qualitatively and quantitatively; its core idea is "decomposition".
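As an illustrative sketch of the cropping step only (the detector itself is not sketched), assuming the detector returns axis-aligned boxes in pixel coordinates; the function and argument names here are assumptions:

```python
def crop_whole_body_images(image, boxes):
    """Cut one whole-body sub-image per detected person.

    `image` is an H x W x C numpy-style array and `boxes` a list of
    (x1, y1, x2, y2) pixel coordinates produced by a person detector.
    """
    crops = []
    for x1, y1, x2, y2 in boxes:
        crops.append(image[int(y1):int(y2), int(x1):int(x2)].copy())
    return crops
```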
In step S904, a target feature map is generated based on the whole-body image.
In a specific implementation, the whole-body image may be scaled to a specified size (e.g., 256 × 192), and the scaled whole-body image is then input into a convolutional neural network such as a Resnet network for feature extraction to obtain the target feature map.
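A minimal sketch of this step, assuming OpenCV for resizing and torchvision's ResNet-50 as a stand-in backbone (the text only says "a convolutional neural network such as a Resnet network"); taking the output of the first residual stage is an assumption that happens to reproduce the 64 × 48 resolution used in the example above.

```python
import cv2
import torch
from torchvision.models import resnet50

def build_target_feature_map(whole_body_image, backbone=None):
    """Scale a whole-body crop to 256 x 192 and extract a target feature map."""
    img = cv2.resize(whole_body_image, (192, 256))   # cv2 takes (width, height)
    x = torch.from_numpy(img).float().permute(2, 0, 1).unsqueeze(0) / 255.0
    if backbone is None:
        backbone = resnet50()                        # untrained stand-in backbone
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    return backbone.layer1(x)                        # e.g. a 64 x 48 target feature map
```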
Step S906, generating a first enlarged feature map of the target feature map by using the feature map processing method. Specifically, the target feature map usually contains a plurality of key points to be detected, where the key points to be detected may include joint points of the target object, such as a knee, an ankle and a shoulder joint, and/or preset points on a specified portion of the target object, such as a point on a calf, a point on the back, and the like.
In step S908, the posture of the target object is recognized based on the first enlarged feature map.
Specifically, thermodynamic diagrams of the target feature map may be generated based on the first enlarged feature map, and the posture of the target object is then recognized based on the thermodynamic diagrams; the coordinates of the key points to be detected can be determined from the thermodynamic diagrams, so that the posture of the target object is recognized based on the key points to be detected.
The human body posture recognition method provided by this embodiment first acquires a whole-body image of a target object, generates a target feature map based on the whole-body image, then generates a first enlarged feature map of the target feature map by using the feature map processing method, and further recognizes the posture of the target object based on the first enlarged feature map. Because the first enlarged feature map covers more comprehensive feature information, the method can effectively improve the accuracy of human body posture recognition.
After the first enlarged feature map of the target feature map is generated by using the feature map processing method provided in the foregoing embodiment, a convolution operation may be performed on the first enlarged feature map to generate thermodynamic diagrams of the target feature map; a thermodynamic diagram is a diagram showing the location of a point of interest in a special form such as a highlight or a color. The thermodynamic diagrams obtained in this embodiment can clearly identify the points of interest, such as the joint point positions or other key point positions (such as the midpoint of the lower arm) of the target object. The posture of the target object is then recognized based on the thermodynamic diagrams; when generating the thermodynamic diagrams based on the first enlarged feature map, a corresponding thermodynamic diagram can be generated for each key point to be detected, that is, each thermodynamic diagram represents only one key point.
Based on this, this embodiment provides a specific implementation of performing a convolution operation on the first enlarged feature map to generate a thermodynamic diagram of the target feature map, which may be performed by referring to the following steps:
First, the regions where a plurality of key points to be detected are located in the target feature map are acquired; the key points to be detected comprise joint points of the target object and/or preset points on a designated part of the target object;
Then, a convolution operation is performed on the first enlarged feature map based on the regions of the plurality of key points to be detected to obtain a plurality of thermodynamic diagrams, where each key point to be detected corresponds to one thermodynamic diagram. That is, the regions of the n (n ≥ 1) key points to be detected are input into a convolutional neural network as parameters, the convolutional neural network performs a convolution operation on the first enlarged feature map based on the regions of the key points to be detected and outputs n thermodynamic diagrams, and each key point to be detected corresponds to one thermodynamic diagram.
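The convolution producing one thermodynamic diagram per key point could be sketched as a simple 1 × 1 convolutional head, as below; this sketch does not model the key-point regions that the text feeds in as parameters, so it is only an illustrative approximation with assumed names.

```python
import torch.nn as nn

class HeatmapHead(nn.Module):
    """Turn the first enlarged feature map into one heat map per key point."""

    def __init__(self, in_channels, num_keypoints):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_keypoints, kernel_size=1)

    def forward(self, enlarged_feature_map):
        # Output shape: (N, num_keypoints, H, W), one map per key point.
        return self.conv(enlarged_feature_map)
```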
Further, this embodiment provides a specific implementation for recognizing the posture of the target object based on the thermodynamic diagrams, which may refer to the following steps:
Step 1, for each thermodynamic diagram, calculating the response values of the pixel points contained in the thermodynamic diagram by using a Gaussian blur method, and taking the pixel point with the maximum response value as the key point to be detected corresponding to that thermodynamic diagram;
Step 2, recognizing the posture of the target object according to the key points to be detected corresponding to the thermodynamic diagrams. In a specific implementation of this step, the thermodynamic diagram coordinates of the key point to be detected corresponding to each thermodynamic diagram are first acquired; the original coordinates of each key point to be detected in the image to be recognized are then determined based on a preset mapping relationship and the thermodynamic diagram coordinates of each key point to be detected; finally, the posture of the target object is recognized according to the original coordinates of the key points to be detected in the image to be recognized.
In other words, after the thermodynamic diagram of each key point to be detected is Gaussian-blurred, the point with the maximum response value in each thermodynamic diagram is taken as the predicted coordinate of the corresponding joint point, this coordinate is mapped back to the coordinates of the image to be recognized, and the posture of the target object is represented by these image coordinates.
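A minimal sketch of this decoding step, assuming OpenCV's Gaussian blur and a simple per-axis scale factor standing in for the preset mapping relationship; the function name and arguments are assumptions.

```python
import cv2
import numpy as np

def decode_keypoints(heatmaps, scale_x, scale_y, sigma=2.0):
    """Blur each heat map, take its maximum-response pixel as the key point,
    and map the coordinate back to the image to be recognized.

    `heatmaps` is a list of (H, W) numpy arrays; `scale_x`/`scale_y` are an
    assumed stand-in for the preset mapping relationship.
    """
    keypoints = []
    for hm in heatmaps:
        blurred = cv2.GaussianBlur(hm, (5, 5), sigma)
        y, x = np.unravel_index(np.argmax(blurred), blurred.shape)
        keypoints.append((x * scale_x, y * scale_y))  # original-image coordinates
    return keypoints
```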
In summary, the human body posture identification method provided by this embodiment performs posture identification mainly based on the feature map covering more comprehensive feature information, and can effectively improve the accuracy of human body posture identification.
Embodiment five:
Corresponding to the image feature fusion method provided in the second embodiment, an embodiment of the present invention provides an image feature fusion device; referring to the structural block diagram of the image feature fusion device shown in fig. 10, the device includes:
a feature map to be fused obtaining module 1002, configured to obtain a feature map to be fused;
the pixel shifting module 1004 is configured to perform a pixel shift operation on the feature map to be fused to obtain a plurality of offset feature maps;
and the feature fusion module 1006 is configured to perform a feature fusion operation on the plurality of offset feature maps and the feature map to be fused to obtain a fused feature map.
The image feature fusion device provided by this embodiment first performs a pixel shift operation on the feature map to be fused to obtain a plurality of offset feature maps, and then performs a feature fusion operation on the plurality of offset feature maps and the feature map to be fused to obtain a fused feature map. By performing the pixel shift operation and the feature fusion operation on the feature map, the device can effectively improve the comprehensiveness of the feature information covered by the feature map.
In an embodiment, the pixel shifting module 1004 is configured to shift each pixel of the feature map to be fused in a preset shift direction by a preset pixel offset amount, and keep the boundary pixel values of the feature map to be fused at their original values, to obtain an offset feature map corresponding to each preset shift direction.
The preset shift directions include a plurality of directions among up, down, left, right, upper left, lower left, upper right and lower right, and the preset pixel offset amount is at least one pixel.
In an embodiment, the feature fusion module 1006 is configured to perform a feature fusion operation on the plurality of offset feature maps and the feature map to be fused by using a weighted fusion algorithm to obtain a fused feature map; wherein the weights corresponding to different offset feature maps are the same or different, and the weight of the feature map to be fused is greater than the weight of each offset feature map.
The device provided in this embodiment has the same implementation principle and technical effects as the foregoing method embodiment; for brevity, for parts of this device embodiment that are not mentioned, reference may be made to the corresponding contents of the foregoing second method embodiment.
Embodiment six:
Corresponding to the feature map processing method provided in the third embodiment, an embodiment of the present invention provides a feature map processing apparatus; referring to the structural block diagram of the feature map processing apparatus shown in fig. 11, the apparatus includes:
a target feature map obtaining module 1102, configured to obtain a target feature map;
a first executing module 1104, configured to input the target feature map into a down-sampling network, perform a down-sampling operation based on the target feature map through the down-sampling network, and perform the image feature fusion method as provided in the second embodiment to obtain a thumbnail feature map;
a second executing module 1106, configured to input the thumbnail feature map into an upsampling network, execute the image feature fusion method provided in the second embodiment based on the thumbnail feature map through the upsampling network, and perform an upsampling operation to obtain a first enlarged feature map;
a feature map determining module 1108, configured to determine the first enlarged feature map as a feature map after the target feature map is processed.
The feature map processing device provided by the embodiment of the invention can firstly input the target feature map into a down-sampling network, and execute down-sampling operation and an image feature fusion method based on the target feature map through the down-sampling network to obtain a thumbnail feature map; and then inputting the thumbnail feature map into an up-sampling network, and executing an image feature fusion method and an up-sampling operation based on the thumbnail feature map through the up-sampling network to obtain a feature map of the processed target feature map. In the feature map processing apparatus provided in this embodiment, the image feature fusion method is performed in both the down-sampling and up-sampling processes of the target feature map, so that the processed feature map can include more comprehensive feature information.
In one embodiment, the down-sampling network comprises at least one down-sampling sub-network connected in sequence, each down-sampling sub-network comprising an intermediate down-sampling layer, a first offset layer and a first fusion layer connected in sequence; the output end of the last down-sampling sub-network is also connected with a terminal down-sampling layer; the input of the intermediate down-sampling layer is a first original feature map, and its output is an intermediate thumbnail feature map; the input of the first offset layer is the intermediate thumbnail feature map, and its output is a plurality of first offset feature maps; the input of the first fusion layer is the intermediate thumbnail feature map and the plurality of first offset feature maps, and its output is a first fused feature map; the input of the terminal down-sampling layer is the first fused feature map output by the last down-sampling sub-network, and its output is a terminal thumbnail feature map; when the down-sampling sub-network is the head-end down-sampling sub-network, the first original feature map is the target feature map; when the down-sampling sub-network is a non-head-end down-sampling sub-network, the first original feature map is the first fused feature map output by the previous-stage down-sampling sub-network.
In one embodiment, the first executing module 1104 is configured to, for each down-sampling sub-network, perform a down-sampling operation, through the intermediate down-sampling layer, on the first original feature map input to the down-sampling sub-network to obtain an intermediate thumbnail feature map; perform a pixel shift operation on the intermediate thumbnail feature map through the first offset layer to obtain a plurality of first offset feature maps; perform a feature fusion operation on the intermediate thumbnail feature map and the plurality of first offset feature maps through the first fusion layer to obtain a first fused feature map; and perform a down-sampling operation, through the terminal down-sampling layer, on the first fused feature map obtained by the last down-sampling sub-network to obtain a terminal thumbnail feature map.
In one embodiment, the up-sampling network comprises at least one up-sampling sub-network connected in sequence, and each up-sampling sub-network comprises a second offset layer, a second fusion layer, an up-sampling layer and a superposition layer connected in sequence; each superposition layer is correspondingly connected with the intermediate down-sampling layer of one down-sampling sub-network; the second offset layer of the head-end up-sampling sub-network is connected with the terminal down-sampling layer; the input of the second offset layer is a second original feature map, and its output is a plurality of second offset feature maps; the input of the second fusion layer is the second original feature map and the plurality of second offset feature maps, and its output is a second fused feature map; the input of the up-sampling layer is the second fused feature map, and its output is a second enlarged feature map; the input of the superposition layer is the second enlarged feature map and the intermediate thumbnail feature map, and its output is a first enlarged feature map; when the up-sampling sub-network is the head-end up-sampling sub-network, the second original feature map is the terminal thumbnail feature map output by the terminal down-sampling layer; when the up-sampling sub-network is a non-head-end up-sampling sub-network, the second original feature map is the first enlarged feature map output by the previous-stage up-sampling sub-network.
In one embodiment, the up-sampling network comprises a number of up-sampling sub-networks equal to the number of down-sampling sub-networks comprised by the down-sampling network; the second executing module 1106 is configured to, for each up-sampling sub-network, perform a pixel shift operation, through the second offset layer, on the second original feature map input to the up-sampling sub-network to obtain a plurality of second offset feature maps; perform a feature fusion operation on the second original feature map and the plurality of second offset feature maps through the second fusion layer to obtain a second fused feature map; perform an up-sampling operation on the second fused feature map through the up-sampling layer by using a bilinear interpolation algorithm to obtain a second enlarged feature map; and superpose the second enlarged feature map and the intermediate thumbnail feature map through the superposition layer to obtain a first enlarged feature map.
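Putting the two sub-network sketches from above together, the overall processing performed by the first and second executing modules might be wired up roughly as follows; the number of stages, the uniform stride-2 down-sampling, and the element-wise skip superposition are all assumptions of this sketch, not the patent's exact configuration.

```python
import torch.nn as nn

class FeatureMapProcessor(nn.Module):
    """End-to-end sketch: down-sampling sub-networks, a terminal
    down-sampling layer, and matching up-sampling sub-networks."""

    def __init__(self, channels, num_stages=3):
        super().__init__()
        self.down_stages = nn.ModuleList(
            DownSampleSubNetwork(channels) for _ in range(num_stages))
        self.terminal_down = nn.Conv2d(channels, channels, kernel_size=3,
                                       stride=2, padding=1)
        self.up_stages = nn.ModuleList(
            UpSampleSubNetwork() for _ in range(num_stages))

    def forward(self, target_feature_map):
        x, thumbs = target_feature_map, []
        for stage in self.down_stages:
            x, thumb = stage(x)
            thumbs.append(thumb)           # kept for the superposition layers
        x = self.terminal_down(x)          # terminal thumbnail feature map
        for stage, thumb in zip(self.up_stages, reversed(thumbs)):
            x = stage(x, thumb)
        return x                           # first enlarged feature map
```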
The apparatus provided in this embodiment has the same implementation principle and technical effects as the foregoing embodiments; for brevity, for parts of this apparatus embodiment that are not mentioned, reference may be made to the corresponding contents of the third method embodiment.
Embodiment seven:
Corresponding to the gesture recognition method provided in the fourth embodiment, an embodiment of the present invention provides a gesture recognition apparatus; referring to the structural block diagram of the gesture recognition apparatus shown in fig. 12, the apparatus includes:
an image acquisition module 1202 for acquiring a whole-body image of the target object.
A feature map generation module 1204, configured to generate a target feature map based on the whole-body image.
A first enlarged feature map generating module 1206, configured to generate a first enlarged feature map of the target feature map by using the feature map processing method according to the third embodiment.
A gesture recognition module 1208 for recognizing a gesture of the target object based on the first enlarged feature map.
The human body posture recognition apparatus provided by this embodiment first acquires a whole-body image of a target object, generates a target feature map based on the whole-body image, then generates a first enlarged feature map of the target feature map by using the feature map processing method, and further recognizes the posture of the target object based on the first enlarged feature map. Because the posture recognition is based on a feature map covering more comprehensive feature information, the apparatus can effectively improve the accuracy of human body posture recognition.
In an implementation manner, the image obtaining module 1202 is further configured to obtain an image to be identified; and detecting a target object contained in the image to be recognized, and intercepting a whole-body image of the target object.
In one implementation, the above-mentioned feature map generation module 1204 is further configured to scale the whole-body image to a specified size; and performing feature extraction on the zoomed whole-body image to obtain a target feature map.
In one implementation, the gesture recognition module 1208 is further configured to perform a convolution operation on the first enlarged feature map to generate thermodynamic diagrams of the target feature map, and to recognize the posture of the target object based on the thermodynamic diagrams.
In an implementation manner, the gesture recognition module 1208 is further configured to acquire regions of a plurality of key points to be detected in the target feature map, the key points to be detected comprising joint points of the target object and/or preset points on a designated part of the target object; and to perform a convolution operation on the first enlarged feature map based on the regions of the key points to be detected to obtain a plurality of thermodynamic diagrams, where each key point to be detected corresponds to one thermodynamic diagram.
In one implementation, the gesture recognition module 1208 is further configured to: for each thermodynamic diagram, calculating the response value of a pixel point contained in the thermodynamic diagram by adopting a Gaussian fuzzy method, and taking the pixel point with the maximum response value as a key point to be detected corresponding to the thermodynamic diagram; and recognizing the posture of the target object according to the key points to be detected corresponding to each thermodynamic diagram.
In one implementation, the gesture recognition module 1208 is further configured to: acquiring thermodynamic diagram coordinates of the key points to be detected corresponding to each thermodynamic diagram; determining the original coordinates of each key point to be detected in the image to be identified based on a preset mapping relation and thermodynamic diagram coordinates of each key point to be detected; and recognizing the posture of the target object according to the original coordinates of the key points to be detected in the image to be recognized.
The apparatus provided in this embodiment has the same implementation principle and technical effects as the fourth embodiment; for brevity, for parts of this apparatus embodiment that are not mentioned, reference may be made to the corresponding contents of the fourth method embodiment.
Embodiment eight:
the present embodiment provides a gesture recognition system, including: the device comprises an image acquisition device, a processor and a storage device; an image acquisition device for acquiring a whole-body image of a target object; the storage device has stored thereon a computer program which, when executed by a processor, performs a gesture recognition method.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing embodiments, and is not described herein again.
Further, the present embodiment provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for fusing image features according to the second embodiment is executed, or the method for processing a feature map according to the third embodiment is executed, or the steps of the method for recognizing a pose according to the fourth embodiment are executed.
The computer program product of the image feature fusion method, the feature map processing method, the gesture recognition device and the system provided by the embodiments of the present invention includes a computer readable storage medium storing a program code, and instructions included in the program code may be used to execute the method described in the foregoing method embodiments, and specific implementation may refer to the method embodiments, and will not be described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
Claims (21)
1. An image feature fusion method, comprising:
acquiring a feature map to be fused;
performing a pixel shift operation on the feature map to be fused to obtain a plurality of offset feature maps, wherein the pixel shift operation is performed on each pixel of the feature map to be fused according to a preset shift direction to obtain the plurality of offset feature maps;
and performing a feature fusion operation on the plurality of offset feature maps and the feature map to be fused to obtain a fused feature map.
2. The method according to claim 1, wherein the step of performing a pixel shift operation on the feature map to be fused to obtain a plurality of offset feature maps comprises:
shifting each pixel of the feature map to be fused in a preset shift direction by a preset pixel offset amount, and keeping the boundary pixel values of the feature map to be fused at their original values, to obtain an offset feature map corresponding to each preset shift direction.
3. The method according to claim 2, wherein the preset shift direction includes a plurality of directions among an up direction, a down direction, a left direction, a right direction, an upper left direction, a lower left direction, an upper right direction and a lower right direction, and the preset pixel offset amount is at least one pixel.
4. The method according to claim 1, wherein the step of performing a feature fusion operation on the plurality of offset feature maps and the feature map to be fused to obtain a fused feature map comprises:
performing the feature fusion operation on the plurality of offset feature maps and the feature map to be fused by using a weighted fusion algorithm to obtain the fused feature map; wherein the weights corresponding to different offset feature maps are the same or different, and the weight of the feature map to be fused is greater than the weight of each offset feature map.
5. A method for processing a feature map, comprising:
acquiring a target feature map to be processed;
inputting the target feature map into a down-sampling network, performing down-sampling operation based on the target feature map through the down-sampling network, and performing the image feature fusion method according to any one of claims 1 to 4 to obtain a thumbnail feature map;
inputting the thumbnail feature map into an up-sampling network, performing the image feature fusion method according to any one of claims 1 to 4 based on the thumbnail feature map through the up-sampling network, and performing an up-sampling operation to obtain a first enlarged feature map;
and determining the first enlarged feature map as the processed feature map of the target feature map.
6. The method of claim 5, wherein the down-sampling network comprises at least one down-sampling sub-network connected in sequence, each of the down-sampling sub-networks comprising an intermediate down-sampling layer, a first offset layer, and a first fusion layer connected in sequence; the output end of the last down-sampling sub-network is also connected with a terminal down-sampling layer;
the input of the middle down-sampling layer is a first original feature map, and the output of the middle down-sampling layer is a middle thumbnail feature map;
the input of the first offset layer is the intermediate thumbnail feature map, and the output of the first offset layer is a plurality of first offset feature maps;
the input of the first fusion layer is the intermediate thumbnail feature map and the plurality of first offset feature maps, and the output of the first fusion layer is a first fused feature map;
the input of the terminal down-sampling layer is a first fused feature map output by the last down-sampling sub-network, and the output of the terminal down-sampling layer is a terminal thumbnail feature map;
when the down-sampling sub-network is a head end down-sampling sub-network, the first original feature map is the target feature map; when the down-sampling subnetwork is a non-head end down-sampling subnetwork, the first original feature map is the first fused feature map output by a previous-stage down-sampling subnetwork.
7. The method according to claim 6, wherein the step of performing a downsampling operation based on the target feature map through the downsampling network and performing the image feature fusion method according to any one of claims 1 to 4 to obtain the thumbnail feature map comprises:
for each down-sampling sub-network, performing a down-sampling operation, through the intermediate down-sampling layer, on the first original feature map input to the down-sampling sub-network to obtain an intermediate thumbnail feature map; and,
performing a pixel shift operation on the intermediate thumbnail feature map through the first offset layer to obtain a plurality of first offset feature maps; and,
performing a feature fusion operation on the intermediate thumbnail feature map and the plurality of first offset feature maps through the first fusion layer to obtain a first fused feature map;
and performing a down-sampling operation, through the terminal down-sampling layer, on the first fused feature map obtained by the last down-sampling sub-network to obtain the terminal thumbnail feature map.
8. The method of claim 6, wherein the up-sampling network comprises at least one up-sampling sub-network connected in sequence, each of the up-sampling sub-networks comprising a second offset layer, a second fusion layer, an up-sampling layer and a superposition layer connected in sequence; each superposition layer is correspondingly connected with the intermediate down-sampling layer of one down-sampling sub-network; the second offset layer of the up-sampling sub-network at the head end is connected with the terminal down-sampling layer; wherein,
the input of the second offset layer is a second original feature map, and the output of the second offset layer is a plurality of second offset feature maps;
the input of the second fusion layer is the second original feature map and the plurality of second offset feature maps, and the output of the second fusion layer is a second fused feature map;
the input of the up-sampling layer is the second fused feature map, and the output of the up-sampling layer is a second enlarged feature map;
the input of the superposition layer is the second enlarged feature map and the intermediate thumbnail feature map, and the output of the superposition layer is the first enlarged feature map;
when the up-sampling sub-network is the up-sampling sub-network at the head end, the second original feature map is the terminal thumbnail feature map output by the terminal down-sampling layer; and when the up-sampling sub-network is a non-head end up-sampling sub-network, the second original feature map is the first enlarged feature map output by a previous-stage up-sampling sub-network.
9. The method of claim 8, wherein the upsampling network comprises a number of upsampling sub-networks equal to a number of downsampling sub-networks comprised by the downsampling network; the steps of performing the image feature fusion method according to any one of claims 1 to 4 based on the thumbnail feature map through the up-sampling network and performing an up-sampling operation to obtain a first enlarged feature map include:
for each up-sampling sub-network, performing a pixel shift operation, through the second offset layer, on the second original feature map input to the up-sampling sub-network to obtain a plurality of second offset feature maps; and,
performing a feature fusion operation on the second original feature map and the plurality of second offset feature maps through the second fusion layer to obtain a second fused feature map; and,
performing an up-sampling operation on the second fused feature map through the up-sampling layer by using a bilinear interpolation algorithm to obtain a second enlarged feature map;
and superposing the second enlarged feature map and the intermediate thumbnail feature map through the superposition layer to obtain the first enlarged feature map.
10. A gesture recognition method, comprising:
acquiring a whole-body image of a target object;
generating a target feature map based on the whole-body image;
generating a first enlarged feature map of the target feature map using the feature map processing method of any one of claims 5 to 9;
recognizing the posture of the target object based on the first enlarged feature map.
11. The method of claim 10, wherein the step of acquiring a whole-body image of the target subject comprises:
acquiring an image to be identified;
and detecting a target object contained in the image to be identified, and intercepting a whole-body image of the target object.
12. The method of claim 11, wherein the step of generating a target feature map based on the whole-body image comprises:
scaling the whole-body image to a specified size;
and performing feature extraction on the zoomed whole-body image to obtain a target feature map.
13. The method of claim 11, wherein the step of recognizing the posture of the target object based on the first enlarged feature map comprises:
performing a convolution operation on the first enlarged feature map to generate a thermodynamic diagram of the target feature map;
and recognizing the posture of the target object based on the thermodynamic diagram.
14. The method of claim 13, wherein the step of performing a convolution operation on the first enlarged feature map to generate a thermodynamic diagram of the target feature map comprises:
acquiring regions of a plurality of key points to be detected in the target feature map; wherein the key points to be detected comprise joint points of the target object and/or preset points on a designated part of the target object;
and performing a convolution operation on the first enlarged feature map based on the regions of the key points to be detected to obtain a plurality of thermodynamic diagrams; wherein each key point to be detected corresponds to one thermodynamic diagram.
15. The method of claim 14, wherein the step of recognizing the posture of the target object based on the thermodynamic diagram comprises:
for each thermodynamic diagram, calculating the response values of the pixel points contained in the thermodynamic diagram by using a Gaussian blur method, and taking the pixel point with the maximum response value as the key point to be detected corresponding to the thermodynamic diagram;
and identifying the posture of the target object according to the key points to be detected corresponding to each thermodynamic diagram.
16. The method according to claim 15, wherein the step of recognizing the posture of the target object according to the key points to be detected corresponding to each thermodynamic diagram comprises:
acquiring thermodynamic diagram coordinates of the key points to be detected corresponding to each thermodynamic diagram;
determining the original coordinates of each key point to be detected in the image to be identified based on a preset mapping relation and thermodynamic diagram coordinates of each key point to be detected;
and recognizing the posture of the target object according to the original coordinates of the key points to be detected in the image to be recognized.
17. An image feature fusion apparatus, comprising:
the to-be-fused feature map acquisition module is used for acquiring a to-be-fused feature map;
the pixel shifting module is used for performing a pixel shift operation on the feature map to be fused to obtain a plurality of offset feature maps, wherein the pixel shift operation is performed on each pixel of the feature map to be fused according to a preset shift direction to obtain the plurality of offset feature maps;
and the feature fusion module is used for performing a feature fusion operation on the plurality of offset feature maps and the feature map to be fused to obtain a fused feature map.
18. A feature map processing apparatus, comprising:
the target feature map acquisition module is used for acquiring a target feature map;
a first executing module, configured to input the target feature map into a down-sampling network, perform a down-sampling operation based on the target feature map through the down-sampling network, and perform the image feature fusion method according to any one of claims 1 to 4 to obtain a thumbnail feature map;
a second execution module, configured to input the thumbnail feature map into an upsampling network, execute the image feature fusion method according to any one of claims 1 to 4 based on the thumbnail feature map through the upsampling network, and execute an upsampling operation to obtain a first enlarged feature map;
and the feature map determining module is used for determining the first enlarged feature map as the processed feature map of the target feature map.
19. A gesture recognition apparatus, comprising:
the image acquisition module is used for acquiring a whole body image of the target object;
a feature map generation module for generating a target feature map based on the whole-body image;
a first enlarged feature map generation module for generating a first enlarged feature map of the target feature map by using the feature map processing method according to any one of claims 5 to 9;
and the gesture recognition module is used for recognizing the posture of the target object based on the first enlarged feature map.
20. A gesture recognition system, the system comprising: the device comprises an image acquisition device, a processor and a storage device;
the image acquisition device is used for acquiring a whole body image of the target object;
the storage device has stored thereon a computer program which, when executed by the processor, performs the gesture recognition method of any one of the preceding claims 10 to 16.
21. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the image feature fusion method according to any one of the preceding claims 1 to 4 or the steps of the feature map processing method according to any one of the preceding claims 5 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811608178.9A CN109657729B (en) | 2018-12-26 | 2018-12-26 | Image feature fusion, feature map processing and gesture recognition method, device and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109657729A CN109657729A (en) | 2019-04-19 |
CN109657729B true CN109657729B (en) | 2021-05-07 |
Family
ID=66117092
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811608178.9A Active CN109657729B (en) | 2018-12-26 | 2018-12-26 | Image feature fusion, feature map processing and gesture recognition method, device and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109657729B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110532891B (en) * | 2019-08-05 | 2022-04-05 | 北京地平线机器人技术研发有限公司 | Target object state identification method, device, medium and equipment |
CN110807379B (en) * | 2019-10-21 | 2024-08-27 | 腾讯科技(深圳)有限公司 | A semantic recognition method, device, and computer storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103985130A (en) * | 2014-05-27 | 2014-08-13 | 华东理工大学 | Image significance analysis method for complex texture images |
CN104123417A (en) * | 2014-07-22 | 2014-10-29 | 上海交通大学 | Image segmentation method based on cluster ensemble |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10785463B2 (en) * | 2013-07-16 | 2020-09-22 | Texas Instruments Incorporated | Super-resolution in structured light imaging |
TWI531852B (en) * | 2014-09-16 | 2016-05-01 | 聚晶半導體股份有限公司 | Device of capturing images and method of digital focusing |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103985130A (en) * | 2014-05-27 | 2014-08-13 | 华东理工大学 | Image significance analysis method for complex texture images |
CN104123417A (en) * | 2014-07-22 | 2014-10-29 | 上海交通大学 | Image segmentation method based on cluster ensemble |
Non-Patent Citations (4)
- "Deep learning for pixel-level image fusion: Recent advances and future prospects"; Yu Liu et al.; Information Fusion; 2017-11-11; pp. 158-173 *
- "Moving object tracking based on mean shift algorithm and features fusion"; Mahnaz Janipoor Deilamani et al.; 2011 International Symposium on Artificial Intelligence and Signal Processing (AISP); 2011-07-31; pp. 48-53 *
- "Research on high-resolution image reconstruction from multiple images based on half-pixel misalignment" (基于半像素错位的多幅图像重建高分辨率图像技术研究); Huang Zhanhua et al.; Optical Technique (光学技术); 2002-09-30; vol. 28, no. 5; pp. 389-391, 394 *
- "Medical image fusion algorithm combining the second-generation Curvelet transform and pixel energy feature contrast" (第二代Curvelet变换与像素能量特征对比度结合的医学图像算法融合算法); Dai Yin et al.; Journal of Chinese Computer Systems (小型微型计算机系统); 2015-12-31; no. 12; pp. 2760-2762 *
Also Published As
Publication number | Publication date |
---|---|
CN109657729A (en) | 2019-04-19 |
Similar Documents
Publication | Title
---|---
CN109376667B (en) | Target detection method, device and electronic device
CN108520247B (en) | Method, device, terminal and readable medium for identifying object nodes in images
CN111968064B (en) | Image processing method and device, electronic equipment and storage medium
CN108833785B (en) | Fusion method and device of multi-view images, computer equipment and storage medium
CN107808111B (en) | Method and apparatus for pedestrian detection and attitude estimation
CN106447721B (en) | Image shadow detection method and device
CN111914756A (en) | Video data processing method and device
CN114372931A (en) | A target object blurring method, device, storage medium and electronic device
CN115565154A (en) | Feasible region prediction method, device, system and storage medium
CN114586059B (en) | Image registration method, terminal and computer storage medium
CN109657729B (en) | Image feature fusion, feature map processing and gesture recognition method, device and system
CN112418243B (en) | Feature extraction method, device and electronic equipment
CN112435223B (en) | Target detection method, device and storage medium
CN104917963A (en) | Image processing method and terminal
CN115830073A (en) | Map element reconstruction method, map element reconstruction device, computer equipment and storage medium
CN110827194A (en) | Image processing method, device and computer storage medium
CN119672766A (en) | Method, storage medium, electronic device and product for detecting hand joints in eyewear equipment
CN113034582B (en) | Position optimization device and method, electronic device and computer-readable storage medium
JP7349290B2 (en) | Object recognition device, object recognition method, and object recognition program
JP2020008916A (en) | Object detection device, object detection program, object detection method, and learning device
JP2018092507A (en) | Image processing apparatus, image processing method, and program
JP2021081790A (en) | Recognition device and recognition method
JP7523312B2 (en) | Object recognition device, object recognition system, learning method for object recognition device, object recognition method for object recognition device, learning program for object recognition device, and object recognition program for object recognition device
CN113936203B (en) | Oil tank layout information generation method, device, electronic device, and medium
CN114663280B (en) | Super-resolution reconstruction model of long-distance iris image, training method, reconstruction method, device and medium
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant